【NLP】How to Generate Embeddings?
How to represent words.
0.
Naive representation: one-hot vectors
Dimension: |all words|
(too large, and unable to express semantic similarity)
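A minimal sketch of why one-hot vectors cannot express similarity, assuming a toy three-word vocabulary (the words and helper names are made up for illustration): the inner product of any two distinct words is always 0.

```python
# Toy vocabulary (hypothetical); a real vocabulary has far more entries.
vocab = ["cat", "dog", "car"]
index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Return a vector of length |vocab| with a single 1 at the word's index."""
    vec = [0] * len(vocab)
    vec[index[word]] = 1
    return vec

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

print(one_hot("cat"))                       # [1, 0, 0]
print(dot(one_hot("cat"), one_hot("dog")))  # 0
print(dot(one_hot("cat"), one_hot("car")))  # 0
```

So "cat" comes out no closer to "dog" than to "car", and the vector length grows with the vocabulary.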
Idea: produce dense vector representations based on the context/use of words.
There are three main approaches:
1.
Count-based methods
(1) Define a basis vocabulary C of context words, with fewer dimensions than the full vocabulary (excluding function words such as the, a, of…)
(2) Define a word window size W
(3) Count the basis vocabulary words occurring W words to the left or right of each instance of a target word in the corpus
(4) Form a vector representation of the target word based on these counts
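Steps (1) to (4) can be sketched as follows. This is only an illustration: the corpus, the basis words and the function name `count_vectors` are made up, and a real setup would choose the basis from corpus statistics.

```python
from collections import Counter

def count_vectors(tokens, basis, window):
    """Map each target word to counts of basis words within `window` positions."""
    counts = {}
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i and tokens[j] in basis:
                counts.setdefault(target, Counter())[tokens[j]] += 1
    # Fix the order of the basis so every vector has the same layout.
    order = sorted(basis)
    return {t: [c[b] for b in order] for t, c in counts.items()}, order

tokens = "the cute kitten purred and the cute puppy barked".split()
basis = {"cute", "purred", "barked"}          # step (1), hypothetical choice
vectors, order = count_vectors(tokens, basis, window=2)  # steps (2) to (4)

print(order)              # ['barked', 'cute', 'purred']
print(vectors["kitten"])  # [0, 1, 1]
print(vectors["puppy"])   # [1, 1, 0]
```

Words that never occur near a basis word get no entry in this sketch; a fuller version would give them an all-zero vector.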
Example:
We can calculate the similarity of two words using the inner product or the cosine of their vectors.
For instance:
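A minimal sketch of both measures, using made-up count vectors over the same three basis words. The cosine divides out vector length, so a word that is simply more frequent does not look more similar.

```python
import math

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    """Inner product normalised by the lengths of both vectors."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

# Hypothetical count vectors.
kitten = [0, 1, 1]
puppy = [1, 1, 0]
puppy_frequent = [10, 10, 0]   # same direction as puppy, 10x the counts

# Inner product grows with the counts; cosine stays (approximately) the same.
print(dot(kitten, puppy), dot(kitten, puppy_frequent))
print(cosine(kitten, puppy), cosine(kitten, puppy_frequent))
```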
2.
Neural Embedding Models (Main Idea)
The goal is to generate an embedding matrix E in R^(|all words| × |context words|), which looks like:
(count-based vectors)
Rows are word vectors.
We can retrieve a particular word vector by multiplying the matrix by that word's one-hot vector.
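A small sketch of that lookup, with a made-up 3 × 2 matrix: multiplying a one-hot row vector by the embedding matrix selects exactly one row.

```python
# Hypothetical embedding matrix: 3 words, 2 dimensions each.
E = [
    [0.1, 0.3],
    [0.7, -0.2],
    [0.0, 0.5],
]

def lookup(one_hot, E):
    """Row vector times matrix: result[j] = sum_i one_hot[i] * E[i][j]."""
    return [
        sum(one_hot[i] * E[i][j] for i in range(len(E)))
        for j in range(len(E[0]))
    ]

# The one-hot vector for word 1 picks out row 1 of E.
print(lookup([0, 1, 0], E))   # [0.7, -0.2]
```

In practice the multiplication is never carried out; the row is simply indexed, which gives the same result.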
One generic idea behind embedding learning:
(1) Collect instances ti ∈ inst(t) of a word t of vocab V
(2) For each instance, collect its context words c(ti) (e.g. a k-word window)
(3) Define some score function score(ti,c(ti),θ,E) with upper bound on output
(4) Define a loss
(5) Estimate:
(6) Use the estimated E as the embedding matrix
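The formulas for steps (4) and (5) are not shown above. One way to write them, assuming the loss is simply the negated sum of scores over all instances (an assumption, not necessarily the formulation intended here):

```latex
L = -\sum_{t \in V} \; \sum_{t_i \in \mathrm{inst}(t)} \mathrm{score}\big(t_i, c(t_i); \theta, E\big)
\qquad
\hat{\theta}, \hat{E} = \operatorname*{arg\,min}_{\theta,\, E} \; L
```

Under this form, the upper bound required in step (3) matters: without it the loss could be pushed toward minus infinity just by inflating the scores.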
Note:
The scoring function estimates whether a sentence (or a target word together with its context) is something a person would naturally say, so the higher the score, the more likely it is.
3.
C&W
First, we embed all words in a sentence with E.
Then the sentence (w1, w2, w3, w4, w5) goes through a convolution layer (or maybe just a simple connection layer).
Then it goes through a simple MLP.
Then it goes through the 'scorer' layer, which outputs the final score.
Minimize the loss function (!), and use the parameter matrix of the input layer and ..
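A rough sketch of the forward pass described above, with no training. All sizes and weight names are hypothetical, the first layer is a plain fully connected projection rather than a convolution, and the loss is assumed to be a pairwise hinge loss that asks a real window to outscore a corrupted one by a margin (the ranking objective used by Collobert and Weston).

```python
import math
import random

random.seed(0)
VOCAB, DIM, WINDOW, HIDDEN = 10, 4, 5, 3   # hypothetical sizes

def rand_matrix(rows, cols):
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

E = rand_matrix(VOCAB, DIM)               # embedding matrix, one row per word
W1 = rand_matrix(HIDDEN, WINDOW * DIM)    # projection over the concatenated window
W2 = rand_matrix(HIDDEN, HIDDEN)          # hidden MLP layer
w_out = [random.uniform(-0.5, 0.5) for _ in range(HIDDEN)]  # scorer layer

def matvec(M, x):
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]

def score(word_ids):
    x = [v for w in word_ids for v in E[w]]        # embed and concatenate
    h1 = [math.tanh(a) for a in matvec(W1, x)]     # connection layer
    h2 = [math.tanh(a) for a in matvec(W2, h1)]    # MLP layer
    return sum(w * h for w, h in zip(w_out, h2))   # scalar score

def hinge_loss(real, corrupted, margin=1.0):
    """Zero once the real window outscores the corrupted one by `margin`."""
    return max(0.0, margin - score(real) + score(corrupted))

real = [1, 2, 3, 4, 5]
# Corrupt the window by replacing the middle word with a random word id.
corrupted = real[:2] + [random.randrange(VOCAB)] + real[3:]
print(hinge_loss(real, corrupted))
```

Training would adjust E, W1, W2 and w_out by gradient descent on this loss; that part is left out here.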
4. Word2Vec
1) CBoW (continuous bag of words)
2) Skip-gram:
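The two variants differ in the direction of prediction: CBoW predicts the target word from its context words, while Skip-gram predicts each context word from the target word. A minimal sketch of the training examples each one would see, assuming a simple symmetric window (the function names are made up for illustration):

```python
def context(tokens, i, window):
    """Words within `window` positions to the left and right of position i."""
    return tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]

def cbow_examples(tokens, window):
    """(context words, target) pairs: the whole context predicts the target."""
    return [(context(tokens, i, window), tokens[i]) for i in range(len(tokens))]

def skipgram_examples(tokens, window):
    """(target, context word) pairs: the target predicts each context word."""
    return [
        (tokens[i], c)
        for i in range(len(tokens))
        for c in context(tokens, i, window)
    ]

tokens = "the cat sat down".split()
print(cbow_examples(tokens, 1))
print(skipgram_examples(tokens, 1))
```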