Bag-of-words Model

Previous state-of-the-art document representations were based on the bag-of-words model, which represents each input document as a fixed-length vector. For example, borrowing from the Wikipedia article, the two documents
(1) John likes to watch movies. Mary likes movies too.
(2) John also likes to watch football games.
are used to construct a list of 10 distinct words
["John", "likes", "to", "watch", "movies", "Mary", "too", "also", "football", "games"]
so each document can then be represented as a fixed-length vector whose elements are the frequencies of the corresponding words in the list:
(1) [1, 2, 1, 1, 2, 1, 1, 0, 0, 0]
(2) [1, 1, 1, 1, 0, 0, 0, 1, 1, 1]
Bag-of-words models are surprisingly effective, but they lose all information about word order. Bag-of-n-grams models count word phrases of length n instead of single words, which captures local word order, but they suffer from data sparsity and high dimensionality.
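As a concrete illustration, here is a minimal sketch of the counts above using scikit-learn's CountVectorizer (an illustrative addition, not part of the original post; note that CountVectorizer lowercases tokens and orders its vocabulary alphabetically, so the columns come out in a different order than the hand-built list):

from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "John likes to watch movies. Mary likes movies too.",
    "John also likes to watch football games.",
]

# Unigram bag-of-words: one column per vocabulary word.
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)
print(vectorizer.get_feature_names_out())  # scikit-learn >= 1.0
print(counts.toarray())

# Bag of n-grams (unigrams + bigrams here): captures local word order
# but enlarges the vocabulary, illustrating the sparsity problem.
bigram_vectorizer = CountVectorizer(ngram_range=(1, 2))
print(bigram_vectorizer.fit_transform(docs).shape)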


A detailed explanation of the mathematics behind word2vec

Writing word2vec yourself (part 1): main concepts and workflow

1. Sparse vectors, also known as one-hot representations.
2. Dense vectors, also known as distributed representations.

What word2vec does is actually quite simple. Roughly speaking, it builds a shallow neural network, extracts corresponding input/output pairs from the given text, continually adjusts the network's parameters during training, and finally reads the word vectors off the trained network. word2vec adopts an n-gram-style assumption: a word depends only on the n words around it and is independent of the rest of the text. Such models are simple and direct to build, and various smoothing methods have followed. In the CBOW model, the input is the sum of the word vectors of the n words surrounding a word A, and the output is the word vector of A itself; in the skip-gram model, the input is A itself, and the output is the word vectors of the n surrounding words (iterating over all n of them).
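To make the CBOW/skip-gram distinction concrete, here is a minimal gensim sketch on a made-up toy corpus (the corpus and the query word are illustrative assumptions; in gensim, sg=0 selects CBOW and sg=1 selects skip-gram):

from gensim.models import Word2Vec

toy_corpus = [
    ["john", "likes", "to", "watch", "movies"],
    ["mary", "likes", "movies", "too"],
    ["john", "also", "likes", "football", "games"],
]

# CBOW: predict the center word from the sum/average of its context.
cbow = Word2Vec(toy_corpus, sg=0, min_count=1, window=2)
# Skip-gram: predict each context word from the center word.
skipgram = Word2Vec(toy_corpus, sg=1, min_count=1, window=2)

print(cbow.wv['movies'][:5])      # first few dimensions of one vector
print(skipgram.wv['movies'][:5])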

TensorFlow notes: implementing word2vec with tf

Implementing Chinese sentiment polarity analysis with Python


#-*- coding:utf-8 -*-
from sklearn.datasets import fetch_20newsgroups
from bs4 import BeautifulSoup
import nltk
import re
from gensim.models import word2vec
news = fetch_20newsgroups(subset='all')
X, y = news.data, news.target

def news_to_sentences(news):
    news_text = BeautifulSoup(news, 'html.parser').get_text()

    # split the article into sentences (requires the nltk punkt data,
    # available via nltk.download('punkt'))
    tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
    raw_sentences = tokenizer.tokenize(news_text)

    # split each sentence into lowercase words, keeping letters only
    sentences = []
    for sent in raw_sentences:
        sentences.append(re.sub('[^a-zA-Z]', ' ', sent.lower().strip()).split())

    return sentences

sentences = []
for x in X:
    sentences += news_to_sentences(x)

# Set values for various parameters
num_features = 300    # Word vector dimensionality
min_word_count = 20   # Minimum word count
num_workers = 2     # Number of threads to run in parallel
context = 5        # Context window size
downsampling = 1e-3   # Downsample setting for frequent words

model = word2vec.Word2Vec(sentences, workers=num_workers,
                          size=num_features, min_count=min_word_count,
                          window=context, sample=downsampling)
# Precompute the L2-normalized vectors and discard the raw ones to save
# memory (note: gensim >= 4.0 renames `size` to `vector_size` and
# deprecates init_sims()).
model.init_sims(replace=True)
print(model.wv.most_similar('morning'))
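Beyond most_similar, the trained vectors support other standard gensim queries; a small follow-up sketch (the query words are arbitrary examples):

# cosine similarity between two words
print(model.wv.similarity('morning', 'evening'))
# pick the word that does not belong with the others
print(model.wv.doesnt_match(['morning', 'evening', 'football']))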

