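I have two directories of text files that I want to read and label, but I don't know how to do this with TaggedDocument. I assumed it would work as TaggedDocument([strings], [labels]), but that evidently does not work.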

from gensim import models
from gensim.models.doc2vec import TaggedDocument
import utilities as util
import os
from sklearn import svm
from nltk.tokenize import sent_tokenize
CogPath = "./FixedCog/"
NotCogPath = "./FixedNotCog/"
SamplePath ="./Sample/"
docs = []
tags = []
CogList = [p for p in os.listdir(CogPath) if p.endswith('.txt')]
NotCogList = [p for p in os.listdir(NotCogPath) if p.endswith('.txt')]
SampleList = [p for p in os.listdir(SamplePath) if p.endswith('.txt')]
for doc in CogList:
    str = open(CogPath+doc,'r').read().decode("utf-8")
    docs.append(str)
    print docs
    tags.append(doc)
    print "###########"
    print tags
    print "!!!!!!!!!!!"
for doc in NotCogList:
    str = open(NotCogPath+doc,'r').read().decode("utf-8")
    docs.append(str)
    tags.append(doc)
for doc in SampleList:
    str = open(SamplePath + doc, 'r').read().decode("utf-8")
    docs.append(str)
    tags.append(doc)
T = TaggedDocument(docs,tags)
model = models.Doc2Vec(T,alpha=.025, min_alpha=.025, min_count=1,size=50)

The error:

Traceback (most recent call last):
  File "/home/farhood/PycharmProjects/word2vec_prj/doc2vec.py", line 34, in <module>
    model = models.Doc2Vec(T,alpha=.025, min_alpha=.025, min_count=1,size=50)
  File "/home/farhood/anaconda2/lib/python2.7/site-packages/gensim/models/doc2vec.py", line 635, in __init__
    self.build_vocab(documents, trim_rule=trim_rule)
  File "/home/farhood/anaconda2/lib/python2.7/site-packages/gensim/models/word2vec.py", line 544, in build_vocab
    self.scan_vocab(sentences, progress_per=progress_per, trim_rule=trim_rule)  # initial survey
  File "/home/farhood/anaconda2/lib/python2.7/site-packages/gensim/models/doc2vec.py", line 674, in scan_vocab
    if isinstance(document.words, string_types):
AttributeError: 'list' object has no attribute 'words'

So I ran a few tests and found this on GitHub:

class TaggedDocument(namedtuple('TaggedDocument', 'words tags')):
    """
    A single document, made up of `words` (a list of unicode string tokens)
    and `tags` (a list of tokens). Tags may be one or more unicode string
    tokens, but typical practice (which will also be most memory-efficient) is
    for the tags list to include a unique integer id as the only tag.

    Replaces "sentence as a list of words" from Word2Vec.
    """
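The definition quoted above also seems to explain the traceback. As far as I can tell, Doc2Vec iterates over whatever it is given and expects each item to have a `words` attribute, and iterating over a single namedtuple yields its two fields rather than documents. A minimal sketch of that behaviour, using a stand-in namedtuple with the same fields (the file names and texts are made up for illustration):

```python
from collections import namedtuple

# Stand-in with the same fields as the gensim class quoted above
TaggedDocument = namedtuple("TaggedDocument", "words tags")

# What the question's code builds: ONE object holding all texts and all tags
wrong = TaggedDocument(
    ["text of file one", "text of file two"],
    ["one.txt", "two.txt"],
)

# Iterating over a namedtuple yields its fields (two plain lists),
# and a plain list has no `words` attribute
items = list(wrong)
has_words = [hasattr(item, "words") for item in items]
print(has_words)  # [False, False]

# What is needed instead: a list with one TaggedDocument per document
right = [
    TaggedDocument("text of file one".split(), ["one.txt"]),
    TaggedDocument("text of file two".split(), ["two.txt"]),
]
print(all(hasattr(doc, "words") for doc in right))  # True
```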

So I changed the way I use TaggedDocument: I now create one TaggedDocument object per document. The important point is that the tags must be passed as a list.

for doc in CogList:
    str = open(CogPath+doc,'r').read().decode("utf-8")
    str_list = str.split()
    T = TaggedDocument(str_list,[doc])
    docs.append(T)
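For readers on Python 3, the same loop can be sketched as a small helper. This is only a sketch under assumptions: `load_tagged_documents` and the demo file `a.txt` are made-up names, the files are assumed to be UTF-8, and the fallback namedtuple simply mirrors the definition quoted from GitHub in case gensim is not installed.

```python
import os
import tempfile
from collections import namedtuple

try:
    from gensim.models.doc2vec import TaggedDocument
except ImportError:
    # Stand-in with the same fields as the gensim class
    TaggedDocument = namedtuple("TaggedDocument", "words tags")


def load_tagged_documents(directory):
    """Return one TaggedDocument per .txt file, tagged with its file name."""
    tagged = []
    for name in sorted(os.listdir(directory)):
        if not name.endswith(".txt"):
            continue
        path = os.path.join(directory, name)
        with open(path, encoding="utf-8") as handle:
            words = handle.read().split()  # naive whitespace tokenisation
        tagged.append(TaggedDocument(words=words, tags=[name]))
    return tagged


# Demo on a throwaway directory containing a single file
demo_dir = tempfile.mkdtemp()
with open(os.path.join(demo_dir, "a.txt"), "w", encoding="utf-8") as handle:
    handle.write("the quick brown fox")

demo_docs = load_tagged_documents(demo_dir)
print(demo_docs[0].words, demo_docs[0].tags)
```

Calling the helper once per directory (for example on CogPath, NotCogPath and SamplePath) and concatenating the results gives the list that Doc2Vec expects. If two directories can contain files with the same name, the tag would need the directory name prepended to stay unique.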

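The input to a Doc2Vec model should be a list of TaggedDocument objects, each of the form TaggedDocument(['list', 'of', 'words'], [tag]). A good practice is to use the index of the sentence as its tag. For example, to train a Doc2Vec model on two sentences (i.e. documents or paragraphs):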

import gensim
from gensim.models.doc2vec import TaggedDocument

s1 = 'the quick fox brown fox jumps over the lazy dog'
s1_tag = 0  # index of the sentence, used as its unique tag
s2 = 'i want to burn a zero-day'
s2_tag = 1
docs = []
docs.append(TaggedDocument(words=s1.split(), tags=[s1_tag]))
docs.append(TaggedDocument(words=s2.split(), tags=[s2_tag]))
# min_count=1 so that no word of this tiny corpus is discarded
model = gensim.models.Doc2Vec(vector_size=300, window=5, min_count=1, workers=4, epochs=20)
model.build_vocab(docs)
print('Start training process...')
model.train(docs, total_examples=model.corpus_count, epochs=model.epochs)
# save model
model_path = 'doc2vec.model'  # example path, adjust as needed
model.save(model_path)

You can use Gensim's common_texts as an example:

from gensim.test.utils import common_texts
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
documents = [TaggedDocument(doc, [i]) for i, doc in enumerate(common_texts)]
model = Doc2Vec(documents, vector_size=5, window=2, min_count=1, workers=4)
