How to use TaggedDocument in gensim

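I have two directories of text files that I want to read and label, but I don't know how to do this with TaggedDocument. I assumed it would work as TaggedDocument([strings], [labels]), but that apparently doesn't work.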

from gensim import models
from gensim.models.doc2vec import TaggedDocument
import utilities as util
import os
from sklearn import svm
from nltk.tokenize import sent_tokenize

CogPath = "./FixedCog/"
NotCogPath = "./FixedNotCog/"
SamplePath = "./Sample/"

docs = []
tags = []

CogList = [p for p in os.listdir(CogPath) if p.endswith('.txt')]
NotCogList = [p for p in os.listdir(NotCogPath) if p.endswith('.txt')]
SampleList = [p for p in os.listdir(SamplePath) if p.endswith('.txt')]

for doc in CogList:
    str = open(CogPath+doc,'r').read().decode("utf-8")
    docs.append(str)
    print docs
    tags.append(doc)
    print "###########"
    print tags
    print "!!!!!!!!!!!"

for doc in NotCogList:
    str = open(NotCogPath+doc,'r').read().decode("utf-8")
    docs.append(str)
    tags.append(doc)

for doc in SampleList:
    str = open(SamplePath + doc, 'r').read().decode("utf-8")
    docs.append(str)
    tags.append(doc)

T = TaggedDocument(docs,tags)
model = models.Doc2Vec(T,alpha=.025, min_alpha=.025, min_count=1,size=50)

Error:

Traceback (most recent call last):
  File "/home/farhood/PycharmProjects/word2vec_prj/doc2vec.py", line 34, in <module>
    model = models.Doc2Vec(T,alpha=.025, min_alpha=.025, min_count=1,size=50)
  File "/home/farhood/anaconda2/lib/python2.7/site-packages/gensim/models/doc2vec.py", line 635, in __init__
    self.build_vocab(documents, trim_rule=trim_rule)
  File "/home/farhood/anaconda2/lib/python2.7/site-packages/gensim/models/word2vec.py", line 544, in build_vocab
    self.scan_vocab(sentences, progress_per=progress_per, trim_rule=trim_rule)  # initial survey
  File "/home/farhood/anaconda2/lib/python2.7/site-packages/gensim/models/doc2vec.py", line 674, in scan_vocab
    if isinstance(document.words, string_types):
AttributeError: 'list' object has no attribute 'words'

So I did some testing and found this on GitHub:

class TaggedDocument(namedtuple('TaggedDocument', 'words tags')):
    """
    A single document, made up of `words` (a list of unicode string tokens)
    and `tags` (a list of tokens). Tags may be one or more unicode string
    tokens, but typical practice (which will also be most memory-efficient) is
    for the tags list to include a unique integer id as the only tag.
    Replaces "sentence as a list of words" from Word2Vec.
    """
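Since the definition quoted above is a namedtuple with the fields `words` and `tags`, the traceback can most likely be explained as follows: `TaggedDocument(docs, tags)` builds a single object, and iterating over a namedtuple yields its fields, so the model ends up looking at two plain lists. The sketch below reproduces that behavior with a stand-in namedtuple (an assumption made so the snippet runs without gensim installed; the sample texts and file names are made up):

```python
from collections import namedtuple

# Stand-in with the same two fields as the class quoted above.
TaggedDocument = namedtuple('TaggedDocument', 'words tags')

texts = ['first document text', 'second document text']
names = ['a.txt', 'b.txt']

# What the failing code did: ONE object holding all texts and all tags.
wrong = TaggedDocument(texts, names)

# Iterating over a namedtuple yields its fields, so the "documents"
# seen during iteration are two plain lists without a .words attribute.
seen_by_model = list(wrong)
print([type(item).__name__ for item in seen_by_model])  # ['list', 'list']

# One TaggedDocument per text, each with its own token list and tag list.
right = [TaggedDocument(text.split(), [name]) for text, name in zip(texts, names)]
print(right[0].words, right[0].tags)
```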

So I decided to change how I use TaggedDocument: I now create one TaggedDocument object per document. The important point is that the tags must be passed as a list.

for doc in CogList:
    str = open(CogPath+doc,'r').read().decode("utf-8")
    str_list = str.split()
    T = TaggedDocument(str_list,[doc])
    docs.append(T)
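The same idea can be sketched for Python 3 and for several directories at once. The helper name `read_tagged_docs`, the `label/filename` tag format, and the temporary sample files are assumptions made for illustration; the fallback namedtuple only exists so the snippet also runs where gensim is not installed.

```python
import os
import tempfile
from collections import namedtuple

try:
    from gensim.models.doc2vec import TaggedDocument
except ImportError:
    # Stand-in with the same fields, used only if gensim is unavailable.
    TaggedDocument = namedtuple('TaggedDocument', 'words tags')


def read_tagged_docs(directory, label):
    """Yield one TaggedDocument per .txt file, tagged as 'label/filename'."""
    for name in sorted(os.listdir(directory)):
        if not name.endswith('.txt'):
            continue
        with open(os.path.join(directory, name), encoding='utf-8') as fh:
            words = fh.read().split()  # naive whitespace tokenization
        # Prefixing the label keeps tags unique across directories.
        yield TaggedDocument(words=words, tags=[f'{label}/{name}'])


# Demo with temporary directories standing in for ./FixedCog/ and ./FixedNotCog/.
samples = [('FixedCog', 'cog1.txt', 'first sample text'),
           ('FixedNotCog', 'notcog1.txt', 'second sample text')]
docs = []
with tempfile.TemporaryDirectory() as root:
    for folder, filename, text in samples:
        os.makedirs(os.path.join(root, folder))
        with open(os.path.join(root, folder, filename), 'w', encoding='utf-8') as fh:
            fh.write(text)
    for folder in ('FixedCog', 'FixedNotCog'):
        docs.extend(read_tagged_docs(os.path.join(root, folder), folder))

for doc in docs:
    print(doc.tags, doc.words)
```

The resulting `docs` list is what would be handed to Doc2Vec, in place of the single TaggedDocument from the question.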

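The input to a Doc2Vec model should be a list of TaggedDocument objects, each of the form (['list', 'of', 'word'], [tag_]). A good practice is to use the index of the sentence as its tag. For example, to train a Doc2Vec model on two sentences (i.e. documents or paragraphs):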

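import gensim
from gensim.models.doc2vec import TaggedDocument

s1 = 'the quick fox brown fox jumps over the lazy dog'
s1_tag = '1'  # each document needs its own unique tag
s2 = 'i want to burn a zero-day'
s2_tag = '2'

docs = []
docs.append(TaggedDocument(words=s1.split(), tags=[s1_tag]))
docs.append(TaggedDocument(words=s2.split(), tags=[s2_tag]))

# min_count=1 keeps every word; with min_count=5 this tiny corpus would have an empty vocabulary
model = gensim.models.Doc2Vec(vector_size=300, window=5, min_count=1, workers=4, epochs=20)
model.build_vocab(docs)

print('Start training process...')
model.train(docs, total_examples=model.corpus_count, epochs=model.epochs)

# save model
model_path = 'doc2vec.model'  # example output path
model.save(model_path)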

You can use gensim's common_texts as an example:

from gensim.test.utils import common_texts
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

documents = [TaggedDocument(doc, [i]) for i, doc in enumerate(common_texts)]
model = Doc2Vec(documents, vector_size=5, window=2, min_count=1, workers=4)
