Package Contents
To train your own GloVe vectors, first prepare your corpus as a single text file with all words separated by a single space. If your corpus has multiple documents, concatenate them with a single space between documents. If your documents are particularly short, padding the gap between documents with, e.g., "dummy" words may produce better vectors. Once you have created your corpus, you can train GloVe vectors using the following tools. An example is included, which you can modify as necessary.

The four main tools in this package are:

1) vocab_count
This tool requires an input corpus that already consists of whitespace-separated tokens; use something like the Stanford Tokenizer first on raw text. It constructs unigram counts from the corpus and optionally thresholds the resulting vocabulary based on total vocabulary size or minimum frequency count.

2) cooccur
Constructs word-word cooccurrence statistics from a corpus. The user should supply a vocabulary file, as produced by vocab_count, and may specify a variety of parameters, as described by running ./build/cooccur.

3) shuffle
Shuffles the binary file of cooccurrence statistics produced by cooccur. Large files are automatically split into chunks, each of which is shuffled and stored on disk before being merged and shuffled together. The user may specify a number of parameters, as described by running ./build/shuffle.

4) glove
Trains the GloVe model on the specified cooccurrence data, which will typically be the output of the shuffle tool. The user should supply a vocabulary file, as produced by vocab_count, and may specify a number of other parameters, as described by running ./build/glove.
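
The corpus preparation step described at the start of this section (one text file, single spaces between words, documents concatenated with a single space, optional padding between short documents) might look roughly like the following Python sketch. The function name build_corpus, its parameters, and the "<pad>" token are hypothetical names chosen for illustration; they are not part of the GloVe package.

```python
def build_corpus(documents, pad_token=None, pad_count=0):
    """Join already-tokenized documents into one space-separated string.

    pad_token and pad_count are illustrative assumptions: when both are
    set, pad_count copies of pad_token are placed between documents.
    """
    # Collapse any run of whitespace inside a document to a single space.
    cleaned = [" ".join(doc.split()) for doc in documents]
    # Skip documents that are empty after normalization.
    cleaned = [doc for doc in cleaned if doc]
    if pad_token and pad_count > 0:
        separator = " " + " ".join([pad_token] * pad_count) + " "
    else:
        separator = " "
    return separator.join(cleaned)


docs = ["the cat sat", "dogs  bark\nloudly"]
corpus = build_corpus(docs)
padded = build_corpus(docs, pad_token="<pad>", pad_count=2)
```

The resulting string can be written to a single text file and passed to the tools above. Whether padding helps, and how many padding tokens to use, depends on the corpus and is worth checking empirically.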

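To make the thresholding performed by vocab_count concrete, here is a minimal Python sketch of unigram counting with a minimum frequency and a maximum vocabulary size. It illustrates the idea only and is not the tool's actual implementation; the function name count_vocab, its parameter names, and the alphabetical tie-breaking rule are assumptions.

```python
from collections import Counter


def count_vocab(corpus, min_count=1, max_vocab=None):
    """Return (word, count) pairs, most frequent first.

    min_count drops words seen fewer than min_count times;
    max_vocab keeps only the max_vocab most frequent words.
    """
    counts = Counter(corpus.split())
    # Sort by descending count; break ties alphabetically (an assumption,
    # made here only so the output is deterministic).
    items = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    items = [(word, n) for word, n in items if n >= min_count]
    if max_vocab is not None:
        items = items[:max_vocab]
    return items


vocab = count_vocab("a b a c a b", min_count=2)
top_one = count_vocab("a b a c a b", max_vocab=1)
```

With the toy corpus above, the word "c" occurs once and is dropped when min_count is 2, while max_vocab=1 keeps only the most frequent word.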


