Comparison of FastText and Word2Vec

 

Facebook Research open-sourced a great project yesterday - fastText, a fast (no surprise) and effective method to learn word representations and perform text classification. I was curious about comparing these embeddings to other commonly used embeddings, so word2vec seemed like the obvious choice, especially considering fastText embeddings are based upon word2vec.

 

Download data

In [ ]:
import nltk
nltk.download('brown')
# Only the brown corpus is needed in case you don't have it.
# Alternately, you can simply download the pretrained models below if you wish to avoid downloading and training.

# Generate brown corpus text file
with open('brown_corp.txt', 'w+') as f:
    for word in nltk.corpus.brown.words():
        f.write('{word} '.format(word=word))
In [ ]:
# download the text8 corpus (a 100 MB sample of cleaned wikipedia text)
# alternately, you can simply download the pretrained models below if you wish to avoid downloading and training
!wget http://mattmahoney.net/dc/text8.zip
!unzip text8.zip
In [ ]:
# download the file questions-words.txt to be used for comparing word embeddings
!wget https://raw.githubusercontent.com/arfon/word2vec/master/questions-words.txt
 

Train models

 

If you wish to avoid training, you can download pre-trained models instead in the next section. For training the fastText models yourself, you'll have to follow the setup instructions for fastText and run the training with -

In [ ]:
!./fasttext skipgram -input brown_corp.txt -output brown_ft
!./fasttext skipgram -input text8 -output text8_ft
 

For training the gensim models -

In [ ]:
from nltk.corpus import brown
from gensim.models import Word2Vec
from gensim.models.word2vec import Text8Corpus
import logging

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s')
logging.root.setLevel(level=logging.INFO)

MODELS_DIR = 'models/'

brown_gs = Word2Vec(brown.sents())
brown_gs.save_word2vec_format(MODELS_DIR + 'brown_gs.vec')

text8_gs = Word2Vec(Text8Corpus('text8'))
text8_gs.save_word2vec_format(MODELS_DIR + 'text8_gs.vec')
 

Download models

In case you wish to avoid downloading the corpus and training the models, you can download pretrained models with -

In [ ]:
# download the fastText and gensim models trained on the brown corpus and text8 corpus
!wget https://www.dropbox.com/s/4kray3epy439gca/models.tar.gz?dl=1 -O models.tar.gz
!tar -xzf models.tar.gz
 

Once you have downloaded or trained the models (make sure they're in the models/ directory, or that you've appropriately changed MODELS_DIR) and downloaded questions-words.txt, you're ready to run the comparison.

 

Comparisons

In [1]:
from gensim.models import Word2Vec

def print_accuracy(model, questions_file):
    print('Evaluating...\n')
    acc = model.accuracy(questions_file)
    for section in acc:
        correct = len(section['correct'])
        total = len(section['correct']) + len(section['incorrect'])
        total = total if total else 1
        accuracy = 100*float(correct)/total
        print('{:d}/{:d}, {:.2f}%, Section: {:s}'.format(correct, total, accuracy, section['section']))
    sem_correct = sum((len(acc[i]['correct']) for i in range(5)))
    sem_total = sum((len(acc[i]['correct']) + len(acc[i]['incorrect'])) for i in range(5))
    print('\nSemantic: {:d}/{:d}, Accuracy: {:.2f}%'.format(sem_correct, sem_total, 100*float(sem_correct)/sem_total))
    syn_correct = sum((len(acc[i]['correct']) for i in range(5, len(acc)-1)))
    syn_total = sum((len(acc[i]['correct']) + len(acc[i]['incorrect'])) for i in range(5, len(acc)-1))
    print('Syntactic: {:d}/{:d}, Accuracy: {:.2f}%\n'.format(syn_correct, syn_total, 100*float(syn_correct)/syn_total))

MODELS_DIR = 'models/'
word_analogies_file = 'questions-words.txt'

print('\nLoading FastText embeddings')
ft_model = Word2Vec.load_word2vec_format(MODELS_DIR + 'brown_ft.vec')
print('Accuracy for FastText:')
print_accuracy(ft_model, word_analogies_file)

print('\nLoading Gensim embeddings')
gs_model = Word2Vec.load_word2vec_format(MODELS_DIR + 'brown_gs.vec')
print('Accuracy for word2vec:')
print_accuracy(gs_model, word_analogies_file)
 
Loading FastText embeddings
Accuracy for FastText:
Evaluating...

0/1, 0.00%, Section: capital-common-countries
0/1, 0.00%, Section: capital-world
0/1, 0.00%, Section: currency
0/1, 0.00%, Section: city-in-state
27/182, 14.84%, Section: family
539/702, 76.78%, Section: gram1-adjective-to-adverb
106/132, 80.30%, Section: gram2-opposite
656/1056, 62.12%, Section: gram3-comparative
136/210, 64.76%, Section: gram4-superlative
439/650, 67.54%, Section: gram5-present-participle
0/1, 0.00%, Section: gram6-nationality-adjective
165/1260, 13.10%, Section: gram7-past-tense
327/552, 59.24%, Section: gram8-plural
245/342, 71.64%, Section: gram9-plural-verbs
2640/5086, 51.91%, Section: total

Semantic: 27/182, Accuracy: 14.84%
Syntactic: 2613/4904, Accuracy: 53.28%

Loading Gensim embeddings
Accuracy for word2vec:
Evaluating...

0/1, 0.00%, Section: capital-common-countries
0/1, 0.00%, Section: capital-world
0/1, 0.00%, Section: currency
0/1, 0.00%, Section: city-in-state
53/182, 29.12%, Section: family
8/702, 1.14%, Section: gram1-adjective-to-adverb
0/132, 0.00%, Section: gram2-opposite
75/1056, 7.10%, Section: gram3-comparative
0/210, 0.00%, Section: gram4-superlative
16/650, 2.46%, Section: gram5-present-participle
0/1, 0.00%, Section: gram6-nationality-adjective
30/1260, 2.38%, Section: gram7-past-tense
4/552, 0.72%, Section: gram8-plural
8/342, 2.34%, Section: gram9-plural-verbs
194/5086, 3.81%, Section: total

Semantic: 53/182, Accuracy: 29.12%
Syntactic: 141/4904, Accuracy: 2.88%
 

Word2vec embeddings seem to be slightly better than fastText embeddings at the semantic tasks, while the fastText embeddings do significantly better on the syntactic analogies. This makes sense, since fastText embeddings are trained to capture morphological nuances, and most of the syntactic analogies are morphology-based.

Let me explain that better.

According to the paper [1], the embedding for a word is represented by the sum of its character n-gram embeddings. This is meant to be useful for morphologically rich languages - so, theoretically, the embedding for "apparently" would include information from both the character n-grams "apparent" and "ly" (as well as other n-grams), and these n-grams would combine in a simple, linear manner. This is very similar to what most of our syntactic tasks look like.
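To make the n-gram idea concrete, here is a minimal sketch of fastText-style character n-gram extraction (the `<` and `>` boundary markers and the 3-to-6 n-gram lengths match fastText's defaults; the function name is my own):

```python
def char_ngrams(word, min_n=3, max_n=6):
    """All character n-grams of a word, with fastText-style boundary markers."""
    w = '<' + word + '>'
    return {w[i:i + n] for n in range(min_n, max_n + 1) for i in range(len(w) - n + 1)}

# The n-grams of 'apparently' include pieces of both 'apparent' and the 'ly' suffix:
grams = char_ngrams('apparently')
print('appar' in grams, 'ly>' in grams)  # -> True True
```

In the paper's model, a word's vector is the sum of the vectors of these n-grams (plus a vector for the word itself), which is why a suffix like "ly" contributes a roughly constant offset to every word that carries it.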

Example analogy:

amazing amazingly calm calmly

This analogy is marked correct if:

embedding(amazing) - embedding(amazingly) ≈ embedding(calm) - embedding(calmly)

Both these subtractions would leave behind a very similar set of remaining n-grams, so it's no surprise that the fastText embeddings do extremely well on this.
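For illustration, here's a toy sketch of that analogy test with made-up 3-d vectors (the vectors and helper names are invented; real evaluations, like gensim's accuracy method, search the whole vocabulary using cosine similarity):

```python
# Hypothetical toy vectors - NOT real embeddings.
vocab = {
    'amazing':   [1.0, 0.0, 0.0],
    'amazingly': [1.0, 1.0, 0.0],
    'calm':      [0.0, 0.0, 1.0],
    'calmly':    [0.0, 1.0, 1.0],
    'banana':    [0.5, 0.2, 0.8],
}

def predict(a, b, c):
    """Answer 'a : b :: c : ?' - the word nearest to vec(b) - vec(a) + vec(c)."""
    target = [vb - va + vc for va, vb, vc in zip(vocab[a], vocab[b], vocab[c])]
    candidates = (w for w in vocab if w not in (a, b, c))
    # Euclidean distance for simplicity; gensim uses cosine similarity.
    return min(candidates, key=lambda w: sum((x - y) ** 2 for x, y in zip(vocab[w], target)))

print(predict('amazing', 'amazingly', 'calm'))  # -> calmly
```

The analogy is counted as correct when the word returned by this nearest-neighbor search matches the expected fourth word.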

A brief note on hyperparameters - the Gensim word2vec implementation and the fastText word embedding implementation use largely the same defaults (dim_size = 100, window_size = 5, num_epochs = 5). Of course, they are two completely different models (albeit with a few similarities).

Let's try with a larger corpus now - text8 (a collection of Wikipedia articles). I'm especially curious about the impact on semantic accuracy - for models trained on the brown corpus, the difference in semantic accuracy, and the accuracy values themselves, are too small to be conclusive. Hopefully a larger corpus will help, and the text8 corpus likely has a lot more information about capitals, currencies, cities, etc., which should be relevant to the semantic tasks.

In [2]:
print('Loading FastText embeddings')
ft_model = Word2Vec.load_word2vec_format(MODELS_DIR + 'text8_ft.vec')
print('Accuracy for FastText:')
print_accuracy(ft_model, word_analogies_file)

print('Loading Gensim embeddings')
gs_model = Word2Vec.load_word2vec_format(MODELS_DIR + 'text8_gs.vec')
print('Accuracy for word2vec:')
print_accuracy(gs_model, word_analogies_file)
 
Loading FastText embeddings
Accuracy for FastText:
Evaluating...

298/506, 58.89%, Section: capital-common-countries
625/1452, 43.04%, Section: capital-world
37/268, 13.81%, Section: currency
291/1511, 19.26%, Section: city-in-state
151/306, 49.35%, Section: family
567/756, 75.00%, Section: gram1-adjective-to-adverb
188/306, 61.44%, Section: gram2-opposite
809/1260, 64.21%, Section: gram3-comparative
303/506, 59.88%, Section: gram4-superlative
528/992, 53.23%, Section: gram5-present-participle
1291/1371, 94.16%, Section: gram6-nationality-adjective
451/1332, 33.86%, Section: gram7-past-tense
853/992, 85.99%, Section: gram8-plural
360/650, 55.38%, Section: gram9-plural-verbs
6752/12208, 55.31%, Section: total

Semantic: 1402/4043, Accuracy: 34.68%
Syntactic: 5350/8165, Accuracy: 65.52%

Loading Gensim embeddings
Accuracy for word2vec:
Evaluating...

138/506, 27.27%, Section: capital-common-countries
248/1452, 17.08%, Section: capital-world
28/268, 10.45%, Section: currency
158/1571, 10.06%, Section: city-in-state
227/306, 74.18%, Section: family
85/756, 11.24%, Section: gram1-adjective-to-adverb
54/306, 17.65%, Section: gram2-opposite
739/1260, 58.65%, Section: gram3-comparative
178/506, 35.18%, Section: gram4-superlative
297/992, 29.94%, Section: gram5-present-participle
718/1371, 52.37%, Section: gram6-nationality-adjective
325/1332, 24.40%, Section: gram7-past-tense
389/992, 39.21%, Section: gram8-plural
200/650, 30.77%, Section: gram9-plural-verbs
3784/12268, 30.84%, Section: total

Semantic: 799/4103, Accuracy: 19.47%
Syntactic: 2985/8165, Accuracy: 36.56%
 

With the text8 corpus, the semantic accuracy for the fastText model increases significantly, and it surpasses word2vec on both the semantic and syntactic analogies. However, the increase in syntactic accuracy from the larger corpus is much higher for word2vec.

These preliminary results seem to indicate fastText embeddings might be better than word2vec at encoding semantic and especially syntactic information. It'd be interesting to see how transferable these embeddings are by comparing their performance in a downstream supervised task.

 

References

[1] P. Bojanowski, E. Grave, A. Joulin, T. Mikolov - Enriching Word Vectors with Subword Information, 2016.