Original post: http://www.one2know.cn/nlp23/

  • N-gram models

    Predict the word about to be typed from the preceding consecutive words, for example:



    If two consecutive words are extracted at a time, the model is called a bigram model.
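
    As a quick illustration (the sentence below is made up, not from the post), nltk.ngrams can produce bigrams or trigrams directly:

import nltk

words = "alice was beginning to get very tired of sitting".split()
print(list(nltk.ngrams(words, 2))[:3])  # bigrams:  [('alice', 'was'), ('was', 'beginning'), ('beginning', 'to')]
print(list(nltk.ngrams(words, 3))[:3])  # trigrams: [('alice', 'was', 'beginning'), ('was', 'beginning', 'to'), ('beginning', 'to', 'get')]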
  • Preparation

    The dataset is Alice in Wonderland.

    Extract N-grams from the raw text:
import nltk
import string

with open('alice_in_wonderland.txt', 'r') as content_file:
    content = content_file.read()

# Replace every punctuation character with a space, then collapse repeated whitespace
content2 = " ".join("".join([" " if ch in string.punctuation else ch for ch in content]).split())
tokens = nltk.word_tokenize(content2)
tokens = [word.lower() for word in tokens if len(word) >= 2]  # lowercase, drop 1-character tokens

N = 3
quads = list(nltk.ngrams(tokens, N))
"""
Return the ngrams generated from a sequence of items, as an iterator.
For example:
>>> from nltk.util import ngrams
>>> list(ngrams([1,2,3,4,5], 3))
[(1, 2, 3), (2, 3, 4), (3, 4, 5)]
"""

newl_app = []
for ln in quads:
    new1 = ' '.join(ln)  # join each trigram tuple back into a space-separated string
    newl_app.append(new1)

print(newl_app[:3])

Output:

['alice adventures in', 'adventures in wonderland', 'in wonderland alice']
  • How to implement it

    1. Preprocessing: convert words to word vectors.

    2. Model building and validation: a convergent-divergent model that maps inputs to outputs (the hidden layers first narrow and then widen again: 2559 → 1000 → 800 → 1000 → 2559).

    3. Prediction: predict the best next word.
  • Code
from __future__ import print_function

from sklearn.model_selection import train_test_split
import nltk
import numpy as np
import string

with open('alice_in_wonderland.txt', 'r') as content_file:
    content = content_file.read()

# Replace punctuation with spaces, collapse whitespace, tokenize, lowercase, drop 1-character tokens
content2 = " ".join("".join([" " if ch in string.punctuation else ch for ch in content]).split())
tokens = nltk.word_tokenize(content2)
tokens = [word.lower() for word in tokens if len(word) >= 2]

N = 3
quads = list(nltk.ngrams(tokens, N))
"""
Return the ngrams generated from a sequence of items, as an iterator.
For example:
>>> from nltk.util import ngrams
>>> list(ngrams([1,2,3,4,5], 3))
[(1, 2, 3), (2, 3, 4), (3, 4, 5)]
"""

newl_app = []
for ln in quads:
    new1 = ' '.join(ln)
    newl_app.append(new1)
# print(newl_app[:3])

# Vectorize the words
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()  # word => word vector
"""
>>> corpus = [
...     'This is the first document.',
...     'This document is the second document.',
...     'And this is the third one.',
...     'Is this the first document?',
... ]
>>> vectorizer = CountVectorizer()
>>> X = vectorizer.fit_transform(corpus)
>>> print(vectorizer.get_feature_names())
['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']
>>> print(X.toarray())  # doctest: +NORMALIZE_WHITESPACE
[[0 1 1 1 0 0 1 0 1]
 [0 2 0 1 0 1 1 0 1]
 [1 0 0 1 1 0 1 1 1]
 [0 1 1 1 0 0 1 0 1]]
"""

# Split every trigram into an input bigram (first N-1 words) and a target word (last word)
x_trigm = []
y_trigm = []
for l in newl_app:
    x_str = " ".join(l.split()[0:N-1])
    y_str = l.split()[N-1]
    x_trigm.append(x_str)
    y_trigm.append(y_str)

x_trigm_check = vectorizer.fit_transform(x_trigm).todense()
y_trigm_check = vectorizer.fit_transform(y_trigm).todense()

# Dictionaries from word to integer and integer to word
dictnry = vectorizer.vocabulary_
rev_dictnry = {v: k for k, v in dictnry.items()}

X = np.array(x_trigm_check)
Y = np.array(y_trigm_check)

Xtrain, Xtest, Ytrain, Ytest, xtrain_tg, xtest_tg = train_test_split(X, Y, x_trigm, test_size=0.3, random_state=1)

print("X Train shape", Xtrain.shape, "Y Train shape", Ytrain.shape)
print("X Test shape", Xtest.shape, "Y Test shape", Ytest.shape)

# Model Building
from keras.layers import Input, Dense, Dropout
from keras.models import Model

np.random.seed(1)

BATCH_SIZE = 128
NUM_EPOCHS = 20

input_layer = Input(shape=(Xtrain.shape[1],), name="input")
first_layer = Dense(1000, activation='relu', name="first")(input_layer)
first_dropout = Dropout(0.5, name="firstdout")(first_layer)
second_layer = Dense(800, activation='relu', name="second")(first_dropout)
third_layer = Dense(1000, activation='relu', name="third")(second_layer)
third_dropout = Dropout(0.5, name="thirdout")(third_layer)
fourth_layer = Dense(Ytrain.shape[1], activation='softmax', name="fourth")(third_dropout)

history = Model(input_layer, fourth_layer)
history.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
print(history.summary())

# Model Training
history.fit(Xtrain, Ytrain, batch_size=BATCH_SIZE, epochs=NUM_EPOCHS, verbose=1, validation_split=0.2)

# Model Prediction
Y_pred = history.predict(Xtest)

# Test: show a few random test bigrams with the actual and predicted third word
print("Prior bigram words", "|Actual", "|Predicted", "\n")
import random
NUM_DISPLAY = 10
for i in random.sample(range(len(xtest_tg)), NUM_DISPLAY):
    print(i, xtest_tg[i], "|", rev_dictnry[np.argmax(Ytest[i])], "|", rev_dictnry[np.argmax(Y_pred[i])])

Output:

X Train shape (17947, 2559) Y Train shape (17947, 2559)
X Test shape (7692, 2559) Y Test shape (7692, 2559)
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
input (InputLayer)           (None, 2559)              0
_________________________________________________________________
first (Dense)                (None, 1000)              2560000
_________________________________________________________________
firstdout (Dropout)          (None, 1000)              0
_________________________________________________________________
second (Dense)               (None, 800)               800800
_________________________________________________________________
third (Dense)                (None, 1000)              801000
_________________________________________________________________
thirdout (Dropout)           (None, 1000)              0
_________________________________________________________________
fourth (Dense)               (None, 2559)              2561559
=================================================================
Total params: 6,723,359
Trainable params: 6,723,359
Non-trainable params: 0
_________________________________________________________________
None

Prior bigram words |Actual |Predicted
595 words don | fit | know
3816 in tone | of | of
5792 queen had | only | been
2757 who seemed | to | to
5393 her and | she | she
4197 heard of | one | its
2464 sneeze were | the | of
1590 done with | said | whiting
3039 and most | things | of
4226 the queen | of | said
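
The script above only scores held-out test trigrams. To query the model for an arbitrary bigram, a minimal sketch along these lines could be appended to it; it reuses vectorizer, history and rev_dictnry from the script (note that vectorizer was last fitted on the target words, so the bigram is encoded with that same vocabulary, mirroring the script's behaviour):

# Hedged sketch: predict the most likely third word for a new bigram string.
# Reuses `vectorizer`, `history` (the Keras model) and `rev_dictnry` defined above.
def predict_next_word(bigram):
    vec = np.array(vectorizer.transform([bigram]).todense())   # bigram string -> count vector
    probs = history.predict(vec)                                # softmax over the vocabulary
    return rev_dictnry[np.argmax(probs[0])]                     # index of the best word -> word

print(predict_next_word("the queen"))   # e.g. might print "said", as in the table above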

The training results are poor because the word-vector dimensionality is too high (2,559) relative to the small number of words in the dataset. Besides word-level prediction, character-level prediction is also possible, in which every space counts as a token of its own.
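
A quick, self-contained sketch of that character-level variant (the sentence is illustrative, not from the post): every character, including the space, becomes a token, and trigrams are built over characters. The same CountVectorizer/Keras pipeline could then be reused with a vocabulary of only a few dozen symbols instead of 2,559 words.

import nltk

text = "alice was beginning to get very tired"
char_trigrams = list(nltk.ngrams(list(text), 3))   # characters (including spaces) as tokens
print(char_trigrams[:4])   # [('a', 'l', 'i'), ('l', 'i', 'c'), ('i', 'c', 'e'), ('c', 'e', ' ')]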
