Original link: http://www.one2know.cn/nlp17/

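  • Dataset

    The 20 Newsgroups dataset bundled with scikit-learn: 18,846 messages in total, 11,314 in the training set and 7,532 in the test set, across 20 categories.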
from sklearn.datasets import fetch_20newsgroups
newsgroups_train = fetch_20newsgroups(subset='train')
newsgroups_test = fetch_20newsgroups(subset='test')
x_train = newsgroups_train.data
x_test = newsgroups_test.data
y_train = newsgroups_train.target
y_test = newsgroups_test.target
print('List of all 20 categories:')
print(newsgroups_train.target_names,'\n')
print('Sample Email:')
print(x_train[0])
print('Sample Target Category:')
print(y_train[0])
print(newsgroups_train.target_names[y_train[0]])

Output:

List of all 20 categories:
['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x', 'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball', 'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med', 'sci.space', 'soc.religion.christian', 'talk.politics.guns', 'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc']

Sample Email:
From: lerxst@wam.umd.edu (where's my thing)
Subject: WHAT car is this!?
Nntp-Posting-Host: rac3.wam.umd.edu
Organization: University of Maryland, College Park
Lines: 15

I was wondering if anyone out there could enlighten me on this car I saw
the other day. It was a 2-door sports car, looked to be from the late 60s/
early 70s. It was called a Bricklin. The doors were really small. In addition,
the front bumper was separate from the rest of the body. This is
all I know. If anyone can tellme a model name, engine specs, years
of production, where this car is made, history, or whatever info you
have on this funky looking car, please e-mail.

Thanks,
- IL
---- brought to you by your neighborhood Lerxst ----
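  • Implementation steps
  1. Preprocessing

    1) Remove punctuation

    2) Tokenize

    3) Convert all words to lowercase

    4) Remove stop words

    5) Keep only words of at least 3 characters

    6) Stem

    7) Tag parts of speech

    8) Lemmatize
  2. Convert to TF-IDF vectors
  3. Train and test the deep learning model
  4. Evaluate the model and analyze the results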
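To make step 2 concrete, here is a minimal pure-Python sketch of what a TF-IDF vector is. It assumes the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1` followed by L2 normalization, which is how scikit-learn's `TfidfVectorizer` behaves with its default settings and `norm='l2'`. The function name `tfidf_vectors` and the toy corpus are made up for illustration, and the sketch ignores n-grams, `min_df` and `max_features`.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return (vocabulary, L2-normalized TF-IDF vectors) for tokenized documents."""
    n_docs = len(docs)
    vocab = sorted({tok for doc in docs for tok in doc})
    # Document frequency: in how many documents each term occurs
    df = Counter(tok for doc in docs for tok in set(doc))
    # Smoothed inverse document frequency (assumed formula, see lead-in)
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    vectors = []
    for doc in docs:
        tf = Counter(doc)  # raw term counts in this document
        raw = [tf[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0
        vectors.append([v / norm for v in raw])
    return vocab, vectors


# Hypothetical toy corpus, already preprocessed and tokenized
toy_docs = [
    ["car", "engine", "door"],
    ["car", "bumper", "car"],
    ["hockey", "team", "game"],
]
vocab, vectors = tfidf_vectors(toy_docs)
for vec in vectors:
    print({t: round(w, 3) for t, w in zip(vocab, vec) if w > 0})
```

In the first toy document, "car" gets a lower weight than "engine" because it also occurs in the second document. This is the effect the IDF term is meant to produce.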
  • Code
from sklearn.datasets import fetch_20newsgroups
newsgroups_train = fetch_20newsgroups(subset='train')
newsgroups_test = fetch_20newsgroups(subset='test')
x_train = newsgroups_train.data
x_test = newsgroups_test.data
y_train = newsgroups_train.target
y_test = newsgroups_test.target
# print('List of all 20 categories:')
# print(newsgroups_train.target_names,'\n')
# print('Sample Email:')
# print(x_train[0])
# print('Sample Target Category:')
# print(y_train[0])
# print(newsgroups_train.target_names[y_train[0]])

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import string
import pandas as pd
from nltk import pos_tag
from nltk.stem import PorterStemmer

# Note: the NLTK tokenizer, stop word list, POS tagger and WordNet data must be
# downloaded beforehand with nltk.download(); resource names vary by NLTK version.
def preprocessing(text):
    # Replace every punctuation character with a space, split on whitespace,
    # then re-join the pieces with single spaces
    text2 = ' '.join(''.join([' ' if ch in string.punctuation else ch for ch in text]).split())
    # Tokenize
    tokens = [word for sent in nltk.sent_tokenize(text2) for word in nltk.word_tokenize(sent)]
    # Lowercase
    tokens = [word.lower() for word in tokens]
    stopwds = set(stopwords.words('english'))
    # Drop stop words and tokens shorter than 3 characters
    tokens = [token for token in tokens if token not in stopwds and len(token) >= 3]
    # Stemming
    stemmer = PorterStemmer()
    tokens = [stemmer.stem(word) for word in tokens]
    # Part-of-speech tagging
    tagged_corpus = pos_tag(tokens)
    # Common noun, proper noun, plural proper noun, plural common noun
    Noun_tags = ['NN','NNP','NNPS','NNS']
    # Base form, past tense, present participle, past participle,
    # non-third-person present, third-person singular present
    Verb_tags = ['VB','VBD','VBG','VBN','VBP','VBZ']
    lemmatizer = WordNetLemmatizer()

    # Lemmatize verbs as verbs and every other token as a noun
    def prat_lemmatize(token,tag):
        if tag in Noun_tags:
            return lemmatizer.lemmatize(token,'n')
        elif tag in Verb_tags:
            return lemmatizer.lemmatize(token,'v')
        else:
            return lemmatizer.lemmatize(token,'n')

    pre_proc_text = ' '.join([prat_lemmatize(token,tag) for token,tag in tagged_corpus])
    return pre_proc_text

# Preprocess the datasets
x_train_preprocessed = []
for i in x_train:
    x_train_preprocessed.append(preprocessing(i))
x_test_preprocessed = []
for i in x_test:
    x_test_preprocessed.append(preprocessing(i))

# Compute the TF-IDF vector of each document
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(min_df=2,ngram_range=(1,2),stop_words='english',
                             max_features=10000,strip_accents='unicode',norm='l2')
# Convert the sparse TF-IDF matrices to dense ones before feeding them to the network
x_train_2 = vectorizer.fit_transform(x_train_preprocessed).todense()
x_test_2 = vectorizer.transform(x_test_preprocessed).todense()

# Import the deep learning modules
import numpy as np
from keras.models import Sequential
from keras.layers.core import Dense,Dropout,Activation
from keras.optimizers import Adadelta,Adam,RMSprop
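from keras.utils import np_utils

np.random.seed(0)
nb_classes = 20
batch_size = 64  # batch size
nb_epochs = 20   # number of training epochs

# Convert the 20 class labels to one-hot vectors
Y_train = np_utils.to_categorical(y_train,nb_classes)

# Build the Keras model: 3 hidden layers with 1000, 500 and 50 units,
# 50% dropout after each hidden layer, Adam as the optimizer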
model = Sequential()
model.add(Dense(1000,input_shape=(10000,)))
model.add(Activation('relu'))
model.add(Dropout(0.5))
model.add(Dense(500))
model.add(Activation('relu'))
model.add(Dropout(0.5))
model.add(Dense(50))
model.add(Activation('relu'))
model.add(Dropout(0.5))
model.add(Dense(nb_classes))
model.add(Activation('softmax'))
model.compile(loss='categorical_crossentropy',optimizer='adam')
# loss = categorical cross-entropy, optimizer = Adam
print(model.summary())

# Train the model
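model.fit(x_train_2,Y_train,batch_size=batch_size,epochs=nb_epochs,verbose=1)

# Predict class labels
# (predict_classes exists in the Keras version used here; later releases removed it,
# and np.argmax(model.predict(x), axis=1) gives the same labels there)
y_train_predclass = model.predict_classes(x_train_2,batch_size=batch_size)
y_test_preclass = model.predict_classes(x_test_2,batch_size=batch_size)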
from sklearn.metrics import accuracy_score,classification_report
print("\n\nDeep Neural Network - Train accuracy:",round(accuracy_score(y_train,y_train_predclass),3))
print("\nDeep Neural Network - Test accuracy:",round(accuracy_score(y_test,y_test_preclass),3))
print("\nDeep Neural Network - Train Classification Report")
print(classification_report(y_train,y_train_predclass))
print("\nDeep Neural Network - Test Classification Report")
print(classification_report(y_test,y_test_preclass))
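
Two steps in the listing are easy to miss: the integer targets are one-hot encoded before training, and the predicted classes come back as integers that are compared with the original labels. The following pure-Python sketch illustrates both ideas on made-up numbers. The helper names `to_one_hot`, `predict_class` and `accuracy` are hypothetical stand-ins and are not the Keras or scikit-learn implementations.

```python
def to_one_hot(labels, num_classes):
    """Encode integer labels as one-hot rows (the idea behind to_categorical)."""
    return [[1.0 if k == label else 0.0 for k in range(num_classes)]
            for label in labels]


def predict_class(probabilities):
    """Return the index of the largest probability in each row (argmax)."""
    return [max(range(len(row)), key=row.__getitem__) for row in probabilities]


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true integer labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical labels and softmax outputs for a 4-class problem
labels = [2, 0, 3]
probs = [
    [0.1, 0.2, 0.6, 0.1],
    [0.7, 0.1, 0.1, 0.1],
    [0.3, 0.4, 0.2, 0.1],
]
print(to_one_hot(labels, 4))
predicted = predict_class(probs)
print(predicted)                    # [2, 0, 1]
print(accuracy(labels, predicted))  # two of the three predictions are correct
```

This is why the listing trains on `Y_train` (one-hot) but evaluates against `y_train` and `y_test` (integers).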

Output of the full script:

Using TensorFlow backend.
WARNING:tensorflow:From
D:\Python37\Lib\site-packages\tensorflow\python\framework\op_def_library.py:263:
colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a
future version.
Instructions for updating:
Colocations handled automatically by placer.
WARNING:tensorflow:From
D:\Anaconda3\lib\site-packages\keras\backend\tensorflow_backend.py:3445: calling dropout
(from tensorflow.python.ops.nn_ops) with keep_prob is deprecated and will be removed in a
future version.
Instructions for updating:
Please use `rate` instead of `keep_prob`. Rate should be set to `rate = 1 - keep_prob`.
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
dense_1 (Dense)              (None, 1000)              10001000
_________________________________________________________________
activation_1 (Activation)    (None, 1000)              0
_________________________________________________________________
dropout_1 (Dropout)          (None, 1000)              0
_________________________________________________________________
dense_2 (Dense)              (None, 500)               500500
_________________________________________________________________
activation_2 (Activation)    (None, 500)               0
_________________________________________________________________
dropout_2 (Dropout)          (None, 500)               0
_________________________________________________________________
dense_3 (Dense)              (None, 50)                25050
_________________________________________________________________
activation_3 (Activation)    (None, 50)                0
_________________________________________________________________
dropout_3 (Dropout)          (None, 50)                0
_________________________________________________________________
dense_4 (Dense)              (None, 20)                1020
_________________________________________________________________
activation_4 (Activation)    (None, 20)                0
=================================================================
Total params: 10,527,570
Trainable params: 10,527,570
Non-trainable params: 0
_________________________________________________________________
None
WARNING:tensorflow:From
D:\Python37\Lib\site-packages\tensorflow\python\ops\math_ops.py:3066: to_int32 (from
tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.cast instead.
Epoch 1/20
2019-07-06 23:03:46.934966: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU
supports instructions that this TensorFlow binary was not compiled to use: AVX2
   64/11314 [..............................] - ETA: 4:41 - loss: 2.9946
  128/11314 [..............................] - ETA: 2:43 - loss: 2.9948
  192/11314 [..............................] - ETA: 2:03 - loss: 2.9951
  256/11314 [..............................] - ETA: 1:43 - loss: 2.9947
  320/11314 [..............................] - ETA: 1:32 - loss: 2.9938
(output for the remaining batches and epochs omitted)

Deep Neural Network - Train accuracy: 0.999

Deep Neural Network - Test accuracy: 0.811

Deep Neural Network - Train Classification Report
              precision    recall  f1-score   support

           0       1.00      1.00      1.00       480
           1       1.00      0.99      1.00       584
           2       0.99      1.00      1.00       591
           3       1.00      1.00      1.00       590
           4       1.00      1.00      1.00       578
           5       1.00      1.00      1.00       593
           6       1.00      1.00      1.00       585
           7       1.00      1.00      1.00       594
           8       1.00      1.00      1.00       598
           9       1.00      1.00      1.00       597
          10       1.00      1.00      1.00       600
          11       1.00      1.00      1.00       595
          12       1.00      1.00      1.00       591
          13       1.00      1.00      1.00       594
          14       1.00      1.00      1.00       593
          15       1.00      1.00      1.00       599
          16       1.00      1.00      1.00       546
          17       1.00      1.00      1.00       564
          18       1.00      1.00      1.00       465
          19       1.00      1.00      1.00       377

    accuracy                           1.00     11314
   macro avg       1.00      1.00      1.00     11314
weighted avg       1.00      1.00      1.00     11314

Deep Neural Network - Test Classification Report
              precision    recall  f1-score   support

           0       0.78      0.78      0.78       319
           1       0.70      0.74      0.72       389
           2       0.68      0.69      0.68       394
           3       0.71      0.69      0.70       392
           4       0.82      0.76      0.79       385
           5       0.84      0.74      0.78       395
           6       0.73      0.87      0.80       390
           7       0.85      0.86      0.86       396
           8       0.93      0.91      0.92       398
           9       0.89      0.91      0.90       397
          10       0.96      0.97      0.96       399
          11       0.87      0.95      0.91       396
          12       0.69      0.72      0.70       393
          13       0.88      0.77      0.82       396
          14       0.83      0.92      0.87       394
          15       0.91      0.84      0.88       398
          16       0.78      0.83      0.80       364
          17       0.97      0.87      0.92       376
          18       0.74      0.66      0.70       310
          19       0.59      0.62      0.61       251

    accuracy                           0.81      7532
   macro avg       0.81      0.81      0.81      7532
weighted avg       0.81      0.81      0.81      7532
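
Two numbers in the output can be checked by hand, which is a useful sanity test when adapting the script. The sketch below recomputes the parameter counts from the model summary, assuming each Dense layer has `n_in * n_out` weights plus `n_out` biases. It also computes the cross-entropy of a uniform guess over 20 classes, which is roughly where the loss of an untrained softmax classifier is expected to start. The helper name `dense_params` is made up for illustration.

```python
import math


def dense_params(n_in, n_out):
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out


# Input size, three hidden layers, output layer (taken from the listing)
layer_sizes = [10000, 1000, 500, 50, 20]
per_layer = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
print(per_layer)       # [10001000, 500500, 25050, 1020]
print(sum(per_layer))  # 10527570

# Cross-entropy of predicting probability 1/20 for every class
uniform_loss = math.log(20)
print(round(uniform_loss, 4))  # 2.9957
```

The per-layer counts and the total agree with the summary table, and the uniform-guess loss is close to the 2.9946 logged for the first batch of epoch 1.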
