作业任务:

使用98年人民日报语料库进行中文分词训练及测试。

作业输入:

98年人民日报语料库(1998-01-105-带音.txt),用80%的数据作为训练集,20%的数据作为验证集。

运行环境:

Jupyter Notebook, Python3

作业方法:

实现了前向匹配算法的分词功能。

源码地址:

https://github.com/YanqiangWang/NLP-Summer-Course

作业步骤:

1.处理语料库: 删除段前标号,以及词性标注。

# 读取原始语料文件
in_path = '1998-01-105-带音.txt'
file = open(in_path, encoding='gbk')
in_data = file.readlines()
# 预处理后的语料库
curpus_path = 'curpus.txt'
curpusfile = open(curpus_path, 'w', encoding='utf-8')
#删除段前标号,[],{},词性标注(最短匹配)
for sentence in in_data:
words = sentence.strip().split(' ')
words.pop(0) for word in words:
if word.strip() != '':
if word.startswith('['):
word = word[1:]
elif ']' in word:
word = word[0:word.index(']')] if '{' in word:
word = word[0:word.index('{')] w_c = word.split('/')
# 生成语料库
curpusfile.write(w_c[0] + ' ') curpusfile.write('\n')

2.随机划分训练集80%和验证集20%。

from sklearn.model_selection import train_test_split

# 随机划分
curpus = open(curpus_path, encoding='utf-8').readlines()
train_data, test_data = train_test_split(
curpus, test_size=0.2, random_state=10)
# 查看划分后的数据大小
print(len(curpus))
print(len(train_data) / len(curpus))
print(len(test_data) / len(curpus))
22787
0.7999736691973494
0.20002633080265064

3.前向匹配算法FMM的实现。

# 生成词典
from tqdm import tqdm_notebook dic = [] for sentence in tqdm_notebook(train_data):
words = sentence.strip().split(' ')
for word in words:
if word.strip() != '':
if word not in dic:
dic.append(word)
# 设置单词最大长度
max_dic_len = 5
# 生成分词测试文本
test_text = []
for sentence in test_data:
words = sentence.strip().split(' ')
test_text.append(''.join(words))
# 保存验证集
test_path = 'test.txt'
testfile = open(test_path, 'w', encoding='utf-8')
for sentence in test_data:
testfile.write(sentence)
# 保存分词结果
result_path = 'result.txt'
resultfile = open(result_path, 'w', encoding='utf-8')
# 前向匹配
for sentence in tqdm_notebook(test_text):
sent = sentence
words = []
max_len = max_dic_len
while(len(sent) > 0):
word_len = max_len
for i in range(0, max_len):
word = sent[0:word_len]
if word_len == 1 or word in dic:
sent = sent[word_len:]
words.append(word)
word = []
break
else:
word_len -= 1
word = []
resultfile.write(' '.join(words) + '\n')

性能评价

查准率,查全率,F度量

Precision = (Number of words correctly segmented) / (Number of words segmented) * 100%

Recall = (Number of words correctly segmented) / (Number of words in the reference) * 100%

F measure = 2 * P * R / (P + R)

def get_word(path):
f = open(path, 'r', encoding='utf-8')
lines = f.readlines()
return lines result_lines = get_word(result_path)
test_lines = get_word(test_path) list_num = len(test_lines) if len(test_lines) < len(result_lines) else len(result_lines)
right_num = 0
result_cnt = 0
test_cnt = 0 for i in tqdm_notebook(range(list_num)):
result_sent = list(result_lines[i].split())
test_sent = list(test_lines[i].split()) result_cnt += len(result_sent)
test_cnt += len(test_sent) str_result = ''
str_test = '' i_result = 0
i_test = 0 while i_result < len(result_sent) and i_test < len(test_sent):
word_result = result_sent[i_result]
word_test = test_sent[i_test] str_result += word_result
str_test += word_test if word_result == word_test:
right_num += 1
i_result += 1
i_test += 1 else:
while len(str_result) > len(str_test):
i_test += 1
if i_test >= len(test_sent):
break
str_test += test_sent[i_test] while len(str_result) < len(str_test):
i_result += 1
if i_result >= len(result_sent):
break
str_result += result_sent[i_result] i_test += 1
i_result += 1
print("生成结果词的个数:", result_cnt)
print("验证集结果词个数:", test_cnt) p = right_num / result_cnt
r = right_num / test_cnt
f = 2 * p * r / (p + r) print("查准率:", p)
print("查全率:", r)
print("F度量:", f)
生成结果词的个数: 227640
验证集结果词个数: 219680
查准率: 0.8301748374626603
查全率: 0.8602558266569555
F度量: 0.8449476884556917

最新文章

  1. 1.2Web API 2中的Action返回值
  2. [问题2014A12] 解答
  3. ORACLE Instant Client 配置
  4. Go安装
  5. 为IE单独写CSS的三种方法
  6. @Override在JDK1.5和JDK1.6中用法区别
  7. 我忽略了的DOCTYPE!
  8. YYModel 源码历险记 代码结构
  9. traceroute命令
  10. FRP 浅析
  11. twisted学习之reactor
  12. python/MySQL(索引、执行计划、BDA、分页)
  13. Bootstrap3 代码-代码块
  14. sql语句基本查询操作
  15. 微信小程序记账本进度一
  16. 2019.02.11 bzoj3165: [Heoi2013]Segment(线段树)
  17. JSP开发Web应用系统
  18. 关于MSCOMM.OCX无法正常注册的问题解决
  19. Python(线程进程3)
  20. js如何完整的显示较长的数字

热门文章

  1. 70. Climbing Stairs QuestionEditorial Solution
  2. wordpress 如何防止盗链
  3. 编程思想(POP,OOP,SOA,AOP)
  4. apache 访问状态 分析
  5. pikachu-越权漏洞(Over Permission)
  6. 在虚拟机中使用NetToPLCSim和PLC相连.
  7. CommunityServer的编译
  8. nginx基础(一)
  9. std::sort为什么保证严格弱序?
  10. js中(function(){})()的写法用处