Part of Natural Language Processing (NLP) is processing text by “tokenizing” language strings, that is, breaking a string of text into parts such as words or sentences. In this lesson, we will use the natural library to tokenize a string. First, we will break the string into words using WordTokenizer, WordPunctTokenizer, and TreebankWordTokenizer. Then we will split a string on a custom pattern using RegexpTokenizer, which with a suitable pattern can also break text into sentences.

var natural = require('natural');

// WordTokenizer returns the words and drops the punctuation.
var tokenizer = new natural.WordTokenizer();
console.log(tokenizer.tokenize("your dog has fleas."));
// [ 'your', 'dog', 'has', 'fleas' ]

// TreebankWordTokenizer keeps punctuation and splits contractions.
tokenizer = new natural.TreebankWordTokenizer();
console.log(tokenizer.tokenize("my dog hasn't any fleas."));
// [ 'my', 'dog', 'has', 'n\'t', 'any', 'fleas', '.' ]

// RegexpTokenizer splits on a custom pattern, here a hyphen.
tokenizer = new natural.RegexpTokenizer({pattern: /\-/});
console.log(tokenizer.tokenize("flea-dog"));
// [ 'flea', 'dog' ]

// WordPunctTokenizer separates words from punctuation marks.
tokenizer = new natural.WordPunctTokenizer();
console.log(tokenizer.tokenize("my dog hasn't any fleas."));
// [ 'my', 'dog', 'hasn', '\'', 't', 'any', 'fleas', '.' ]
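
To make the idea behind pattern-based tokenizing more concrete, here is a minimal sketch in plain JavaScript that needs no library. It is not how natural implements its tokenizers; the helper name splitOnPattern is hypothetical, and the sentence pattern is a simplifying assumption that treats ".", "!" and "?" followed by whitespace as a sentence boundary (so abbreviations such as "Dr." would be split incorrectly).

```javascript
// Hypothetical helper (not part of natural): split text on a separator
// pattern and discard any empty strings left at the edges.
function splitOnPattern(text, pattern) {
  return text.split(pattern).filter(function (token) {
    return token.length > 0;
  });
}

// Words: split on runs of non-word characters, which drops punctuation.
var words = splitOnPattern("your dog has fleas.", /\W+/);
console.log(words);
// [ 'your', 'dog', 'has', 'fleas' ]

// Custom separator: split on a hyphen, like the RegexpTokenizer example.
var parts = splitOnPattern("flea-dog", /-/);
console.log(parts);
// [ 'flea', 'dog' ]

// Sentences (assumed rule): split on whitespace that follows ., ! or ?
var sentences = splitOnPattern(
  "My dog has fleas. Does yours? Mine does!",
  /(?<=[.!?])\s+/
);
console.log(sentences);
// [ 'My dog has fleas.', 'Does yours?', 'Mine does!' ]
```

The lookbehind in the sentence pattern keeps the closing punctuation attached to each sentence. For real text, a dedicated tokenizer is usually a better choice than a single regular expression.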
