https://www.cs.utah.edu/~jeffp/teaching/cs5955/L4-Jaccard+Shingle.pdf

https://www.cs.utah.edu/~jeffp/teaching/cs5955/L5-Minhash.pdf

【Measurable space: convert the data (homeworks, webpages, emails) into an object in an abstract space where we know how to measure distance】

We will study how to define the distance between sets, specifically with the Jaccard distance. To illustrate and motivate this study, we will focus on using Jaccard distance to measure the distance between documents. This uses the common “bag of words” model, which is simplistic but sufficient for many applications.

We start with some big questions. This lecture will only begin to answer them.

• Given two homework assignments (reports), how can a computer detect whether one is likely to have been plagiarized from the other without understanding the content?
• In trying to index webpages, how does Google avoid listing duplicates or mirrors?
• How does a computer quickly understand emails, for either detecting spam or placing effective advertisements? (If an ad worked on one email, how can we determine which others are similar?)

【Bag of words: turning text passages into sets (convert documents into sets)】

4.2 Documents to Sets

How do we apply this set machinery to documents?

Bag of words vs. Shingles

The first option is the bag of words model, where each document is treated as an unordered set of words. A more general approach is to shingle the document: take consecutive words and group them as a single object. A k-shingle is a sequence of k consecutive words, so the set of all 1-shingles is exactly the bag of words model. An alternative name for a k-shingle is a k-gram; the two terms mean the same thing.

D1 : I am Sam.
D2 : Sam I am.
D3 : I do not like green eggs and ham.
D4 : I do not like them, Sam I am.

The (k = 1)-shingles of D1∪D2∪D3∪D4 are: {[I], [am], [Sam], [do], [not], [like], [green], [eggs], [and], [ham], [them]}.

The (k = 2)-shingles of D1∪D2∪D3∪D4 are: {[I am], [am Sam], [Sam I], [I do], [do not], [not like], [like green], [green eggs], [eggs and], [and ham], [like them], [them Sam]}. (The shingles [Sam Sam] and [am I] would appear only if the documents were concatenated into one text, because they span a document boundary.)

A document with n words has at most n − k + 1 distinct k-shingles. Storing them all takes O(kn) space, which is not a high overhead when k is small. The space also goes down when shingles repeat, since each distinct shingle is stored only once.

【Erratum: a document with n words has at most n − k + 1 k-shingles (the lecture notes say n − k); storing them takes O(kn) space】
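The shingling step and the n − k + 1 bound can be checked with a minimal Python sketch. The helper names `tokenize` and `k_shingles` are hypothetical, and the tokenizer (keep runs of letters, drop punctuation, preserve case) is an assumption made here so that the output matches the shingles listed above; the lecture notes do not prescribe one.

```python
import re

def tokenize(text):
    # Assumed tokenizer: runs of letters only, punctuation dropped, case kept.
    return re.findall(r"[A-Za-z]+", text)

def k_shingles(text, k):
    # Set of k-shingles, each stored as a tuple of k consecutive words.
    words = tokenize(text)
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

docs = {
    "D1": "I am Sam.",
    "D2": "Sam I am.",
    "D3": "I do not like green eggs and ham.",
    "D4": "I do not like them, Sam I am.",
}

# Union of the per-document shingle sets, for k = 1 and k = 2.
all_1 = set().union(*(k_shingles(t, 1) for t in docs.values()))
all_2 = set().union(*(k_shingles(t, 2) for t in docs.values()))

for name, text in docs.items():
    n = len(tokenize(text))
    for k in (1, 2):
        count = len(k_shingles(text, k))
        # The number of distinct k-shingles never exceeds n - k + 1.
        print(f"{name}: n={n}, k={k}, distinct shingles={count}, bound={n - k + 1}")
```

For D4 with k = 1 the count falls below the bound because the word "I" occurs twice, which is the sense in which repeated items reduce the space needed.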

【Jaccard as a measure of similarity: Jaccard with Shingles】

4.3 Jaccard with Shingles

So how do we put this together? Consider the (k = 2)-shingles for each of D1, D2, D3, and D4:

D1 : [I am], [am Sam]
D2 : [Sam I], [I am]
D3 : [I do], [do not], [not like], [like green], [green eggs], [eggs and], [and ham]
D4 : [I do], [do not], [not like], [like them], [them Sam], [Sam I], [I am]

Now the Jaccard similarity is as follows:

JS(D1, D2) = 1/3 ≈ 0.333
JS(D1, D3) = 0 = 0.0
JS(D1, D4) = 1/8 = 0.125
JS(D2, D3) = 0 = 0.0
JS(D2, D4) = 2/7 ≈ 0.286
JS(D3, D4) = 3/11 ≈ 0.273

Next time we will see how to use this special abstract structure of sets to compute this distance (approximately) very efficiently and at extremely large scale.
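The similarities above can be reproduced with a short Python sketch, using the standard definition JS(A, B) = |A ∩ B| / |A ∪ B| on the 2-shingle sets. The names `k_shingles` and `jaccard_similarity` are hypothetical, the letter-only tokenizer is an assumption, and returning 1 for two empty sets is a convention chosen here, not something the lecture notes state.

```python
import re
from fractions import Fraction
from itertools import combinations

def k_shingles(text, k):
    # Assumed tokenizer: runs of letters only, punctuation dropped, case kept.
    words = re.findall(r"[A-Za-z]+", text)
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard_similarity(a, b):
    # |A ∩ B| / |A ∪ B| as an exact fraction.
    if not a and not b:
        return Fraction(1)  # assumed convention for two empty sets
    return Fraction(len(a & b), len(a | b))

docs = {
    "D1": "I am Sam.",
    "D2": "Sam I am.",
    "D3": "I do not like green eggs and ham.",
    "D4": "I do not like them, Sam I am.",
}
sets = {name: k_shingles(text, 2) for name, text in docs.items()}

# Compare every pair of documents on their 2-shingle sets.
for x, y in combinations(sets, 2):
    js = jaccard_similarity(sets[x], sets[y])
    print(f"JS({x}, {y}) = {js} ≈ {float(js):.3f}")
```

Exact fractions are used so the output can be compared directly with the hand-computed values; the corresponding Jaccard distance is 1 − JS.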
