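Search queries are usually short. When the query is long, for example a whole passage of text for which we want to find similar text, the WAND (weak AND) algorithm is the usual choice. It is well established in advertising systems, mainly in AdSense-style scenarios where the system has to retrieve ads similar to the content of a page.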

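In brief: text relevance is usually computed by querying an inverted index. That already saves a great deal of time compared with scanning every document, but it can still be slow.
The reason is that we usually want only the top n results, yet the expensive relevance computation also runs on documents that are clearly poor matches. The weak-AND algorithm uses each term's maximum possible contribution to estimate an upper bound on a document's relevance, then applies a threshold to that bound to prune entries in the posting lists, which speeds up the query.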

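WAND first needs an upper bound on each term's contribution to relevance. The simplest relevance measure is TF*IDF. The TF of a term in the query is generally 1 and its IDF is fixed, so the task reduces to bounding the term's TF in a document. TF is usually normalized by dividing by the total number of words in the document, so what we estimate is the largest proportion of any document that the term can occupy. This can be computed offline.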
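The offline step described above could look roughly like the sketch below. It is only an illustration: the function name `term_upper_bounds`, the toy corpus, and the smoothed IDF formula `log(1 + N/df)` are assumptions made here, not something the original post specifies.

```python
import math
from collections import Counter


def term_upper_bounds(docs):
    """Return {term: upper bound of its TF*IDF contribution}.

    docs maps doc_id -> list of tokens (hypothetical input format).
    The bound is IDF times the largest normalized TF seen in any document.
    """
    n_docs = len(docs)
    doc_freq = Counter()  # number of documents containing each term
    max_tf = {}           # largest normalized TF of each term over all docs
    for tokens in docs.values():
        if not tokens:
            continue
        for term, count in Counter(tokens).items():
            doc_freq[term] += 1
            ratio = count / len(tokens)  # normalized TF in this document
            max_tf[term] = max(max_tf.get(term, 0.0), ratio)
    # Assumed IDF variant: log(1 + N/df), which stays positive for every term.
    return {t: max_tf[t] * math.log(1 + n_docs / doc_freq[t]) for t in max_tf}


sample_docs = {
    1: ["a", "b", "a", "c"],
    2: ["a", "d"],
    3: ["b", "b"],
}
sample_ub = term_upper_bounds(sample_docs)
```

Because the bound depends only on the corpus, it can be recomputed whenever the index is rebuilt and stored next to each term's posting list.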

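Once each term's upper bound is known, the upper bound on the relevance between a query and a document follows directly: it is the sum of the upper bounds of the terms they share.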

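So for a query, look up the contribution upper bound of every query term. For a document, find the terms that appear in both the document and the query and sum their upper bounds. Compare that sum with a preset threshold: if it exceeds the threshold, the document goes on to the full relevance computation; otherwise it is discarded.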
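A minimal sketch of this filter, before any WAND-style skipping is applied, is shown below. The function name `candidates_above_threshold` is hypothetical; the term bounds and posting lists reuse the toy values from the reference code in this post, and the comparison uses `>=` as that code does.

```python
from collections import defaultdict


def candidates_above_threshold(query_terms, inverted_index, ub, threshold):
    """Return {doc_id: upper bound} for docs whose bound reaches the threshold.

    The bound of a doc is the sum of ub[term] over the query terms it contains.
    """
    bound = defaultdict(float)
    for term in set(query_terms):
        # Terms missing from the index simply contribute nothing.
        for doc_id in inverted_index.get(term, []):
            bound[doc_id] += ub[term]
    return {doc_id: b for doc_id, b in bound.items() if b >= threshold}


toy_ub = {"t0": 0.5, "t1": 1, "t2": 2, "t3": 3, "t4": 4}
toy_index = {
    "t0": [1, 3, 26],
    "t1": [1, 2, 4, 10, 100],
    "t2": [2, 3, 6, 34, 56],
    "t3": [1, 4, 5, 23, 70, 200],
    "t4": [5, 14, 78],
}
survivors = candidates_above_threshold(
    ["t0", "t1", "t2", "t3", "t4"], toy_index, toy_ub, threshold=4
)
print(sorted(survivors.items()))
```

Note that this version still touches every posting of every query term; it only avoids the full scoring step for documents whose bound is too low.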

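Computing the n most similar documents this way means fetching every document, computing the upper bound for each one, comparing it with the threshold, and then deciding whether the document belongs in the top n. That works, but it can be optimized. The aim of the optimization is to do as little of this pre-computation as possible. The algorithm described in the WAND paper is as follows: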

http://wulc.me/2018/03/18/Wand%20%E7%AE%97%E6%B3%95%E4%BB%8B%E7%BB%8D%E4%B8%8E%E5%AE%9E%E7%8E%B0/

import heapq

UB = {"t0": 0.5, "t1": 1, "t2": 2, "t3": 3, "t4": 4}  # upper bound of each term's contribution
LAST_ID = 999999999999  # sentinel, larger than every doc id in the inverted index
THETA = 2  # threshold for deciding whether to fully score a doc against the query
TOPN = 3  # maximum number of results


class WAND:
    def __init__(self, InvertIndex):
        """init inverted index and necessary variables"""
        self.result_list = []  # min-heap of (score, doc_id)
        self.inverted_index = InvertIndex  # term -> docid1, docid2, docid3 ...
        self.current_doc = 0
        self.current_inverted_index = {}  # term -> [current doc id, position in posting list]
        self.query_terms = []
        self.sort_terms = []
        self.threshold = THETA
        self.last_id = LAST_ID

    def __init_query(self, query_terms):
        """init variables for a new query"""
        self.current_doc = 0
        self.current_inverted_index = {}
        self.query_terms = []
        self.sort_terms = []
        for term in query_terms:
            if term in self.inverted_index:  # a term may not appear in the inverted index
                doc_id = self.inverted_index[term][0]
                self.query_terms.append(term)
                self.current_inverted_index[term] = [doc_id, 0]  # [docid, index]
                self.sort_terms.append([doc_id, term])

    def __pick_term(self, pivot_index):
        """select a term before pivot_index in the sorted term list.
        The paper recommends the term with max idf; here we just take the first term,
        and return its index instead of the term itself for speed"""
        return 0

    def __find_pivot_term(self):
        """find the first term at which the accumulated upper bound reaches the threshold"""
        score = 0
        for i in range(len(self.sort_terms)):
            score += UB[self.sort_terms[i][1]]
            if score >= self.threshold:
                return [self.sort_terms[i][1], i]  # [term, index]
        return [None, len(self.sort_terms)]

    def __iterator_invert_index(self, change_term, docid, pos):
        """find the first new_doc_id >= docid in the posting list of change_term,
        starting at pos; fall back to self.last_id if none exists"""
        doc_list = self.inverted_index[change_term]
        new_doc_id, new_pos = self.last_id, len(doc_list) - 1  # fallback when nothing matches
        for i in range(pos, len(doc_list)):
            # every posting list ends with self.last_id, so this normally succeeds
            if doc_list[i] >= docid:
                new_pos = i
                new_doc_id = doc_list[i]
                break
        return [new_doc_id, new_pos]

    def __advance_term(self, change_index, doc_id):
        """move the cursor of term self.sort_terms[change_index] to the first doc >= doc_id"""
        change_term = self.sort_terms[change_index][1]
        pos = self.current_inverted_index[change_term][1]
        new_doc_id, new_pos = self.__iterator_invert_index(change_term, doc_id, pos)
        self.current_inverted_index[change_term] = [new_doc_id, new_pos]
        self.sort_terms[change_index][0] = new_doc_id

    def __next(self):
        """return the next candidate doc id, or None when no candidate is left"""
        while True:
            self.sort_terms.sort()  # sort terms by current doc id
            pivot_term, pivot_index = self.__find_pivot_term()
            if pivot_term is None:  # no more candidates
                return None
            pivot_doc_id = self.current_inverted_index[pivot_term][0]
            if pivot_doc_id == self.last_id:  # no more candidates
                return None
            if pivot_doc_id <= self.current_doc:
                # pivot doc was already considered; move one term past it
                change_index = self.__pick_term(pivot_index)
                self.__advance_term(change_index, self.current_doc + 1)
            else:
                first_doc_id = self.sort_terms[0][0]
                if pivot_doc_id == first_doc_id:
                    self.current_doc = pivot_doc_id
                    return self.current_doc  # return the doc for full scoring
                else:
                    # advance all preceding terms (not just one) to the pivot doc
                    change_index = 0
                    while change_index < pivot_index:
                        self.__advance_term(change_index, pivot_doc_id)
                        change_index += 1

    def __insert_heap(self, doc_id, score):
        """store the top N results in a min-heap"""
        if len(self.result_list) < TOPN:
            heapq.heappush(self.result_list, (score, doc_id))
        else:
            heapq.heappushpop(self.result_list, (score, doc_id))

    def __calculate_doc_relevence(self, docid):
        """fully calculate the relevance between doc and query
        (this toy version just sums the upper bounds of the matching terms)"""
        score = 0
        for term in self.query_terms:
            if docid in self.inverted_index[term]:
                score += UB[term]
        return score

    def perform_query(self, query_terms):
        self.__init_query(query_terms)
        while True:
            candidate_docid = self.__next()
            if candidate_docid is None:
                break
            # score the candidate and insert it into the heap
            print("candidate doc", candidate_docid)
            full_doc_score = self.__calculate_doc_relevence(candidate_docid)
            self.__insert_heap(candidate_docid, full_doc_score)
            print("result list ", self.result_list)
        return self.result_list


if __name__ == "__main__":
    testIndex = {}
    testIndex["t0"] = [1, 3, 26, LAST_ID]
    testIndex["t1"] = [1, 2, 4, 10, 100, LAST_ID]
    testIndex["t2"] = [2, 3, 6, 34, 56, LAST_ID]
    testIndex["t3"] = [1, 4, 5, 23, 70, 200, LAST_ID]
    testIndex["t4"] = [5, 14, 78, LAST_ID]

    w = WAND(testIndex)
    final_result = w.perform_query(["t0", "t1", "t2", "t3", "t4"])
    print("=================final result=======================")
    # the heap is not fully ordered, so sort before printing from best to worst
    for score, doc_id in sorted(final_result, reverse=True):
        print("doc {0}, relevance score {1}".format(doc_id, score))
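The reference code keeps the threshold fixed at `THETA`. A common refinement, and as far as I understand the one the WAND paper itself describes, is to raise the threshold to the lowest score currently held in the top-n heap, so that pruning gets more aggressive as better results arrive. The sketch below shows only that bookkeeping; the class name `TopN` and its methods are hypothetical and are not part of the code above.

```python
import heapq


class TopN:
    """Keep the n best (score, doc_id) pairs and expose a pruning threshold."""

    def __init__(self, n):
        self.n = n
        self.heap = []  # min-heap, so heap[0] is the weakest kept result

    def threshold(self):
        # Until the heap is full every candidate is worth scoring.
        return self.heap[0][0] if len(self.heap) == self.n else 0.0

    def offer(self, doc_id, score):
        """Try to add a scored doc; return True if it was kept."""
        if len(self.heap) < self.n:
            heapq.heappush(self.heap, (score, doc_id))
            return True
        if score > self.heap[0][0]:
            heapq.heapreplace(self.heap, (score, doc_id))
            return True
        return False

    def results(self):
        return sorted(self.heap, reverse=True)  # best first


top = TopN(2)
kept = [top.offer(doc_id, score)
        for doc_id, score in [(1, 4.5), (4, 4.0), (5, 7.0), (14, 4.0)]]
print(kept, top.threshold(), top.results())
```

In a WAND loop, the pivot search would compare the accumulated upper bounds against `top.threshold()` instead of a constant, so a document is fully scored only if it could still displace the weakest result.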

  
