A Lucene Similarity Scoring Algorithm Incorporating Term Position Features

Computer Engineering and Applications (计算机工程与应用), 2014, 50(2): 129

BAI Peifa (1), WANG Chengliang (1, 2), XU Ling (2)

1. College of Computer Science, Chongqing University, Chongqing 400030, China
2. College of Software Engineering, Chongqing University, Chongqing 400030, China

BAI Peifa, WANG Chengliang, XU Ling. Scoring algorithm of similarity based on terms' position feature combination for Lucene. Computer Engineering and Applications, 2014, 50(2): 129-132.

Abstract: The similarity scoring algorithm is one of the core components of Lucene. Lucene's default scoring algorithm considers only how often the query terms occur in a document, not where they occur. After analysing the default algorithm, this paper proposes an improved algorithm that addresses this deficiency by combining features of the positional relationship between terms with Lucene's default similarity scoring algorithm. Experiments on the TREC dataset show that, compared with Lucene's default algorithm, the improved algorithm raises both the MAP and P@n evaluation metrics to a certain extent.

Key words: Lucene; similarity; full-text search

Document code: A
CLC number: TP311
doi: 10.3778/j.issn.1002-8331.1203-0223

1 Introduction

Lucene [1], a subproject of the Apache Software Foundation's Jakarta project, is an open-source full-text search engine toolkit implemented in Java. Owing to its open-source nature, excellent index structure, high performance, scalability, cross-platform support, and ease of use, Lucene is widely …

… improves the tf formula by taking the average term frequency within a document into account, and improves the document length normalization factor so that the system's "penalty" on long and short documents is more reasonable, thereby improving Lucene's similarity scoring mechanism. Reference [3] implements a new Chinese analyzer and applies it to Lucene in order to improve Chinese …