Study on an XML approximately duplicated data cleaning method (一种XML相似重复数据的清理方法研究)

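Cite this article:	Chen Wei, Ding Qiulin. Study on an XML approximately duplicated data cleaning method (一种XML相似重复数据的清理方法研究)[J]. Journal of Beijing University of Aeronautics and Astronautics (北京航空航天大学学报), 2004, 30(9): 835-838.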

Authors:	Chen Wei (陈伟), Ding Qiulin (丁秋林)

Affiliation:	Computer Application Institute, Nanjing University of Aeronautics and Astronautics (南京航空航天大学计算机应用研究所), Nanjing 210016

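Abstract:	Given the importance of semi-structured XML data in data cleaning, this paper studies how to clean approximately duplicated XML data. The main contributions are: (1) an effective method for cleaning approximately duplicated XML data is proposed, which is highly adaptable because any XML similarity detection algorithm can be used within it; (2) a similarity detection algorithm based on tree edit distance is given, which effectively detects approximately duplicated XML data; (3) the lower and upper bounds of the tree edit distance are used to optimize this detection algorithm, which avoids unnecessary tree edit distance computations, lowers the complexity of similarity detection, and improves efficiency. This work lays a foundation for research on cleaning approximately duplicated XML data.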
Keywords:	rules library; algorithms library; data cleaning; extensible markup language (XML); approximately duplicated data
Article ID:	1001-5965(2004)09-0835-04
Received:	2003-06-02
Revised:	2003-06-02
Study on an XML approximately duplicated data cleaning method

Chen Wei, Ding Qiulin. Study on an XML approximately duplicated data cleaning method[J]. Journal of Beijing University of Aeronautics and Astronautics, 2004, 30(9): 835-838.

Authors:	Chen Wei, Ding Qiulin

Institution:	Computer Application Institute, Nanjing University of Aeronautics and Astronautics, Nanjing 210016, China

Abstract:	Given the importance of semi-structured XML data in data cleaning, this paper studies how to clean approximately duplicated XML data. An effective cleaning method is proposed; it is adaptable, because any approximate-duplicate detection algorithm can be used within it. A detection algorithm based on tree edit distance is presented, which detects approximately duplicated XML data effectively. The lower and upper bounds of the tree edit distance are then used to optimize this detection algorithm: the improved algorithm avoids computing the tree edit distance for pairs of XML data where it is not needed, which reduces the computational complexity of detection and improves efficiency. This work lays a foundation for further research on cleaning approximately duplicated XML data.
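
The bound-based filtering described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' algorithm: the abstract does not say which bounds or which tree edit distance procedure the paper uses, so everything here is an assumption made for illustration. Edit operations have unit cost, element text is modelled as an extra leaf node, the lower bound compares label multisets, the upper bound is the cost of a position-by-position top-down mapping, and the exact distance is a plain memoized recursion suitable only for small trees. All names (`parse`, `ted`, `lower_bound`, `upper_bound`, `is_duplicate`, `deduplicate`, the threshold `tau`) are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter
from functools import lru_cache


def parse(xml_text):
    """Parse XML into a hashable ordered tree: (label, (child, ...))."""
    def build(elem):
        kids = tuple(build(child) for child in elem)
        text = (elem.text or "").strip()
        # Assumption: non-empty element text becomes one extra leaf node.
        return (elem.tag, (((text, ()),) if text else ()) + kids)
    return build(ET.fromstring(xml_text))


def labels(tree):
    """Yield every node label of the tree in preorder."""
    yield tree[0]
    for child in tree[1]:
        yield from labels(child)


def size(tree):
    return sum(1 for _ in labels(tree))


@lru_cache(maxsize=None)
def forest_dist(f, g):
    """Unit-cost edit distance between two ordered forests (tuples of trees)."""
    if not f:
        return sum(size(t) for t in g)      # insert everything in g
    if not g:
        return sum(size(t) for t in f)      # delete everything in f
    (lv, cv), (lw, cw) = f[-1], g[-1]       # rightmost roots v and w
    return min(
        forest_dist(f[:-1] + cv, g) + 1,    # delete v, promote its children
        forest_dist(f, g[:-1] + cw) + 1,    # insert w
        forest_dist(cv, cw) + forest_dist(f[:-1], g[:-1]) + (lv != lw),
    )


def ted(t1, t2):
    """Exact tree edit distance (the expensive step the bounds try to avoid)."""
    return forest_dist((t1,), (t2,))


def lower_bound(t1, t2):
    """Nodes of the larger tree that cannot be matched to an equal label."""
    shared = sum((Counter(labels(t1)) & Counter(labels(t2))).values())
    return max(size(t1), size(t2)) - shared


def upper_bound(t1, t2):
    """Cost of one valid edit script: align children position by position."""
    (l1, c1), (l2, c2) = t1, t2
    n = min(len(c1), len(c2))
    cost = int(l1 != l2) + sum(upper_bound(a, b) for a, b in zip(c1, c2))
    return cost + sum(size(t) for t in c1[n:]) + sum(size(t) for t in c2[n:])


def is_duplicate(t1, t2, tau):
    """Return (verdict, reason); the exact distance runs only if bounds disagree."""
    if lower_bound(t1, t2) > tau:
        return False, "lower bound"
    if upper_bound(t1, t2) <= tau:
        return True, "upper bound"
    return ted(t1, t2) <= tau, "exact"


def deduplicate(trees, tau, detect=is_duplicate):
    """Keep the first record of each duplicate group; `detect` is pluggable."""
    kept = []
    for tree in trees:
        if not any(detect(tree, k, tau)[0] for k in kept):
            kept.append(tree)
    return kept


if __name__ == "__main__":
    a = parse("<book><title>XML Cleaning</title><author>Chen Wei</author></book>")
    b = parse("<book><title>XML Cleaning</title><author>Chen W.</author></book>")
    c = parse("<cd><artist>Someone</artist></cd>")
    for pair in ((a, b), (a, c)):
        print(is_duplicate(*pair, tau=1))
    print(len(deduplicate([a, b, c], tau=1)))
```

The `detect` parameter of `deduplicate` mirrors the abstract's claim that any detection algorithm can be used within the cleaning method: a different similarity test can be passed in, provided it returns a tuple whose first element is the verdict. The second element of the `is_duplicate` result shows which test decided the pair, so it is easy to see how often the exact distance was actually needed.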

Keywords:	rules library; algorithms library; data cleaning; extensible markup language (XML); approximately duplicated data
This article is indexed in CNKI, VIP (维普), Wanfang Data (万方数据), and other databases.