基于中文地址类信息的分词处理 (A segment method of Chinese address information)


Cite this article:	刘哲, 夏秀峰, 周福才. 基于中文地址类信息的分词处理[J]. 沈阳航空工业学院学报, 2008, 25(4).

Authors:	刘哲 (LIU Zhe), 夏秀峰 (XIA Xiu-feng), 周福才 (ZHOU Fu-cai)

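Affiliations:	1. Computing Center, Shenyang Normal University, Shenyang, Liaoning 110034; School of Computer Science, Shenyang Institute of Aeronautical Engineering, Shenyang, Liaoning 110136 2. School of Computer Science, Shenyang Institute of Aeronautical Engineering, Shenyang, Liaoning 110136 3. School of Information Science and Engineering, Northeastern University, Shenyang, Liaoning 110004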

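Abstract (translated from the Chinese):	Identifying and eliminating approximately duplicated records is an active problem in cleaning dirty data in data warehouses. To handle duplicated Chinese address information, a word segmentation strategy based on feature characters is proposed. After a metadata database containing the segmentation rules is built, the feature-character-based segmentation algorithm is described. Experimental results show that segmentation time changes little as the data set grows. Applying the segmentation method to the detection of duplicated Chinese address records therefore does not increase detection time.
Keywords:	approximately duplicated records; Chinese address; feature character; word segmentation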
A segment method of Chinese address information

LIU Zhe, XIA Xiu-feng, ZHOU Fu-cai. A segment method of Chinese address information[J]. Journal of Shenyang Institute of Aeronautical Engineering, 2008, 25(4).

Authors:	LIU Zhe, XIA Xiu-feng, ZHOU Fu-cai

Abstract:	Eliminating approximately duplicated records is an active problem in cleansing dirty data in data warehouses. To process Chinese address information, a segmentation mechanism based on feature words is proposed. A meta-database of segmentation rules is established, and the feature-word-based segmentation algorithm is presented. The experimental results indicate that segmentation time changes little as the data set grows, so the method can be used to detect approximately duplicated records without increasing detection time.

Keywords:	Approximately duplicated records; Chinese address information; Tagged word; Segment
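
The abstract does not give the segmentation rules or the algorithm itself, so the following is only a minimal sketch of what feature-based address segmentation could look like. It assumes that a feature is a suffix such as 省, 市, 区, 路 or 号 that closes one address component. The `FEATURE_RULES` table (a stand-in for the rule meta-database), the level names, and the function `segment_address` are hypothetical and not taken from the paper.

```python
# Hypothetical rule table standing in for the paper's rule meta-database.
# Each feature suffix maps to an assumed address level; the actual rules
# used by the authors are not given in the abstract.
FEATURE_RULES = {
    "省": "province",
    "市": "city",
    "区": "district",
    "县": "county",
    "街道": "subdistrict",
    "大街": "street",
    "路": "street",
    "街": "street",
    "号": "number",
}


def segment_address(address, rules=FEATURE_RULES):
    """Split an address after each feature suffix, scanning left to right.

    Returns a list of (segment, level) tuples.
    """
    # Try longer suffixes first so that "大街" is preferred over "街".
    suffixes = sorted(rules, key=len, reverse=True)
    segments = []
    start = 0  # index where the current segment begins
    i = 0
    while i < len(address):
        matched = None
        # Require at least one character before the suffix, so a feature
        # character that opens a name (e.g. 市 in 市府路) is not cut off.
        if i > start:
            for suffix in suffixes:
                if address.startswith(suffix, i):
                    matched = suffix
                    break
        if matched:
            end = i + len(matched)
            segments.append((address[start:end], rules[matched]))
            start = i = end
        else:
            i += 1
    if start < len(address):
        # Trailing text without a feature suffix gets the level "unknown".
        segments.append((address[start:], "unknown"))
    return segments


if __name__ == "__main__":
    # Illustrative input only.
    for segment, level in segment_address("辽宁省沈阳市皇姑区黄河北大街253号"):
        print(segment, level)
```

Under these assumptions the example address splits into 辽宁省, 沈阳市, 皇姑区, 黄河北大街 and 253号. A place name that contains a feature character in its interior (for example 铁路街) would be split too early by this sketch; a fuller rule set would have to handle such cases.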
This article is indexed in CNKI, VIP, Wanfang Data, and other databases.
