Information-theoretic ensemble clustering on web texts

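Cite this article:	WANG Yang, YUAN Kun, LIU Hongfu, WU Junjie, BAO Xiuguo. Information-theoretic ensemble clustering on web texts[J]. Journal of Beijing University of Aeronautics and Astronautics, 2016, 42(8): 1603-1611. DOI: 10.13700/j.bh.1001-5965.2015.0507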

Authors:	WANG Yang, YUAN Kun, LIU Hongfu, WU Junjie, BAO Xiuguo

Affiliation:	1. School of Economics and Management, Beijing University of Aeronautics and Astronautics, Beijing 100083

Funding:	National Natural Science Foundation of China (71531001; 71471009); National High Technology Research and Development Program of China ("863" Program) (SS2014AA012303); the Fundamental Research Funds for the Central Universities

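Abstract:	Although text clustering has been studied extensively in recent years, it remains a challenging problem in data mining, and the handling of weakly related or even noisy features is particularly difficult. To address this problem, a disassemble-assemble algorithmic framework for text clustering, DIAS, is proposed. The method first disassembles high-dimensional text data by simple random feature sampling to obtain diverse structural knowledge, which has the advantage of largely avoiding the bulk of noisy features. It then uses information-theoretic consensus clustering (ICC) to assemble the multi-view basic clustering knowledge into a high-quality consensus partitioning. Finally, experiments on eight real-world text data sets show that DIAS has clear advantages over other widely used algorithms, with especially strong results when the basic partitionings are weak. Because it is naturally suited to distributed computing, DIAS is a promising candidate to become a mainstream algorithm for large-scale text clustering.
Keywords:	text clustering; disassemble-assemble algorithm; information-theoretic consensus clustering; K-means; big data clustering
Received:	2015-07-30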
Information-theoretic ensemble clustering on web texts

WANG Yang, YUAN Kun, LIU Hongfu, WU Junjie, BAO Xiuguo. Information-theoretic ensemble clustering on web texts[J]. Journal of Beijing University of Aeronautics and Astronautics, 2016, 42(8): 1603-1611. DOI: 10.13700/j.bh.1001-5965.2015.0507

Authors:	WANG Yang, YUAN Kun, LIU Hongfu, WU Junjie, BAO Xiuguo

Affiliation:	1. School of Economics and Management, Beijing University of Aeronautics and Astronautics, Beijing 100083, China; 2. School of Mechanical Engineering and Automation, Beijing University of Aeronautics and Astronautics, Beijing 100083, China; 3. College of Engineering, Northeastern University, Boston 02115, USA; 4. National Computer Network Emergency Response Technical Team/Coordination Center of China, Beijing 100029, China

Abstract:	Although extensively studied, text clustering remains a critical challenge in the data mining community because of the curse of dimensionality. Various techniques have been proposed to overcome this difficulty, but the negative impact of weakly related or even noisy features remains a haunting nightmare. Meanwhile, the explosive growth of user-generated content on social media, which is extremely sparse, poses a further challenge to efficiency. In light of this, a disassemble-assemble (DIAS) framework is proposed for text clustering. DIAS employs simple random feature sampling to disassemble high-dimensional text data and gains diverse structural knowledge while avoiding the bulk of noisy features. The multi-view knowledge is then assembled by fast information-theoretic consensus clustering (ICC) to obtain a high-quality consensus partitioning. Extensive experiments on eight real-world text data sets demonstrate the advantages of DIAS over some widely used methods. In particular, DIAS shows appealing merits in learning from a bulk of very weak basic partitionings. Its natural suitability for distributed computing makes DIAS a promising candidate for big text clustering.

Keywords:	text clustering; disassemble-assemble algorithm; information-theoretic consensus clustering; K-means; big data clustering
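
The disassemble-assemble idea described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `kmeans` and `dias_sketch`, the parameters `n_views` and `sample_ratio`, and the toy data are all hypothetical. The assemble step also substitutes plain Euclidean K-means on one-hot encoded basic partitionings for the information-theoretic consensus clustering (ICC) used in the paper.

```python
import random


def kmeans(rows, k, rng, iters=20):
    """Plain Lloyd K-means (squared Euclidean distance); returns one label per row."""
    centers = [list(r) for r in rng.sample(rows, k)]
    labels = [0] * len(rows)
    for _ in range(iters):
        # Assignment step: index of the nearest center for every row.
        labels = [
            min(range(k),
                key=lambda c: sum((x - y) ** 2 for x, y in zip(row, centers[c])))
            for row in rows
        ]
        # Update step: mean of the members; an empty cluster keeps its old center.
        for c in range(k):
            members = [row for row, lab in zip(rows, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def dias_sketch(docs, k, n_views=10, sample_ratio=0.5, seed=0):
    """Disassemble by random feature sampling, then assemble the basic partitionings."""
    rng = random.Random(seed)
    n_features = len(docs[0])
    n_sampled = max(1, int(sample_ratio * n_features))
    encoded = [[] for _ in docs]
    for _ in range(n_views):
        # Disassemble: cluster the documents on a random subset of the features.
        feats = rng.sample(range(n_features), n_sampled)
        view = [[doc[f] for f in feats] for doc in docs]
        basic = kmeans(view, k, rng)
        # One-hot encode this basic partitioning and append it to each document's code.
        for code, lab in zip(encoded, basic):
            code.extend(1.0 if lab == c else 0.0 for c in range(k))
    # Assemble (stand-in for ICC): K-means on the concatenated one-hot codes.
    return kmeans(encoded, k, rng)


# Toy term-weight vectors (hypothetical): two topics plus two noisy features.
docs = [
    [5.0, 4.0, 0.0, 0.2, 1.0, 0.0],
    [4.0, 5.0, 0.3, 0.0, 0.0, 1.0],
    [5.0, 5.0, 0.0, 0.1, 1.0, 1.0],
    [0.0, 0.2, 5.0, 4.0, 0.0, 1.0],
    [0.3, 0.0, 4.0, 5.0, 1.0, 0.0],
    [0.1, 0.0, 5.0, 5.0, 0.0, 0.0],
]
labels = dias_sketch(docs, k=2)
print(labels)
```

Each view is clustered independently of the others, so the loop over views could in principle be distributed across workers, which is consistent with the abstract's remark on distributed computing.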
This article is indexed in Wanfang Data and other databases.