一种傅里叶域海量数据高速谱聚类方法 |
| |
引用本文: | 张熳,徐兆瑞,沈项军. 一种傅里叶域海量数据高速谱聚类方法[J]. 北京航空航天大学学报, 2022, 48(8): 1445-1454. DOI: 10.13700/j.bh.1001-5965.2021.0537 |
| |
作者姓名: | 张熳 徐兆瑞 沈项军 |
| |
作者单位: | 江苏大学 计算机科学与通信工程学院, 镇江 212013 |
| |
基金项目: | 国家自然科学基金61572240 |
| |
摘 要: | 谱聚类方法广泛应用于数据挖掘和模式识别等领域,但大规模数据上高计算代价的特征向量求解及大数据带来的巨大内存需求,使得其应用于大规模数据时受到了极大的限制。为此,研究了基于傅里叶域的海量数据高速谱聚类方法。利用数据模式的重复性特点在傅里叶域建模,将耗时的特征向量计算转化为对预先确定的傅里叶域判别基进行选择来确定最终的特征向量,计算过程只需进行简单的乘法和加法运算,计算量得到极大的约减; 分批次训练样本,使用部分样本即可估计出整体数据的特征向量分布,确定最终的特征向量,压缩了计算时间和内存需求。在Ijcnn1、RCV1、Covtype-mult、Poker及MNIST-8M等大规模数据上的实验结果表明,所提方法在聚类精度等各项指标基本保持的前提下,训练时间相比FastESC、LSSHC、SC_RB、SSEIGS及USPEC等方法最高快了810.58倍,证明了所提方法在处理大规模聚类数据方面具有显著优势。
|
关 键 词: | 谱聚类 傅里叶域 海量数据 高速计算 低内存需求 |
收稿时间: | 2021-09-08 |
A high-speed spectral clustering method in Fourier domain for massive data |
| |
Affiliation: | School of Computer Science and Communication Engineering, Jiangsu University, Zhenjiang 212013, China |
| |
Abstract: | Spectral clustering is widely used in data mining and pattern recognition. However, due to the high computational cost of eigenvector solutions and the huge memory requirements brought by big data, spectral clustering algorithm is greatly limited when it is applied to large-scale data. Therefore, this paper studies a high-speed spectral clustering method for massive data in the Fourier domain. This method makes full use of the repeatability of data pattern, and uses this characteristic to model in the Fourier domain. To get final eigenvectors, the time-consuming eigenvector pursuitcan be transformed into the selection of the pre-determined discriminant basis in the Fourier domain. The calculation process only needs simple multiplication and addition, so the amount of time for calculation is greatly reduced. On the other hand, due to the characteristics of calculation in the Fourier domain, another advantage of this method is that it can train the samples in batches, that is, only using part of the samples can well estimate eigenvector distribution in the whole data. The experimental results on large-scale data such as Ijcnn1, RCV1, Covtype-mult, Poker and MNIST-8M show that the training time of the proposed method is at most 810.58 times faster than that of algorithms FastESC, LSSHC, SC_RB, SSEIGS and USPEC, on the premise that the clustering accuracy and other indicators are basically maintained, which proves that the proposed method has significant advantages in processing large-scale data. |
| |
Keywords: | |
|
| 点击此处可从《北京航空航天大学学报》浏览原始摘要信息 |
|
点击此处可从《北京航空航天大学学报》下载全文 |
|