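Hierarchical Clustering Based on Distance Correlation

Zhang Lu, Kong Lingchen, Chen Huangyue

  1. School of Science, Beijing Jiaotong University, Beijing 100044
  • Received: 2018-06-05 Publication date: 2019-09-15 Release date: 2019-08-21
  • Funding:

    Supported by the National Natural Science Foundation of China (Grant Nos. 11431002 and 11671029).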

Zhang Lu, Kong Lingchen, Chen Huangyue. Agglomerative hierarchical clustering via distance correlation[J]. Mathematica Numerica Sinica, 2019, 41(3): 320-334.

AGGLOMERATIVE HIERARCHICAL CLUSTERING VIA DISTANCE CORRELATION

Zhang Lu, Kong Lingchen, Chen Huangyue   

  1. School of Science, Beijing Jiaotong University, Beijing 100044, China
  • Received: 2018-06-05 Online: 2019-09-15 Published: 2019-08-21
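With the arrival of the big-data era, massive data sets with complex structure have emerged in every field; for example, variables may differ in dimension and in scale. In practice, the relationships between variables are often not deterministic. The classical Pearson correlation coefficient reflects only the linear relationship between two variables of the same dimension, so it cannot fully characterize the dependence between variables. The distance correlation proposed by Szekely et al. in 2007, by contrast, can describe nonlinear relationships between variables of different dimensions. To explore the intrinsic information among variables, this paper proposes a maximum distance correlation method (complete DC clustering) for clustering variables, which has ultrametricity and space contractibility. To take full advantage of distance correlation, this method is then improved to obtain the class-whole distance correlation method (union DC clustering). When measuring the similarity between two classes, this method merges all variables in each class into a single whole and then computes the distance correlation between the two wholes, which may differ in dimension. Finally, the union DC method is applied to several real problems, which verifies the effectiveness of the algorithm.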
With the advent of the big-data era, huge amounts of data with complex structure, such as variables of different dimensions and scales, have appeared in various fields. The classical Pearson correlation measures only the linear relationship between two random variables of equal dimension. In 2007, Szekely et al. proposed distance correlation (DC), which characterizes multivariate independence for random variables of arbitrary dimension. To explore the internal relationships between variables, this paper studies two agglomerative hierarchical clustering methods. We first propose complete distance correlation clustering (complete DC clustering) for variable clustering, which has ultrametricity and space contractibility. We then propose union DC clustering as an improvement of complete DC clustering. Numerical results on real data are reported to demonstrate the effectiveness of the proposed union DC clustering.
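The union DC idea described in the abstract can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `distance_correlation` and `union_dc_clustering`, the greedy rule of merging the two classes with the largest distance correlation, and the `n_clusters` stopping rule are all hypothetical choices made here. Only the sample distance correlation follows the standard definition of Szekely et al. (2007).

```python
import numpy as np


def _centered_distances(x):
    """Double-centered Euclidean distance matrix of the rows of x (n x p)."""
    x = np.asarray(x, dtype=float)
    if x.ndim == 1:
        x = x[:, None]  # treat a vector as n observations of one variable
    d = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=2)
    return (d - d.mean(axis=0, keepdims=True)
              - d.mean(axis=1, keepdims=True) + d.mean())


def distance_correlation(x, y):
    """Sample distance correlation between x (n x p) and y (n x q).

    p and q may differ; only the number of observations n must match.
    """
    a, b = _centered_distances(x), _centered_distances(y)
    dcov2 = (a * b).mean()                    # squared sample distance covariance
    dvar2 = (a * a).mean() * (b * b).mean()   # product of squared distance variances
    if dvar2 <= 0:
        return 0.0
    return float(np.sqrt(max(dcov2, 0.0) / np.sqrt(dvar2)))


def union_dc_clustering(data, n_clusters):
    """Cluster the columns (variables) of data, an n x m array.

    Assumed merge rule: the similarity of two classes is the distance
    correlation between their stacked column blocks, and the most similar
    pair is merged until n_clusters classes remain.
    """
    data = np.asarray(data, dtype=float)
    clusters = [[j] for j in range(data.shape[1])]
    while len(clusters) > n_clusters:
        best, pair = -1.0, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                s = distance_correlation(data[:, clusters[i]],
                                         data[:, clusters[j]])
                if s > best:
                    best, pair = s, (i, j)
        i, j = pair
        clusters[i] = clusters[i] + clusters[j]  # merge class j into class i
        del clusters[j]
    return [sorted(c) for c in clusters]
```

Calling `union_dc_clustering(data, k)` on an array whose columns are the variables returns `k` lists of column indices. Because each class is treated as one multivariate block, the similarity between a class of three variables and a class of one variable is still a single distance correlation, which is the property the abstract highlights.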

[1] Pearson K. Contributions to the mathematical theory of evolution[J]. Philosophical Transactions of the Royal Society of London, 1894, A 185(1):71-110.

[2] Li S Z, Rizzo M L. K-groups: A generalization of k-means clustering[J]. arXiv e-prints, 2017.

[3] Szekely G J, Rizzo M L, Bakirov N K. Measuring and testing independence by correlation of distances[J]. Annals of Statistics, 2007, 35(6):2769-2794.

[4] Kong J, Klein B E K, Klein R, Lee K, Wahba G. Using distance correlation and SS-ANOVA to assess associations of familial relationships, lifestyle factors, diseases, and mortality[J]. Proceedings of the National Academy of Sciences, 2012, 109(50):20352-20357.

[5] Li R, Zhong W, Zhu L. Feature screening via distance correlation learning[J]. Journal of the American Statistical Association, 2012, 107:1129-1139.

[6] Sheng W, Yin X. Direction estimation in single-index models via distance covariance[J]. Journal of Multivariate Analysis, 2013, 122:148-161.

[7] Sheng W, Yin X. Sufficient dimension reduction via distance covariance[J]. Journal of Computational and Graphical Statistics, 2016, 25:91-104.

[8] Van Rooij A C M. Non-Archimedean Functional Analysis[M]. New York: M. Dekker, 1978.

[9] Chen Z M, Van Ness J W. Space-conserving agglomerative algorithms[J]. Journal of Classification, 1996, 13:157-163.

[10] Zhang Minqiang. Educational and Psychological Statistics[M]. Beijing: People's Education Press, 1993, 313-313 (in Chinese).

[11] Wu Cheng'ou, Qin Weiliang. Modern Practical Multivariate Statistical Analysis[M]. Beijing: China Meteorological Press, 2007, 148-150 (in Chinese).