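I am using a Gaussian Mixture Model (GMM) from sklearn.mixture to cluster my dataset. I can use the score() function to compute the log probability under the model. However, I am looking for a metric called "purity", as defined in this article. How can I implement it in Python? My current implementation looks like this: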

    import numpy as np
    # GaussianMixture replaces the old GMM class, which was removed in scikit-learn 0.20.
    from sklearn.mixture import GaussianMixture

    # X is a 1000 x 2 array (1000 samples of 2 coordinates).
    # It is actually a 2 dimensional PCA projection of data
    # extracted from the MNIST dataset, but this random array
    # is equivalent as far as the code is concerned.
    X = np.random.rand(1000, 2)

    clusterer = GaussianMixture(n_components=3, covariance_type='diag')
    clusterer.fit(X)
    cluster_labels = clusterer.predict(X)

    # Now I can count the labels for each cluster..
    count0 = list(cluster_labels).count(0)
    count1 = list(cluster_labels).count(1)
    count2 = list(cluster_labels).count(2)

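But I cannot work out how to loop through each cluster in order to calculate the confusion matrix (according to this question).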
David's answer works, but here is another way to do it.

    import numpy as np
    from sklearn import metrics

    def purity_score(y_true, y_pred):
        # compute contingency matrix (also called confusion matrix)
        contingency_matrix = metrics.cluster.contingency_matrix(y_true, y_pred)
        # return purity
        return np.sum(np.amax(contingency_matrix, axis=0)) / np.sum(contingency_matrix)

Also, if you need to compute inverse purity, all you need to do is replace "axis=0" with "axis=1".
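
To make the effect of the axis concrete, here is a minimal sketch that computes both quantities on a toy example. The function name `purity_scores` and the label arrays are hypothetical, chosen only for illustration, and it assumes ground-truth labels are available alongside the predicted cluster labels.

```python
import numpy as np
from sklearn import metrics


def purity_scores(y_true, y_pred):
    """Return (purity, inverse_purity) for two label arrays (illustrative helper)."""
    # Rows correspond to true classes, columns to predicted clusters.
    cm = metrics.cluster.contingency_matrix(y_true, y_pred)
    total = cm.sum()
    # Purity: for each cluster (column), count its most common true class.
    purity = cm.max(axis=0).sum() / total
    # Inverse purity: for each true class (row), count its most common cluster.
    inverse_purity = cm.max(axis=1).sum() / total
    return purity, inverse_purity


# Hypothetical toy labels: 2 true classes, 3 predicted clusters.
y_true = np.array([0, 0, 0, 1, 1, 1])
y_pred = np.array([0, 0, 1, 1, 2, 2])

purity, inverse_purity = purity_scores(y_true, y_pred)
print(f"{purity:.3f}")          # 0.833
print(f"{inverse_purity:.3f}")  # 0.667
```

Here the contingency matrix is [[2, 1, 0], [0, 1, 2]]. The column maxima sum to 5, giving a purity of 5/6, and the row maxima sum to 4, giving an inverse purity of 4/6. Splitting a class across more clusters tends to raise purity while lowering inverse purity, which is why the two are often reported together.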