ML Algorithms: Mini-Batch K-Means Clustering

May 5, 2021, 13:41:10

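K-means is one of the most popular clustering algorithms, mainly because of its good time performance. As the dataset grows, however, its computation time increases, because the algorithm needs to hold the entire dataset in main memory. For this reason, several methods have been proposed to reduce the time and space cost of the algorithm. One such approach is the Mini-Batch K-Means algorithm, which updates the cluster centers from small random batches of samples instead of the full dataset.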

```
Given a dataset D = {d1, d2, d3, ..., dn}, no. of iterations t,
batch size b, no. of clusters k.

k clusters C = {c1, c2, c3, ..., ck}

initialize k cluster centers O = {o1, o2, ..., ok}
# initialize each cluster as empty
Ci = Φ (1 <= i <= k)
# initialize the number of samples assigned to each cluster
Nci = 0 (1 <= i <= k)

for j = 1 to t do:
    # M is the batch data set; each xm is a sample randomly chosen from D
    M = {xm | 1 <= m <= b}

    # cache the nearest cluster center for each sample in the batch
    for m = 1 to b do:
        oi(xm) = the center in O closest to xm
    end for

    # update the cluster centers with the batch
    for m = 1 to b do:
        # get the cached cluster center for xm
        oi = oi(xm)
        # update the number of samples seen by that center
        Nci = Nci + 1
        # calculate the learning rate for that center
        lr = 1 / Nci
        # take a gradient step that moves the center toward xm
        oi = (1 - lr) * oi + lr * xm
    end for
end for
```
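The pseudocode above can be sketched in NumPy roughly as follows. This is a minimal illustration, not the reference implementation: the function names `mini_batch_kmeans` and `assign`, the choice to initialize centers from random samples, the reading of the caching step as a nearest-center lookup by squared Euclidean distance, and the synthetic three-blob data are all assumptions made for the example.

```python
import numpy as np


def mini_batch_kmeans(X, k, batch_size, n_iters, seed=0):
    """Sketch of the mini-batch K-means loop; returns (centers, counts)."""
    rng = np.random.default_rng(seed)
    # Assumption: initialize the centers with k distinct samples from X.
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    counts = np.zeros(k, dtype=int)  # Nci: samples seen by each center

    for _ in range(n_iters):
        # Draw a random batch M of b samples from the dataset.
        batch = X[rng.choice(len(X), size=batch_size, replace=False)]
        # Cache the nearest center for every sample in the batch.
        dists = ((batch[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        nearest = dists.argmin(axis=1)
        # Per-sample gradient step with learning rate 1 / Nci.
        for x, i in zip(batch, nearest):
            counts[i] += 1
            lr = 1.0 / counts[i]
            centers[i] = (1.0 - lr) * centers[i] + lr * x
    return centers, counts


def assign(X, centers):
    """Label each sample with the index of its nearest center."""
    dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return dists.argmin(axis=1)


# Hypothetical data: three well-separated blobs of 200 points each.
rng = np.random.default_rng(42)
true_centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(200, 2)) for c in true_centers])

centers, counts = mini_batch_kmeans(X, k=3, batch_size=50, n_iters=100)
labels = assign(X, centers)
```

With the learning rate set to 1 / Nci, each center is the running mean of all samples assigned to it so far, so later batches move the centers less and less.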

Using the scikit-learn library:

```python
from sklearn.cluster import MiniBatchKMeans
from sklearn.datasets import make_blobs
from sklearn.metrics.pairwise import pairwise_distances_argmin

# generate 3000 samples around four known centers
batch_size = 45
centers = [[1, 1], [-2, -1], [1, -2], [1, 9]]
n_clusters = len(centers)
X, labels_true = make_blobs(n_samples=3000, centers=centers, cluster_std=0.9)

# perform the mini-batch K-means
mbk = MiniBatchKMeans(init='k-means++', n_clusters=n_clusters,
                      batch_size=batch_size, n_init=10,
                      max_no_improvement=10, verbose=0)
mbk.fit(X)

# use the fitted centers as they are: np.sort(..., axis=0) would sort each
# coordinate column independently and mix coordinates of different centers
mbk_means_cluster_centers = mbk.cluster_centers_
# assign each sample to its nearest center
mbk_means_labels = pairwise_distances_argmin(X, mbk_means_cluster_centers)

# print the label of each sample
print(mbk_means_labels)
```
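The example above passes the whole array `X` to `fit`. Since the motivation for the mini-batch variant is data that does not fit comfortably in memory, here is a hedged sketch of feeding the data chunk by chunk through `MiniBatchKMeans.partial_fit`. The chunking with `np.array_split`, the variable names `mbk_stream` and `stream_labels`, and the fixed `random_state` are assumptions for illustration; in practice each chunk would come from disk or a stream.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.datasets import make_blobs

centers = [[1, 1], [-2, -1], [1, -2], [1, 9]]
X, _ = make_blobs(n_samples=3000, centers=centers, cluster_std=0.9,
                  random_state=0)

mbk_stream = MiniBatchKMeans(n_clusters=4, random_state=0)

# feed the data in 10 chunks of 300 rows, as if each chunk were read from disk
for chunk in np.array_split(X, 10):
    mbk_stream.partial_fit(chunk)

# label the samples with the centers learned incrementally
stream_labels = mbk_stream.predict(X)
print(mbk_stream.cluster_centers_.shape)  # (4, 2)
```

Each call to `partial_fit` performs one mini-batch update on the chunk it receives, so only one chunk has to be in memory at a time.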

https://upcommons.upc.edu/bitstream/handle/2117/23414/R13-8.pdf

https://scikit-learn.org/stable/modules/generated/sklearn.cluster.MiniBatchKMeans.html