I have a dataframe of latitude and longitude pairs.
Here is what my dataframe looks like.
        order_lat  order_long
    0   19.111841   72.910729
    1   19.111342   72.908387
    2   19.111342   72.908387
    3   19.137815   72.914085
    4   19.119677   72.905081
    5   19.119677   72.905081
    6   19.119677   72.905081
    7   19.120217   72.907121
    8   19.120217   72.907121
    9   19.119677   72.905081
    10  19.119677   72.905081
    11  19.119677   72.905081
    12  19.111860   72.911346
    13  19.111860   72.911346
    14  19.119677   72.905081
    15  19.119677   72.905081
    16  19.119677   72.905081
    17  19.137815   72.914085
    18  19.115380   72.909144
    19  19.115380   72.909144
    20  19.116168   72.909573
    21  19.119677   72.905081
    22  19.137815   72.914085
    23  19.137815   72.914085
    24  19.112955   72.910102
    25  19.112955   72.910102
    26  19.112955   72.910102
    27  19.119677   72.905081
    28  19.119677   72.905081
    29  19.115380   72.909144
    30  19.119677   72.905081
    31  19.119677   72.905081
    32  19.119677   72.905081
    33  19.119677   72.905081
    34  19.119677   72.905081
    35  19.111860   72.911346
    36  19.111841   72.910729
    37  19.131674   72.918510
    38  19.119677   72.905081
    39  19.111860   72.911346
    40  19.111860   72.911346
    41  19.111841   72.910729
    42  19.111841   72.910729
    43  19.111841   72.910729
    44  19.115380   72.909144
    45  19.116625   72.909185
    46  19.115671   72.908985
    47  19.119677   72.905081
    48  19.119677   72.905081
    49  19.119677   72.905081
    50  19.116183   72.909646
    51  19.113827   72.893833
    52  19.119677   72.905081
    53  19.114100   72.894985
    54  19.107491   72.901760
    55  19.119677   72.905081
I want to cluster the points that are nearest to each other (within 200 meters of one another). Here is my distance matrix.
    from scipy.spatial.distance import pdist, squareform

    distance_matrix = squareform(pdist(X, (lambda u,v: haversine(u,v))))

    array([[ 0.        ,  0.2522482 ,  0.2522482 , ...,  1.67313071,  1.05925366,  1.05420922],
           [ 0.2522482 ,  0.        ,  0.        , ...,  1.44111548,  0.81742536,  0.98978355],
           [ 0.2522482 ,  0.        ,  0.        , ...,  1.44111548,  0.81742536,  0.98978355],
           ...,
           [ 1.67313071,  1.44111548,  1.44111548, ...,  0.        ,  1.02310118,  1.22871515],
           [ 1.05925366,  0.81742536,  0.81742536, ...,  1.02310118,  0.        ,  1.39923529],
           [ 1.05420922,  0.98978355,  0.98978355, ...,  1.22871515,  1.39923529,  0.        ]])
Then I apply the DBSCAN clustering algorithm to the distance matrix.
    from sklearn.cluster import DBSCAN

    db = DBSCAN(eps=2,min_samples=5)
    y_db = db.fit_predict(distance_matrix)
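I don't know how to choose the eps and min_samples values. It puts points that are much too far apart (about 2 km) into the same cluster. Is that because it calculates Euclidean distance while clustering? Please help.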
You can cluster spatial latitude-longitude data with scikit-learn's DBSCAN without precomputing a distance matrix.
    db = DBSCAN(eps=2/6371., min_samples=5, algorithm='ball_tree',
                metric='haversine').fit(np.radians(coordinates))
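This comes from this tutorial on clustering spatial data with scikit-learn DBSCAN. In particular, notice that the eps value is still 2 km, but it is divided by 6371 (the Earth's radius in km) to convert it to radians. Also notice that .fit() takes the coordinates in radians, which is what the haversine metric expects.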
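As a rough sketch of the same call applied to the 200 m threshold from the question, the snippet below uses six coordinate pairs copied from the question's dataframe. The names `EARTH_RADIUS_KM` and `eps_km` and the choice `min_samples=2` are illustrative assumptions, not part of the tutorial.

```python
import numpy as np
from sklearn.cluster import DBSCAN

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, same constant as in the line above

# A few [lat, lon] pairs in decimal degrees, copied from the question's data.
coordinates = np.array([
    [19.111841, 72.910729],
    [19.111860, 72.911346],
    [19.111342, 72.908387],
    [19.119677, 72.905081],
    [19.120217, 72.907121],
    [19.137815, 72.914085],
])

eps_km = 0.2  # assumed threshold: 200 m, as asked for in the question

db = DBSCAN(
    eps=eps_km / EARTH_RADIUS_KM,  # kilometres -> radians
    min_samples=2,                 # illustrative value only
    algorithm="ball_tree",
    metric="haversine",
).fit(np.radians(coordinates))     # haversine expects [lat, lon] in radians

print(db.labels_)  # -1 marks points that were not assigned to any cluster
```

Only the first two points lie within 200 m of each other (roughly 65 m apart), so they should end up sharing a label while the rest are marked as noise.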
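DBSCAN is meant to be used on the raw data, with a spatial index for acceleration. The only tool I know of that accelerates geographic distances is ELKI (Java); unfortunately, scikit-learn only supports this for a few distances such as Euclidean distance (see sklearn.neighbors.NearestNeighbors). But apparently you can afford to precompute the pairwise distances, so this is not an issue.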
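However, you did not read the documentation carefully enough, and your assumption that DBSCAN treats its input as a distance matrix is wrong: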
    from sklearn.cluster import DBSCAN

    db = DBSCAN(eps=2,min_samples=5)
    db.fit_predict(distance_matrix)
This computes Euclidean distance between the rows of the distance matrix, which obviously does not make any sense.
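To make the problem concrete, here is a small NumPy sketch with a made-up 3x3 distance matrix (the values are hypothetical, not taken from the question). It computes what a Euclidean metric sees when each row of the matrix is treated as a feature vector.

```python
import numpy as np

# Hypothetical pairwise distances (km) between three points A, B and C.
distance_matrix = np.array([
    [0.0, 0.1, 3.0],
    [0.1, 0.0, 3.0],
    [3.0, 3.0, 0.0],
])

# With the default metric, each row is treated as a 3-dimensional feature
# vector, so the distances actually compared are Euclidean distances
# between rows, not the entries of the matrix.
rows = distance_matrix
row_distances = np.linalg.norm(rows[:, None, :] - rows[None, :, :], axis=-1)

print(np.round(row_distances, 3))
```

In this toy case the true A-B distance of 0.1 km becomes about 0.141 and the A-C distance of 3 km becomes about 5.14, so any eps chosen in kilometres no longer means what it was intended to mean.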
See the documentation of DBSCAN (emphasis added):
class sklearn.cluster.DBSCAN(eps=0.5, min_samples=5, metric='euclidean', algorithm='auto', leaf_size=30, p=None, random_state=None)
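metric : string, or callable
The metric to use when calculating distance between instances in a feature array. If metric is a string or callable, it must be one of the options allowed by metrics.pairwise.calculate_distance for its metric parameter. If metric is "precomputed", X is assumed to be a distance matrix and must be square. X may be a sparse matrix, in which case only "nonzero" elements may be considered neighbors for DBSCAN.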
Similarly for fit_predict:
X : array or sparse (CSR) matrix of shape (n_samples, n_features), or array of shape (n_samples, n_samples)
A feature array, or array of distances between samples if metric='precomputed'.
In other words, you need to do
db = DBSCAN(eps=2, min_samples=5, metric="precomputed")
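A minimal end-to-end sketch of the precomputed route might look like the following. The helper `haversine_matrix_km` is a hypothetical vectorized stand-in for whatever haversine implementation the question used, and `eps=0.2` with `min_samples=2` are assumed values chosen to express a 200 m threshold in the same unit (km) as the matrix.

```python
import numpy as np
from sklearn.cluster import DBSCAN


def haversine_matrix_km(latlon_deg, radius_km=6371.0):
    """Pairwise great-circle distances in km for an (n, 2) array of [lat, lon] degrees."""
    lat, lon = np.radians(latlon_deg).T
    dlat = lat[:, None] - lat[None, :]
    dlon = lon[:, None] - lon[None, :]
    a = (np.sin(dlat / 2) ** 2
         + np.cos(lat)[:, None] * np.cos(lat)[None, :] * np.sin(dlon / 2) ** 2)
    # clip guards against tiny floating-point overshoot before the square root
    return 2 * radius_km * np.arcsin(np.sqrt(np.clip(a, 0.0, 1.0)))


# A few [lat, lon] pairs copied from the question's data.
points = np.array([
    [19.111841, 72.910729],
    [19.111860, 72.911346],
    [19.111342, 72.908387],
    [19.119677, 72.905081],
    [19.120217, 72.907121],
    [19.137815, 72.914085],
])

dist_km = haversine_matrix_km(points)

# metric="precomputed" tells DBSCAN that dist_km already holds distances,
# so eps is read in the matrix's own unit (km here).
labels = DBSCAN(eps=0.2, min_samples=2, metric="precomputed").fit_predict(dist_km)

print(np.round(dist_km[0], 3))
print(labels)
```

The distance between the first and third point comes out at about 0.252 km, in line with the 0.2522482 entry in the question's matrix.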
I don't know what implementation of haversine you're using, but it looks like it returns results in km, so for 200 m eps should be 0.2, not 2.
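As for the min_samples parameter, that depends on what output you expect. Here are a couple of examples. My outputs use an implementation of haversine based on this answer, which gives a distance matrix similar, but not identical, to yours.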
This is with db = DBSCAN(eps=0.2, min_samples=5)
[0 -1 -1 -1 1 1 1 -1 -1 1 1 1 2 2 1 1 1 -1 -1 -1 -1 1 -1 -1 -1 -1 -1 1 1 -1 1 1 1 1 1 2 0 -1 1 2 2 0 0 0 -1 -1 -1 1 1 1 -1 -1 1 -1 -1 1]
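This creates three clusters, 0, 1 and 2. Many of the samples don't fall into a cluster with at least 5 members and so are not assigned to any cluster (shown as -1).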
Trying again with a smaller min_samples value:
db = DBSCAN(eps=0.2, min_samples=2)
[0 1 1 2 3 3 3 4 4 3 3 3 5 5 3 3 3 2 6 6 7 3 2 2 8 8 8 3 3 6 3 3 3 3 3 5 0 -1 3 5 5 0 0 0 6 -1 -1 3 3 3 7 -1 3 -1 -1 3]
Here most of the samples are within 200 m of at least one other sample and so fall into one of nine clusters, 0 to 8.
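To compare the two runs without counting by eye, a short sketch like the one below tallies clusters and unassigned points. The two label arrays are copied from the outputs above; the helper name `summarize` is made up for this example.

```python
import numpy as np

# Labels printed above for eps=0.2, min_samples=5
labels_min5 = np.array([
    0, -1, -1, -1, 1, 1, 1, -1, -1, 1, 1, 1, 2, 2, 1, 1, 1, -1, -1, -1,
    -1, 1, -1, -1, -1, -1, -1, 1, 1, -1, 1, 1, 1, 1, 1, 2, 0, -1, 1, 2,
    2, 0, 0, 0, -1, -1, -1, 1, 1, 1, -1, -1, 1, -1, -1, 1,
])

# Labels printed above for eps=0.2, min_samples=2
labels_min2 = np.array([
    0, 1, 1, 2, 3, 3, 3, 4, 4, 3, 3, 3, 5, 5, 3, 3, 3, 2, 6, 6,
    7, 3, 2, 2, 8, 8, 8, 3, 3, 6, 3, 3, 3, 3, 3, 5, 0, -1, 3, 5,
    5, 0, 0, 0, 6, -1, -1, 3, 3, 3, 7, -1, 3, -1, -1, 3,
])


def summarize(labels):
    """Return (number of clusters, number of noise points) for DBSCAN labels."""
    labels = np.asarray(labels)
    n_clusters = len(set(labels.tolist()) - {-1})  # -1 is DBSCAN's noise label
    n_noise = int(np.sum(labels == -1))
    return n_clusters, n_noise


print("min_samples=5:", summarize(labels_min5))
print("min_samples=2:", summarize(labels_min2))
```

Lowering min_samples trades unassigned points for more, smaller clusters, so the right value depends on how small a group should still count as a cluster.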
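Edited to add
Looks like @Anony-Mousse is right, though I didn't see anything wrong in my results. For the sake of contributing something, here is the code I was using to view the clusters: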
    from math import radians, cos, sin, asin, sqrt

    from scipy.spatial.distance import pdist, squareform
    from sklearn.cluster import DBSCAN
    import matplotlib.pyplot as plt
    import pandas as pd


    def haversine(latlon1, latlon2):
        """
        Calculate the great circle distance between two points
        on the earth (specified in decimal degrees)
        """
        # each argument is a (lat, lon) pair
        lat1, lon1 = latlon1
        lat2, lon2 = latlon2
        # convert decimal degrees to radians
        lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])

        # haversine formula
        dlon = lon2 - lon1
        dlat = lat2 - lat1
        a = sin(dlat/2)**2 + cos(lat1) * cos(lat2) * sin(dlon/2)**2
        c = 2 * asin(sqrt(a))
        r = 6371  # Radius of earth in kilometers. Use 3956 for miles
        return c * r


    # the CSV is expected to have the columns 'lat' and 'lng' used below
    X = pd.read_csv('dbscan_test.csv')
    distance_matrix = squareform(pdist(X, (lambda u,v: haversine(u,v))))

    db = DBSCAN(eps=0.2, min_samples=2, metric='precomputed')  # using "precomputed" as recommended by @Anony-Mousse
    y_db = db.fit_predict(distance_matrix)

    X['cluster'] = y_db

    plt.scatter(X['lat'], X['lng'], c=X['cluster'])
    plt.show()