我有一个.csv
文件是在nd.array
处理输入数据后从sklearn.cluster.DBSCAN()
创建的。我希望能够“绑定”群集中的每个点到我的输入文件中的列给出的唯一标识符。
这就是我正在阅读input_data
:
# Generate sample data
col_1 ="RL15_LONGITUDE"
col_2 ="RL15_LATITUDE"
data = pd.read_csv("2004_Charley_data.csv")
coords = data.as_matrix(columns=[col_1, col_2])
data = data[[col_1,col_2]].dropna()
data = data.as_matrix().astype('float16',copy=False)
这就是它的样子:
RecordID Storm RL15_LATITUDE RL15_LONGITUDE
2004_Charley95104-257448 2004_Charley 25.81774 -80.25079
2004_Charley93724-254950 2004_Charley 26.116338 -81.74986
2004_Charley93724-254949 2004_Charley 26.116338 -81.74986
2004_Charley75496-215198 2004_Charley 26.11817 -81.75756
在一些帮助下,我能够获取DBSCAN
的输出并将其保存为.CSV
这样的文件:
clusters = (pd.concat([pd.DataFrame(c, columns=[col_2,col_1]).assign(cluster=i)
for i,c in enumerate(clusters)])
.reset_index()
.rename(columns={'index':'point'})
.set_index(['cluster','point'])
)
clusters.to_csv('output.csv')
我的输出现在是多索引,但我想知道是否有一种方法可以将列点更改为RecordID
而不仅仅是一个数字? :
cluster point RL15_LATITUDE RL15_LONGITUDE
0 0 -81.0625 29.234375
0 1 -81.0625 29.171875
0 2 -81.0625 29.359375
1 0 -81.0625 29.25
1 1 -81.0625 29.21875
1 2 -81.0625 29.25
1 3 -81.0625 29.21875
答案 0 :(得分:1)
<强>代码:强>
import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN
from sklearn import metrics
from sklearn.datasets.samples_generator import make_blobs
from sklearn.preprocessing import StandardScaler
fn = r'D:\temp\.data\2004_Charley_data.csv'
df = pd.read_csv(fn)
cols = ['RL15_LONGITUDE','RL15_LATITUDE']
eps_=4
min_samples_=13
db = DBSCAN(eps=eps_/6371., min_samples=min_samples_, algorithm='ball_tree', metric='haversine').fit(np.radians(df[cols]))
core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
core_samples_mask[db.core_sample_indices_] = True
labels = db.labels_
df['cluster'] = labels
res = df[df.cluster >= 0]
print('--------------------------------')
print(res)
print('--------------------------------')
print(res.cluster.value_counts())
输出:
--------------------------------
RecordID Storm RL15_LATITUDE RL15_LONGITUDE cluster
5 2004_Charley73944-211787 2004_Charley 29.228560 -81.034440 0
13 2004_Charley72308-208134 2004_Charley 29.442692 -81.109528 0
18 2004_Charley68044-198941 2004_Charley 29.442692 -81.109528 0
19 2004_Charley67753-198272 2004_Charley 29.270940 -81.097300 0
22 2004_Charley64829-191531 2004_Charley 29.313223 -81.101620 0
.. ... ... ... ... ...
812 2004_Charley94314-256039 2004_Charley 28.287827 -81.353285 1
813 2004_Charley93913-255344 2004_Charley 26.532980 -82.194400 7
814 2004_Charley93913-255346 2004_Charley 27.210467 -81.863720 5
815 2004_Charley93913-255357 2004_Charley 26.935550 -82.054447 4
816 2004_Charley93913-255354 2004_Charley 26.935550 -82.054447 4
[688 rows x 5 columns]
--------------------------------
1 217
0 170
2 145
4 94
7 18
6 16
5 14
3 14
Name: cluster, dtype: int64
旧回答:
如果我理解你的代码你可以这样做:
# read CSV (you have provided space-delimited file and with one unnamed column, so i have converted it to somewhat similar to that from your question)
fn = r'D:\temp\.data\2004_Charley_data.csv'
df = pd.read_csv(fn, sep='\s+', index_col=0)
df.index = df.index.values + df.RecordID.map(str)
del df['RecordID']
前10行:
In [148]: df.head(10)
Out[148]:
Storm RL15_LATITUDE RL15_LONGITUDE
RecordID
2004_Charley67146-196725 2004_Charley 33.807550 -78.701172
2004_Charley73944-211790 2004_Charley 33.618435 -78.993407
2004_Charley73944-211793 2004_Charley 28.609200 -80.818880
2004_Charley73944-211789 2004_Charley 29.383210 -81.160100
2004_Charley73944-211786 2004_Charley 33.691235 -78.895129
2004_Charley73944-211787 2004_Charley 29.228560 -81.034440
2004_Charley73944-211795 2004_Charley 28.357253 -80.701632
2004_Charley73944-211792 2004_Charley 34.204490 -77.924700
2004_Charley66636-195501 2004_Charley 33.436717 -79.132074
2004_Charley66631-195496 2004_Charley 33.646292 -78.977968
聚类:
cols = ['RL15_LONGITUDE','RL15_LATITUDE']
eps_=4
min_samples_=13
db = DBSCAN(eps=eps_/6371., min_samples=min_samples_, algorithm='ball_tree', metric='haversine').fit(np.radians(df[cols]))
core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
core_samples_mask[db.core_sample_indices_] = True
labels = db.labels_
将群集信息设置为我们的DF - 我们可以简单地将其指定为labels
,其长度与我们的DF相同:
df['cluster'] = labels
过滤器:仅保留cluster >= 0
:
res = df[df.cluster >= 0]
结果:
In [152]: res.head(10)
Out[152]:
Storm RL15_LATITUDE RL15_LONGITUDE cluster
RecordID
2004_Charley73944-211787 2004_Charley 29.228560 -81.034440 0
2004_Charley72308-208134 2004_Charley 29.442692 -81.109528 0
2004_Charley68044-198941 2004_Charley 29.442692 -81.109528 0
2004_Charley67753-198272 2004_Charley 29.270940 -81.097300 0
2004_Charley64829-191531 2004_Charley 29.313223 -81.101620 0
2004_Charley67376-197429 2004_Charley 29.196990 -80.993800 0
2004_Charley73720-211013 2004_Charley 29.171450 -81.037170 0
2004_Charley73705-210991 2004_Charley 28.308746 -81.424273 1
2004_Charley65157-192371 2004_Charley 28.308746 -81.424273 1
2004_Charley65126-192326 2004_Charley 28.308746 -81.424273 1
统计:
In [151]: res.cluster.value_counts()
Out[151]:
1 217
0 170
2 145
4 94
7 18
6 16
5 14
3 14
Name: cluster, dtype: int64
如果您不希望RecordID
作为索引:
In [153]: res = res.reset_index()
In [154]: res.head(10)
Out[154]:
RecordID Storm RL15_LATITUDE RL15_LONGITUDE cluster
0 2004_Charley73944-211787 2004_Charley 29.228560 -81.034440 0
1 2004_Charley72308-208134 2004_Charley 29.442692 -81.109528 0
2 2004_Charley68044-198941 2004_Charley 29.442692 -81.109528 0
3 2004_Charley67753-198272 2004_Charley 29.270940 -81.097300 0
4 2004_Charley64829-191531 2004_Charley 29.313223 -81.101620 0
5 2004_Charley67376-197429 2004_Charley 29.196990 -80.993800 0
6 2004_Charley73720-211013 2004_Charley 29.171450 -81.037170 0
7 2004_Charley73705-210991 2004_Charley 28.308746 -81.424273 1
8 2004_Charley65157-192371 2004_Charley 28.308746 -81.424273 1
9 2004_Charley65126-192326 2004_Charley 28.308746 -81.424273 1