Question

我有一个.csv文件是在nd.array处理输入数据后从sklearn.cluster.DBSCAN()创建的。我希望能够“绑定”群集中的每个点到我的输入文件中的列给出的唯一标识符。

这就是我正在阅读input_data：

的方式

# Generate sample data
col_1 ="RL15_LONGITUDE"
col_2 ="RL15_LATITUDE"  
data = pd.read_csv("2004_Charley_data.csv")
coords = data.as_matrix(columns=[col_1, col_2])
data = data[[col_1,col_2]].dropna()
data = data.as_matrix().astype('float16',copy=False)

这就是它的样子：

RecordID                  Storm         RL15_LATITUDE   RL15_LONGITUDE
2004_Charley95104-257448  2004_Charley  25.81774        -80.25079
2004_Charley93724-254950  2004_Charley  26.116338       -81.74986
2004_Charley93724-254949  2004_Charley  26.116338       -81.74986
2004_Charley75496-215198  2004_Charley  26.11817        -81.75756

在一些帮助下，我能够获取DBSCAN的输出并将其保存为.CSV这样的文件：

clusters = (pd.concat([pd.DataFrame(c, columns=[col_2,col_1]).assign(cluster=i)
for i,c in enumerate(clusters)])
.reset_index()
.rename(columns={'index':'point'})
.set_index(['cluster','point'])
)
clusters.to_csv('output.csv')

我的输出现在是多索引，但我想知道是否有一种方法可以将列点更改为RecordID而不仅仅是一个数字？：

cluster point   RL15_LATITUDE   RL15_LONGITUDE
0       0   -81.0625    29.234375
0       1   -81.0625    29.171875
0       2   -81.0625    29.359375
1       0   -81.0625    29.25
1       1   -81.0625    29.21875
1       2   -81.0625    29.25
1       3   -81.0625    29.21875

Answer 1

UPDATE：

<强>代码：

import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN
from sklearn import metrics
from sklearn.datasets.samples_generator import make_blobs
from sklearn.preprocessing import StandardScaler

fn = r'D:\temp\.data\2004_Charley_data.csv'
df = pd.read_csv(fn)

cols = ['RL15_LONGITUDE','RL15_LATITUDE']
eps_=4
min_samples_=13

db = DBSCAN(eps=eps_/6371., min_samples=min_samples_, algorithm='ball_tree', metric='haversine').fit(np.radians(df[cols]))

core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
core_samples_mask[db.core_sample_indices_] = True
labels = db.labels_

df['cluster'] = labels

res = df[df.cluster >= 0]

print('--------------------------------')
print(res)
print('--------------------------------')
print(res.cluster.value_counts())

输出：

--------------------------------
                     RecordID         Storm  RL15_LATITUDE  RL15_LONGITUDE  cluster
5    2004_Charley73944-211787  2004_Charley      29.228560      -81.034440        0
13   2004_Charley72308-208134  2004_Charley      29.442692      -81.109528        0
18   2004_Charley68044-198941  2004_Charley      29.442692      -81.109528        0
19   2004_Charley67753-198272  2004_Charley      29.270940      -81.097300        0
22   2004_Charley64829-191531  2004_Charley      29.313223      -81.101620        0
..                        ...           ...            ...             ...      ...
812  2004_Charley94314-256039  2004_Charley      28.287827      -81.353285        1
813  2004_Charley93913-255344  2004_Charley      26.532980      -82.194400        7
814  2004_Charley93913-255346  2004_Charley      27.210467      -81.863720        5
815  2004_Charley93913-255357  2004_Charley      26.935550      -82.054447        4
816  2004_Charley93913-255354  2004_Charley      26.935550      -82.054447        4

[688 rows x 5 columns]
--------------------------------
1    217
0    170
2    145
4     94
7     18
6     16
5     14
3     14
Name: cluster, dtype: int64

旧回答：

如果我理解你的代码你可以这样做：

# read CSV (you have provided space-delimited file and with one unnamed column, so i have converted it to somewhat similar to that from your question)
fn = r'D:\temp\.data\2004_Charley_data.csv'

df = pd.read_csv(fn, sep='\s+', index_col=0)
df.index = df.index.values + df.RecordID.map(str)
del df['RecordID']

前10行：

In [148]: df.head(10)
Out[148]:
                                 Storm  RL15_LATITUDE  RL15_LONGITUDE
RecordID
2004_Charley67146-196725  2004_Charley      33.807550      -78.701172
2004_Charley73944-211790  2004_Charley      33.618435      -78.993407
2004_Charley73944-211793  2004_Charley      28.609200      -80.818880
2004_Charley73944-211789  2004_Charley      29.383210      -81.160100
2004_Charley73944-211786  2004_Charley      33.691235      -78.895129
2004_Charley73944-211787  2004_Charley      29.228560      -81.034440
2004_Charley73944-211795  2004_Charley      28.357253      -80.701632
2004_Charley73944-211792  2004_Charley      34.204490      -77.924700
2004_Charley66636-195501  2004_Charley      33.436717      -79.132074
2004_Charley66631-195496  2004_Charley      33.646292      -78.977968

聚类：

cols = ['RL15_LONGITUDE','RL15_LATITUDE']

eps_=4
min_samples_=13

db = DBSCAN(eps=eps_/6371., min_samples=min_samples_, algorithm='ball_tree', metric='haversine').fit(np.radians(df[cols]))

core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
core_samples_mask[db.core_sample_indices_] = True
labels = db.labels_

将群集信息设置为我们的DF - 我们可以简单地将其指定为labels，其长度与我们的DF相同：

df['cluster'] = labels

过滤器：仅保留cluster >= 0：

所在的行

res = df[df.cluster >= 0]

结果：

In [152]: res.head(10)
Out[152]:
                                 Storm  RL15_LATITUDE  RL15_LONGITUDE  cluster
RecordID
2004_Charley73944-211787  2004_Charley      29.228560      -81.034440        0
2004_Charley72308-208134  2004_Charley      29.442692      -81.109528        0
2004_Charley68044-198941  2004_Charley      29.442692      -81.109528        0
2004_Charley67753-198272  2004_Charley      29.270940      -81.097300        0
2004_Charley64829-191531  2004_Charley      29.313223      -81.101620        0
2004_Charley67376-197429  2004_Charley      29.196990      -80.993800        0
2004_Charley73720-211013  2004_Charley      29.171450      -81.037170        0
2004_Charley73705-210991  2004_Charley      28.308746      -81.424273        1
2004_Charley65157-192371  2004_Charley      28.308746      -81.424273        1
2004_Charley65126-192326  2004_Charley      28.308746      -81.424273        1

统计：

In [151]: res.cluster.value_counts()
Out[151]:
1    217
0    170
2    145
4     94
7     18
6     16
5     14
3     14
Name: cluster, dtype: int64

如果您不希望RecordID作为索引：

In [153]: res = res.reset_index()

In [154]: res.head(10)
Out[154]:
                   RecordID         Storm  RL15_LATITUDE  RL15_LONGITUDE  cluster
0  2004_Charley73944-211787  2004_Charley      29.228560      -81.034440        0
1  2004_Charley72308-208134  2004_Charley      29.442692      -81.109528        0
2  2004_Charley68044-198941  2004_Charley      29.442692      -81.109528        0
3  2004_Charley67753-198272  2004_Charley      29.270940      -81.097300        0
4  2004_Charley64829-191531  2004_Charley      29.313223      -81.101620        0
5  2004_Charley67376-197429  2004_Charley      29.196990      -80.993800        0
6  2004_Charley73720-211013  2004_Charley      29.171450      -81.037170        0
7  2004_Charley73705-210991  2004_Charley      28.308746      -81.424273        1
8  2004_Charley65157-192371  2004_Charley      28.308746      -81.424273        1
9  2004_Charley65126-192326  2004_Charley      28.308746      -81.424273        1

如何为DataFrame行

1 个答案:

UPDATE：