I am using the Cloudera 5.2 VM and pandas 0.18.0, and I want to apply k-means to my DataFrame, but it has a str column.
My DataFrame is:
adClicksPerTime.head(n=5)
Out[50]:
timestamp adCategory userId totalAdClicks
0 2016-05-26 15:00:00 automotive 355 1
1 2016-05-26 15:00:00 clothing 1027 1
2 2016-05-26 15:00:00 computers 1821 1
3 2016-05-26 15:00:00 computers 2139 1
4 2016-05-26 15:00:00 electronics 253 1
for col in adClicksPerTime:
    print(col)
    print(type(adClicksPerTime[col][1]))
timestamp
<class 'pandas.tslib.Timestamp'>
adCategory
<class 'str'>
userId
<class 'numpy.int64'>
totalAdClicks
<class 'numpy.int64'>
When I run k-means, I get this error:
ValueError: could not convert string to float: 'automotive'
I tried converting my strings to a categorical type so that I could then assign numeric codes:
adClicksPerTime.adCategory = pd.Categorical.from_array(adClicksPerTime.adCategory)
adClicksPerTime.head(n=5)
Out[54]:
timestamp adCategory userId totalAdClicks
0 2016-05-26 15:00:00 automotive 355 1
1 2016-05-26 15:00:00 clothing 1027 1
2 2016-05-26 15:00:00 computers 1821 1
3 2016-05-26 15:00:00 computers 2139 1
4 2016-05-26 15:00:00 electronics 253 1
for col in adClicksPerTime:
    print(col)
    print(type(adClicksPerTime[col][1]))
timestamp
<class 'pandas.tslib.Timestamp'>
adCategory
<class 'str'>
userId
<class 'numpy.int64'>
totalAdClicks
<class 'numpy.int64'>
How can I apply k-means to this str field?
Answer 0 (score: 1)
pd.get_dummies will convert the categories into dummy (indicator) variables.
dummies = pd.get_dummies(adClicksPerTime['adCategory'])
del dummies['automotive']  # drop one level to serve as the reference category
print(dummies.columns)
Then merge this DataFrame with the adClicksPerTime DataFrame, and finally apply KMeans.
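The merge-then-cluster step could look roughly like the sketch below. It assumes scikit-learn's KMeans is the k-means in use (the question does not say which implementation), rebuilds a small sample from the rows shown in the question, and leaves timestamp and userId out of the feature matrix. The names features and labels, and the settings n_clusters=2 and random_state=0, are arbitrary choices for illustration.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Small sample mirroring the five rows shown in the question.
adClicksPerTime = pd.DataFrame({
    "timestamp": pd.to_datetime(["2016-05-26 15:00:00"] * 5),
    "adCategory": ["automotive", "clothing", "computers",
                   "computers", "electronics"],
    "userId": [355, 1027, 1821, 2139, 253],
    "totalAdClicks": [1, 1, 1, 1, 1],
})

# One-hot encode the string column and drop one level, as in the answer.
dummies = pd.get_dummies(adClicksPerTime["adCategory"]).astype(float)
del dummies["automotive"]

# Combine the numeric column with the dummy columns (aligned on the index).
# Assumption: timestamp and userId are not used as clustering features.
features = pd.concat([adClicksPerTime[["totalAdClicks"]], dummies], axis=1)

# Assumption: 2 clusters, chosen only so the example runs on 5 rows.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(features)
print(labels)
```

Because every column in features is numeric, the "could not convert string to float" error no longer occurs. On real data it may be worth scaling the numeric columns so they do not dominate the 0/1 dummy columns.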
adClicksPerTime.info()
will give you the dtypes.
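This also explains why the type check in the question still printed str after the conversion: a categorical column keeps its original values as elements, and only the column dtype changes. A minimal sketch, using a hypothetical standalone Series s rather than the asker's DataFrame:

```python
import pandas as pd

# Hypothetical Series with the same values as the adCategory column.
s = pd.Series(["automotive", "clothing", "computers",
               "computers", "electronics"])
cat = s.astype("category")

print(cat.dtype)      # the column dtype is now categorical
print(type(cat[1]))   # individual elements are still Python str

# The integer codes live on the .cat accessor.
codes = cat.cat.codes
print(codes.tolist())
```

Note that the integer codes impose an arbitrary ordering on the categories, which distance-based methods such as k-means will treat as meaningful; that is one reason dummy variables are often preferred here.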