+------------+----+----------------------+
| V1         | n  | ip                   |
+------------+----+----------------------+
| 02-08-2017 | 2  | 00.121.187.120:3447  |
| 03-08-2017 | 5  | 01.110.186.182:23    |
| 30-07-2017 | 13 | 08.167.141.192:25    |
| 26-07-2017 | 19 | 1.175.4.214:33274    |
| 01-08-2017 | 72 | 10.174.218.134:59259 |
+------------+----+----------------------+
This is my CSV file. I am trying to apply clustering to it, but my column "V1" is stored as a string, so I cannot feed it to the model.
import pandas
import pylab as pl
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
import ast
variables = pandas.read_csv('D:\\Date\\date-dfki.csv',dtype=str)
Y = variables[['V1']]
X = variables[['n']]
Nc = range(1, 20)
kmeans = [KMeans(n_clusters=i) for i in Nc]
kmeans
score = [kmeans[i].fit(Y).score(Y) for i in range(len(kmeans))]
score
pl.plot(Nc,score)
pl.xlabel('Number of Clusters')
pl.ylabel('Score')
pl.title('Elbow Curve')
Could someone please tell me how to read this column? I cannot convert the string to float/int, so I cannot continue. This is the error I get:
array = np.array(array, dtype=dtype, order=order, copy=copy)
**ValueError: could not convert string to float: '27-07-2017'**
Answer 0 (score: -1)
Here is an example of what can be done.
import pandas as pd
import numpy as np
#create dataset sample
d = {'V1': ["02-08-2017" , "03-08-2017"], 'n': ["2", "5"],'ip': ["104.44.194.237:25", "106.42.34.86:49324 "] }
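df = pd.DataFrame(data=d) #no dtype here: the date and ip strings cannot be cast to an integer type
df.to_csv('date-dfki.csv', sep=',', index=False) #index=False keeps the row index out of the file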
#reading your file starts here:
parse_dates = ['V1'] #columns to parse as datetime; without this, pandas reads the dates as plain strings
variables = pd.read_csv('date-dfki.csv', dtype={'ip': str, 'n': np.int32}, parse_dates=parse_dates, dayfirst=True) #dtype sets the type of each remaining column; dayfirst=True because the dates are day-month-year
The variables dataset:
Next, you need to clean up the IP addresses so that they can be converted to int, float, or any other format you want (I use int):
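variables['ip'] = variables['ip'].str.replace('.', '', regex=False) #removes '.' (regex=False so '.' is treated literally, not as "any character")
variables['ip'] = variables['ip'].str.replace(':', '', regex=False) #removes ':'
variables['ip'] = variables['ip'].astype('int64') #convert to a 64-bit int; the digit strings are too long for 32 bits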
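The result is as follows:
So if you have several columns, you can apply the same process to each of them and convert it to whatever format you want.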
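One caveat with concatenating the digits is that different addresses can collapse to the same number (for example `1.23.4.5:6` and `12.3.4.5:6`). A minimal sketch of an alternative, assuming the column always has the form `address:port` with an IPv4 address, is to split it into two numeric columns. The names `ip_addr` and `ip_port` are made up for this illustration, and the sample rows are the ones from the answer above.

```python
import ipaddress
import pandas as pd

# Hypothetical sample in the same "address:port" shape as the ip column.
variables = pd.DataFrame({"ip": ["104.44.194.237:25", "106.42.34.86:49324 "]})

# Strip stray whitespace, then split once from the right into address and port.
parts = variables["ip"].str.strip().str.rsplit(":", n=1, expand=True)

# Convert the dotted address to its 32-bit integer value, and the port to int.
variables["ip_addr"] = parts[0].map(lambda s: int(ipaddress.IPv4Address(s)))
variables["ip_port"] = parts[1].astype("int64")

print(variables[["ip_addr", "ip_port"]])
```

Addresses written with leading zeros, such as `00.121.187.120` in the question's table, may be rejected by `ipaddress` in recent Python versions and would need to be normalized first.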
The ip column can likewise be converted to float:
variables['ip'] = variables['ip'].astype(float) #or float conversion
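The error in the question comes from passing the date strings in V1 directly to KMeans, so the same idea can be applied to that column. The sketch below is one possible approach, not the only one: it assumes the dates are day-month-year, and it chooses to represent each date as the number of days since the earliest date (the `day` column is a name invented here). The sample rows are copied from the question's table.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical sample built from the rows shown in the question.
variables = pd.DataFrame({
    "V1": ["02-08-2017", "03-08-2017", "30-07-2017", "26-07-2017", "01-08-2017"],
    "n": [2, 5, 13, 19, 72],
})

# Parse the day-month-year strings into datetimes with an explicit format.
variables["V1"] = pd.to_datetime(variables["V1"], format="%d-%m-%Y")

# Assumption: encode each date as whole days elapsed since the earliest date.
variables["day"] = (variables["V1"] - variables["V1"].min()).dt.days

# Every feature is numeric now, so KMeans can fit it.
X = variables[["day", "n"]]

# The number of clusters cannot exceed the number of rows.
Nc = range(1, len(X) + 1)
score = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).score(X) for k in Nc]
print(list(zip(Nc, score)))
```

The resulting `score` list can be plotted against `Nc` exactly as in the question to get the elbow curve. Since `day` and `n` are on different scales, standardizing the features before clustering may be worth considering.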