I am reading a file from a URL as follows:
import pandas as pd

url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
names = ['sepal length', 'sepal width', 'petal length', 'petal width', 'class']
data = pd.read_csv(url, names=names)
print(data.shape)
print(data)
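Now I want to read one column and do some processing on it (for example min, max, std dev, r score, etc.), then read another column and do some processing on that one.
Is there a way to do this in scikit-learn / pandas / Python?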
Answer 0: (score: 4)
You can use describe:
data.describe()
Output:
       sepal length  sepal width  petal length  petal width
count    150.000000   150.000000    150.000000   150.000000
mean       5.843333     3.054000      3.758667     1.198667
std        0.828066     0.433594      1.764420     0.763161
min        4.300000     2.000000      1.000000     0.100000
25%        5.100000     2.800000      1.600000     0.300000
50%        5.800000     3.000000      4.350000     1.300000
75%        6.400000     3.300000      5.100000     1.800000
max        7.900000     4.400000      6.900000     2.500000
Or, for a single column:
data['petal length'].describe()
Output:
count 150.000000
mean 3.758667
std 1.764420
min 1.000000
25% 1.600000
50% 4.350000
75% 5.100000
max 6.900000
Name: petal length, dtype: float64
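If you only need a few specific statistics rather than the full summary, the individual Series methods can be called directly on each column. A minimal sketch, assuming a small made-up DataFrame in place of the iris data (the values below are hypothetical, not taken from the dataset):

```python
import pandas as pd

# Hypothetical stand-in for the iris data; values are made up for illustration
data = pd.DataFrame({
    "petal length": [1.4, 1.3, 4.7, 4.5, 6.0],
    "petal width": [0.2, 0.2, 1.4, 1.5, 2.5],
})

# Process one column
col = data["petal length"]
col_min = col.min()
col_max = col.max()
col_std = col.std()  # pandas defaults to the sample standard deviation (ddof=1)
print(col_min, col_max, col_std)

# Then move on to another column
other = data["petal width"]
print(other.min(), other.max(), other.std())
```

Each call returns a plain scalar, so the results can be fed into whatever further processing you have in mind.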
Or you can use apply with a lambda to do custom processing column by column.
data.apply(lambda x: x.describe())
Output:
        sepal length  sepal width  petal length  petal width        class
25%         5.100000     2.800000      1.600000     0.300000          NaN
50%         5.800000     3.000000      4.350000     1.300000          NaN
75%         6.400000     3.300000      5.100000     1.800000          NaN
count     150.000000   150.000000    150.000000   150.000000          150
freq             NaN          NaN           NaN          NaN           50
max         7.900000     4.400000      6.900000     2.500000          NaN
mean        5.843333     3.054000      3.758667     1.198667          NaN
min         4.300000     2.000000      1.000000     0.100000          NaN
std         0.828066     0.433594      1.764420     0.763161          NaN
top              NaN          NaN           NaN          NaN  Iris-setosa
unique           NaN          NaN           NaN          NaN            3
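The lambda does not have to call describe; it can be any function that takes a column (a Series) and returns a value. As a rough sketch, here is a per-column range (max minus min), restricted to numeric columns so the string column is skipped. The small DataFrame is a hypothetical stand-in for the iris data, and value_range is just an illustrative name:

```python
import pandas as pd

# Hypothetical stand-in for the iris data; values are made up for illustration
data = pd.DataFrame({
    "sepal length": [5.1, 4.9, 7.0, 6.4],
    "sepal width": [3.5, 3.0, 3.2, 3.2],
    "class": ["Iris-setosa", "Iris-setosa", "Iris-versicolor", "Iris-versicolor"],
})

# Keep only numeric columns so arithmetic is not applied to the 'class' strings
numeric = data.select_dtypes(include="number")

# apply passes each column to the lambda as a Series
value_range = numeric.apply(lambda x: x.max() - x.min())
print(value_range)
```

The result is a Series indexed by column name, with one value per numeric column.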
Answer 1: (score: 0)
Some dummy data:
data = pd.DataFrame({'sepal length' : np.random.randn(3), 'sepal width' : np.random.randn(3)})
If you want to run a custom calculation on every column, one at a time, you can loop over the column names:
>>> for col in data.columns:
...     print(col)
...     print(np.mean(data[col]))
[out]: sepal length
-1.06206436799
sepal width
-0.586939385059
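This is the kind of output you get once the data is loaded into a pandas DataFrame. You can also put your own custom operations on each column inside the loop.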
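To keep the per-column results instead of just printing them, one option is to collect them in a dict inside the loop. This is only a sketch under my own assumptions: the seeded random data, the results dict and the summary frame are illustrative names, not part of the original question.

```python
import numpy as np
import pandas as pd

# Dummy data, seeded so that repeated runs give the same numbers
rng = np.random.default_rng(0)
data = pd.DataFrame({
    "sepal length": rng.standard_normal(3),
    "sepal width": rng.standard_normal(3),
})

results = {}
for col in data.columns:
    values = data[col]
    # Any custom per-column processing can go here
    results[col] = {
        "min": values.min(),
        "max": values.max(),
        "std": values.std(),
    }

# Outer keys become columns, inner keys become the row index
summary = pd.DataFrame(results)
print(summary)
```

Building a DataFrame from the dict at the end gives a table shaped much like the describe output, but containing only the statistics you chose.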