How do I read columns one by one in Python pandas?

Time: 2017-07-04 04:30:38

Tags: python pandas scikit-learn anaconda

I am reading a file from a URL as follows:

    import pandas as pd

    url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"

    names = ['sepal length', 'sepal width', 'petal length', 'petal width',  'class']

    data = pd.read_csv(url, names=names)

    print(data.shape)

    print(data)

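Now I want to read one column and do some processing on it (for example min, max, standard deviation, r score, etc.), then read another column and do some processing on that one.

Is there a way to do this in scikit-learn / pandas / Python?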

2 Answers:

Answer 0: (score: 4)

You can use describe:

data.describe()

Output:

       sepal length  sepal width  petal length  petal width
count    150.000000   150.000000    150.000000   150.000000
mean       5.843333     3.054000      3.758667     1.198667
std        0.828066     0.433594      1.764420     0.763161
min        4.300000     2.000000      1.000000     0.100000
25%        5.100000     2.800000      1.600000     0.300000
50%        5.800000     3.000000      4.350000     1.300000
75%        6.400000     3.300000      5.100000     1.800000
max        7.900000     4.400000      6.900000     2.500000

Or for a single column:

data['petal length'].describe()

Output:

count    150.000000
mean       3.758667
std        1.764420
min        1.000000
25%        1.600000
50%        4.350000
75%        5.100000
max        6.900000
Name: petal length, dtype: float64
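
If only a few statistics are needed rather than the full describe summary, the individual Series methods can be called on the selected column. The following is a minimal sketch; the small DataFrame is a made-up stand-in for the iris data (an assumption, so that it runs without a network connection), and the variable names are illustrative only.

```python
import pandas as pd

# Hypothetical stand-in for the iris data; the values are invented for illustration.
data = pd.DataFrame({
    "sepal length": [5.1, 4.9, 6.3, 5.8],
    "petal length": [1.4, 1.4, 6.0, 5.1],
    "class": ["Iris-setosa", "Iris-setosa", "Iris-virginica", "Iris-virginica"],
})

# Selecting a single column by name returns a pandas Series.
col = data["petal length"]

col_min = col.min()
col_max = col.max()
col_std = col.std()  # sample standard deviation (ddof=1), the pandas default

print(col_min, col_max, col_std)
```

Note that Series.std uses ddof=1 by default, whereas np.std uses ddof=0, so the two can give slightly different numbers on the same column.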

Or you can use apply with a lambda to run custom processing column by column.

data.apply(lambda x: x.describe())

Output:

        sepal length  sepal width  petal length  petal width        class
25%         5.100000     2.800000      1.600000     0.300000          NaN
50%         5.800000     3.000000      4.350000     1.300000          NaN
75%         6.400000     3.300000      5.100000     1.800000          NaN
count     150.000000   150.000000    150.000000   150.000000          150
freq             NaN          NaN           NaN          NaN           50
max         7.900000     4.400000      6.900000     2.500000          NaN
mean        5.843333     3.054000      3.758667     1.198667          NaN
min         4.300000     2.000000      1.000000     0.100000          NaN
std         0.828066     0.433594      1.764420     0.763161          NaN
top              NaN          NaN           NaN          NaN  Iris-setosa
unique           NaN          NaN           NaN          NaN            3
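
As the NaN entries above suggest, numeric statistics do not apply to the text column class. One possible way to handle this, sketched here under the assumption that only numeric columns are of interest, is to filter with select_dtypes before calling apply or agg. The sample DataFrame and the value_range helper are hypothetical and not part of the original question.

```python
import pandas as pd

# Hypothetical stand-in for the iris data; the values are invented for illustration.
data = pd.DataFrame({
    "sepal length": [5.1, 4.9, 6.3, 5.8],
    "petal length": [1.4, 1.4, 6.0, 5.1],
    "class": ["Iris-setosa", "Iris-setosa", "Iris-virginica", "Iris-virginica"],
})


def value_range(column):
    """Hypothetical custom statistic: spread between the largest and smallest value."""
    return column.max() - column.min()


# Keep only the numeric columns so the text column 'class' is skipped.
numeric = data.select_dtypes(include="number")

# apply calls the function once per column and returns a Series indexed by column name.
ranges = numeric.apply(value_range)

# agg with a list of names returns a DataFrame with one row per statistic.
summary = numeric.agg(["min", "max", "std"])

print(ranges)
print(summary)
```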

Answer 1: (score: 0)

Some dummy data (assuming numpy is imported as np and pandas as pd):

data = pd.DataFrame({'sepal length' : np.random.randn(3), 'sepal width' : np.random.randn(3)})

If you want to run a custom computation on every column, one at a time, you can loop over the column names:

>>> for col in data.columns:
...     print(col)
...     print(np.mean(data[col]))
...
sepal length
-1.06206436799
sepal width
-0.586939385059

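Once your data is loaded into a pandas DataFrame, this is the kind of output you will get (the exact numbers differ on each run because the dummy data is random). You can also put your own custom operations on each column inside the loop.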
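
A variation on the loop above, offered only as a sketch: DataFrame.items() yields each column name together with its Series, and is_numeric_dtype can be used to skip text columns such as class. The sample data and the results dictionary are assumptions made for illustration.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Hypothetical stand-in for the iris data; the values are invented for illustration.
data = pd.DataFrame({
    "sepal length": [5.1, 4.9, 6.3, 5.8],
    "petal length": [1.4, 1.4, 6.0, 5.1],
    "class": ["Iris-setosa", "Iris-setosa", "Iris-virginica", "Iris-virginica"],
})

results = {}
for name, column in data.items():  # yields (column name, Series) pairs
    if not is_numeric_dtype(column):
        continue  # skip non-numeric columns such as 'class'
    # Any custom per-column processing can go here.
    results[name] = {
        "min": column.min(),
        "max": column.max(),
        "std": column.std(),
    }

for name, stats in results.items():
    print(name, stats)
```

Collecting the results in a dictionary keeps them available after the loop, for example to build a summary table with pd.DataFrame(results).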