Making it simpler: iterating over multiple files with pandas

Time: 2019-01-02 11:36:48

Tags: python-3.x pandas loops dataframe

import pandas as pd
import glob
import csv
files=glob.glob('*.csv')
for file in files:

    df=pd.read_csv(file, header= None)
    output_file_name = "output_" + file
    with open(output_file_name, 'w') as f:
        f.write("sum of the 1. column is " + str(df.iloc[:, 0].sum())+"\n")
        f.write("sum of the 2. column is " + str(df.iloc[:, 1].sum())+"\n")
        f.write("sum of the 3. column is " + str(df.iloc[:, 2].sum())+"\n")
        f.write("sum of the 4. column is " + str(df.iloc[:, 3].sum())+"\n")

        f.write("max of the 1. column is " + str(df.iloc[:, 0].max()) + "\n")
        f.write("max of the 2. column is " + str(df.iloc[:, 1].max()) + "\n")
        f.write("max of the 3. column is " + str(df.iloc[:, 2].max()) + "\n")
        f.write("max of the 4. column is " + str(df.iloc[:, 3].max()) + "\n")

    f.close()

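How can I loop over my files with pandas so that I don't have to repeat all these lines? I want the same output files, containing the max and sum information. For each CSV file, I want a new file in the same folder that describes the max, sum, std, and so on. For example, the output file would be: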

sum of the 1. column is 21
sum of the 2. column is 23
sum of the 3. column is 33
sum of the 4. column is 30
max of the 1. column is 6
max of the 2. column is 6
max of the 3. column is 8
max of the 4. column is 9

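How can I make this simpler? :D Thanks

2 answers:

Answer 0 (score: 1)

Select the first 4 columns with iloc, apply the functions with agg, renumber the columns so they start at 1, reshape with stack, build a list of strings with a list comprehension, and finally write it to a file with Series.to_csv: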

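import glob
import pandas as pd

files = glob.glob('*.csv')
for file in files:
    df = pd.read_csv(file, header=None)
    # Aggregate the first 4 columns; rows are the statistics, columns the CSV columns
    df1 = df.iloc[:, :4].agg(['sum', 'max', 'std'])
    # Number the columns from 1 to match the wording of the output
    df1.columns = range(1, len(df1.columns) + 1)
    # stack() yields a Series indexed by (statistic, column number)
    s = df1.stack()
    L = ['{} of the {}. column is {}'.format(a, b, c) for (a, b), c in s.items()]

    output_file_name = "output_" + file
    # header=False keeps the Series' default column label out of the file
    pd.Series(L).to_csv(output_file_name, index=False, header=False)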
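
To see what the agg and stack steps produce without touching any files, the same idea can be sketched on a small in-memory DataFrame. The helper name summary_lines and the sample values are hypothetical and only serve as an illustration; they are not part of the original answer.

```python
import pandas as pd


def summary_lines(df, funcs=('sum', 'max')):
    """Return one 'func of the N. column is value' line per statistic and column.

    Hypothetical helper: it wraps the agg/stack steps so they can be reused.
    """
    stats = df.agg(list(funcs))                        # rows: statistics, columns: data columns
    stats.columns = range(1, len(stats.columns) + 1)   # 1-based column numbers
    return ['{} of the {}. column is {}'.format(func, col, val)
            for (func, col), val in stats.stack().items()]


# Made-up data standing in for one CSV file read with header=None
df = pd.DataFrame([[1, 2, 3, 4],
                   [5, 6, 7, 8],
                   [9, 10, 11, 12]])

lines = summary_lines(df)
print('\n'.join(lines))
```

Passing a different tuple of function names, for example ('sum', 'max', 'std'), adds further statistics without any extra write statements.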

Answer 1 (score: 1)

You can use a double for loop to iterate over all the functions and columns:

for funcname in ['sum', 'max', 'std']:
    for i in range(len(df.columns)):
        # Look up the method by name and call it on column i
        f.write("{} of the {}. column is {}\n"
                .format(funcname, i+1, getattr(df.iloc[:, i], funcname)()))

getattr(df, 'sum') is equivalent to df.sum
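
A minimal sketch of that equivalence, using a made-up Series as a stand-in for one column of the DataFrame:

```python
import pandas as pd

# Hypothetical column; the values are chosen only for illustration
col = pd.Series([1, 5, 9])

# Looking a method up by name and calling it gives the same result
# as calling the method directly.
assert getattr(col, 'sum')() == col.sum()
assert getattr(col, 'max')() == col.max()

# This is what lets a loop over function names replace repeated lines.
results = {name: getattr(col, name)() for name in ['sum', 'max', 'std']}
print(results)
```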


import pandas as pd
import glob

files = glob.glob('*.csv')
for file in files:

    df = pd.read_csv(file, header=None)
    output_file_name = "output_" + file
    with open(output_file_name, 'w') as f:
        # f.write("{}\n".format(df.describe()))
        for funcname in ['sum', 'max', 'std']:
            for i in range(len(df.columns)):
                # Look up the method by name and call it on column i
                f.write("{} of the {}. column is {}\n"
                        .format(funcname, i+1, getattr(df.iloc[:, i], funcname)()))

Note that df.describe() displays summary statistics in a concise format. You may want to consider simply printing df.describe():

In [26]: df = pd.DataFrame(np.random.random((10,6)))

In [27]: df
Out[27]: 
          0         1         2         3         4         5
0  0.791727  0.397873  0.924195  0.202464  0.789961  0.077095
1  0.920516  0.637618  0.383694  0.623393  0.328440  0.606576
2  0.844562  0.231242  0.183842  0.902065  0.286643  0.743508
3  0.411101  0.370284  0.249545  0.955745  0.561450  0.597586
4  0.185035  0.989508  0.522821  0.218888  0.569865  0.773848
5  0.196904  0.377201  0.816561  0.914657  0.482806  0.686805
6  0.809536  0.480733  0.397394  0.152101  0.645284  0.921204
7  0.004433  0.168943  0.865408  0.472513  0.188554  0.012219
8  0.534432  0.739246  0.628112  0.789579  0.268880  0.835339
9  0.701573  0.580974  0.858254  0.461687  0.493617  0.285601

In [28]: df.describe()
Out[28]: 
               0          1          2          3          4          5
count  10.000000  10.000000  10.000000  10.000000  10.000000  10.000000
mean    0.539982   0.497362   0.582983   0.569309   0.461550   0.553978
std     0.324357   0.246491   0.274233   0.313254   0.189960   0.318598
min     0.004433   0.168943   0.183842   0.152101   0.188554   0.012219
25%     0.250453   0.372014   0.387119   0.279588   0.297092   0.363598
50%     0.618003   0.439303   0.575466   0.547953   0.488212   0.646691
75%     0.805084   0.623457   0.847830   0.873943   0.567761   0.766263
max     0.920516   0.989508   0.924195   0.955745   0.789961   0.921204
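
If the describe() table is all that is needed, it can be written out as text in a single call. The sketch below assumes DataFrame.to_string is an acceptable output format and uses an in-memory buffer in place of the output file; the sample data is made up.

```python
import io
import pandas as pd

# Hypothetical data standing in for one CSV file read with header=None
df = pd.DataFrame([[1, 2], [3, 4], [5, 6]])

summary = df.describe()      # count, mean, std, min, quartiles, max per column

# io.StringIO stands in for open(output_file_name, 'w')
buffer = io.StringIO()
buffer.write(summary.to_string() + "\n")

text = buffer.getvalue()
print(text)
```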