Question

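I am new to Python.

I have 8 CSV files, each with 26 columns and 600 rows. I want to take the last 4 columns of every file (columns 22 through 25), sum them across all the files, and replace those 4 columns in each file with the sum. For example (the data shown here is just random):

new1.csv: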

a   b   c   d   e   f   g   h   i   j   k
1   1   1   1   1   1   1   1   1   1   1
2   2   2   2   2   2   2   2   2   2   2
3   3   3   3   3   3   3   3   3   3   3
4   4   4   4   4   4   4   4   4   4   4
5   5   5   5   5   5   5   5   5   5   5
6   6   6   6   6   6   6   6   6   6   6
7   7   7   7   7   7   7   7   7   7   7
8   8   8   8   8   8   8   8   8   8   8
9   9   9   9   9   9   9   9   9   9   9

new2.csv:

a   b   c   d   e   f   g   h   i   j   k
11  11  11  11  11  11  11  11  11  11  11
12  12  12  12  12  12  12  12  12  12  12
13  13  13  13  13  13  13  13  13  13  13
14  14  14  14  14  14  14  14  14  14  14
15  15  15  15  15  15  15  15  15  15  15
16  16  16  16  16  16  16  16  16  16  16
17  17  17  17  17  17  17  17  17  17  17
18  18  18  18  18  18  18  18  18  18  18
19  19  19  19  19  19  19  19  19  19  19

Now I want to sum each element of columns "h, i, j, k" across these two files, then replace the last 4 columns of both files with the new sums.

Modified new1.csv:

a   b   c   d   e   f   g   h   i   j   k
1   1   1   1   1   1   1   12  12  12  12
2   2   2   2   2   2   2   14  14  14  14
3   3   3   3   3   3   3   16  16  16  16
4   4   4   4   4   4   4   18  18  18  18
5   5   5   5   5   5   5   20  20  20  20
6   6   6   6   6   6   6   22  22  22  22
7   7   7   7   7   7   7   24  24  24  24
8   8   8   8   8   8   8   26  26  26  26
9   9   9   9   9   9   9   28  28  28  28

Modified new2.csv:

a   b   c   d   e   f   g   h   i   j   k
11  11  11  11  11  11  11  12  12  12  12
12  12  12  12  12  12  12  14  14  14  14
13  13  13  13  13  13  13  16  16  16  16
14  14  14  14  14  14  14  18  18  18  18
15  15  15  15  15  15  15  20  20  20  20
16  16  16  16  16  16  16  22  22  22  22
17  17  17  17  17  17  17  24  24  24  24
18  18  18  18  18  18  18  26  26  26  26
19  19  19  19  19  19  19  28  28  28  28

I assume I should use pandas or numpy, but I don't know how to go about it. Any suggestions/hints would be appreciated.

Answer 1

You can do this using numpy alone.

import numpy as np

# list of all the files
file_list = ['foo.csv', 'bar.csv', 'baz.csv']  # all 8 files

# header row to write back: list all 26 column names here
# (skip this if the files have no header row)
col_names = ['a', 'b', 'c', 'd', 'e', 'f']

# accumulator for the sum of the last 4 columns (600 rows x 4 columns)
add_cols = np.zeros((600, 4))

# iterate over all .csv files
for file in file_list:
    # skiprows=1 skips the header row; usecols picks the last 4 columns (zero-based indices)
    temp = np.loadtxt(file, skiprows=1, delimiter=',', usecols=(22, 23, 24, 25))
    add_cols = np.add(temp, add_cols)

# overwrite every file, substituting the last 4 columns with the sum
for file in file_list:
    # load the full content of the file
    temp = np.loadtxt(file, skiprows=1, delimiter=',')
    temp[:, [22, 23, 24, 25]] = add_cols

    # write the column names first, then the values as CSV
    with open(file, 'w') as p:
        p.write(','.join(col_names) + '\n')
        np.savetxt(p, temp, delimiter=",", fmt="%i")

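If your files are space-delimited rather than comma-delimited, remove the delimiter option from all the function calls, since whitespace is the default delimiter. Join the header row with a space instead of a comma as well.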
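For the space-delimited case, here is a minimal sketch of the same idea. The function name `replace_last_cols_with_sum`, the `n_last` parameter, and the temporary demo files are my own assumptions for illustration, not part of the answer above. It reads each file's header line back from the file instead of hard-coding the column names, and it assumes every file has the same number of rows.

```python
import os
import tempfile

import numpy as np


def replace_last_cols_with_sum(paths, n_last=4):
    """Replace the last n_last columns of every file with their sum across files."""
    # Read each header line so it can be written back unchanged.
    headers = []
    for path in paths:
        with open(path) as fh:
            headers.append(fh.readline().rstrip("\n"))

    # No delimiter argument: loadtxt splits on any whitespace by default.
    # ndmin=2 keeps a 2-D shape even for a single data row.
    tables = [np.loadtxt(path, skiprows=1, ndmin=2) for path in paths]
    total = sum(table[:, -n_last:] for table in tables)

    for path, header, table in zip(paths, headers, tables):
        table[:, -n_last:] = total
        # savetxt separates values with a single space by default;
        # comments="" stops it from prefixing the header with "# ".
        np.savetxt(path, table, fmt="%i", header=header, comments="")
    return total


# Demo on two small space-delimited files shaped like the question's example.
tmp = tempfile.mkdtemp()
paths = [os.path.join(tmp, name) for name in ("new1.csv", "new2.csv")]
for path, start in zip(paths, (1, 11)):
    with open(path, "w") as fh:
        fh.write("a b c d e f g h i j k\n")
        for value in range(start, start + 9):
            fh.write(" ".join([str(value)] * 11) + "\n")

total = replace_last_cols_with_sum(paths)
with open(paths[0]) as fh:
    lines = fh.read().splitlines()
print(lines[1])  # 1 1 1 1 1 1 1 12 12 12 12
```

Note that `fmt="%i"` writes integers, so this sketch only suits integer data; for floats you would pick a different format string.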

Answer 2

After loading the CSVs with read_csv, you can add the last 4 columns together and then overwrite them:

In [10]:
total = df[df.columns[-4:]].values + df1[df1.columns[-4:]].values
total

Out[10]:
array([[12, 12, 12, 12],
       [14, 14, 14, 14],
       [16, 16, 16, 16],
       [18, 18, 18, 18],
       [20, 20, 20, 20],
       [22, 22, 22, 22],
       [24, 24, 24, 24],
       [26, 26, 26, 26],
       [28, 28, 28, 28]], dtype=int64)

In [12]:    
df[df.columns[-4:]] = total
df1[df1.columns[-4:]] = total
df

Out[12]:
   a  b  c  d  e  f  g   h   i   j   k
0  1  1  1  1  1  1  1  12  12  12  12
1  2  2  2  2  2  2  2  14  14  14  14
2  3  3  3  3  3  3  3  16  16  16  16
3  4  4  4  4  4  4  4  18  18  18  18
4  5  5  5  5  5  5  5  20  20  20  20
5  6  6  6  6  6  6  6  22  22  22  22
6  7  7  7  7  7  7  7  24  24  24  24
7  8  8  8  8  8  8  8  26  26  26  26
8  9  9  9  9  9  9  9  28  28  28  28

In [13]:    
df1

Out[13]:
    a   b   c   d   e   f   g   h   i   j   k
0  11  11  11  11  11  11  11  12  12  12  12
1  12  12  12  12  12  12  12  14  14  14  14
2  13  13  13  13  13  13  13  16  16  16  16
3  14  14  14  14  14  14  14  18  18  18  18
4  15  15  15  15  15  15  15  20  20  20  20
5  16  16  16  16  16  16  16  22  22  22  22
6  17  17  17  17  17  17  17  24  24  24  24
7  18  18  18  18  18  18  18  26  26  26  26
8  19  19  19  19  19  19  19  28  28  28  28

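We call the .values attribute here to get a NumPy array; otherwise pandas tries to align on the index and column labels, and any labels that don't match produce NaN.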
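To see why the alignment matters, here is a small sketch with two made-up frames (`left` and `right` are hypothetical names, and the index labels are deliberately different). Adding the frames directly matches rows by label, while adding the underlying arrays matches them by position.

```python
import pandas as pd

left = pd.DataFrame({"h": [1, 2, 3]}, index=[0, 1, 2])
right = pd.DataFrame({"h": [11, 12, 13]}, index=[10, 11, 12])  # different index labels

# Label-based addition: no index label appears in both frames, so every cell is NaN.
aligned = left + right

# Position-based addition on the raw NumPy arrays.
positional = left.values + right.values

print(aligned["h"].isna().all())    # True
print(positional.ravel().tolist())  # [12, 14, 16]
```

If both frames come straight from read_csv with the default index and identical column names, the labels do match, but using the arrays avoids depending on that.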

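After overwriting, call df.to_csv(file_path, index=False) and df1.to_csv(file_path, index=False), each with its own file path; index=False keeps the row index from being written out as an extra column.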

For your 8 dfs, you can loop over them and aggregate as you go:

# take a copy of the first df's last 4 columns
total = df_list[0]
total = total[total.columns[-4:]].values.copy()
for df in df_list[1:]:
    total += df[df.columns[-4:]].values

Then loop over your dfs again to overwrite:

for df in df_list:
    df[df.columns[-4:]] = total

Then write them out again with to_csv.
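Putting the read, sum, overwrite, and write steps together, a rough end-to-end sketch might look like this. The function name `sum_last_columns_in_place`, the `n_last` parameter, and the temporary demo files are assumptions of mine for illustration; it also assumes every file has the same number of rows and a header row.

```python
import os
import tempfile

import pandas as pd


def sum_last_columns_in_place(paths, n_last=4):
    """Overwrite the last n_last columns of each CSV with the sum over all files."""
    frames = [pd.read_csv(path) for path in paths]

    # Add the raw arrays so rows are matched by position, not by label.
    total = sum(frame.iloc[:, -n_last:].to_numpy() for frame in frames)

    for path, frame in zip(paths, frames):
        frame[frame.columns[-n_last:]] = total
        # index=False keeps the row index from being written as an extra column.
        frame.to_csv(path, index=False)
    return total


# Demo on two small comma-separated files shaped like the question's example.
tmp = tempfile.mkdtemp()
paths = [os.path.join(tmp, name) for name in ("new1.csv", "new2.csv")]
columns = list("abcdefghijk")
for path, start in zip(paths, (1, 11)):
    demo = pd.DataFrame({c: range(start, start + 9) for c in columns})
    demo.to_csv(path, index=False)

sum_last_columns_in_place(paths)
result = pd.read_csv(paths[1])
print(result.iloc[0].tolist())  # [11, 11, 11, 11, 11, 11, 11, 12, 12, 12, 12]
```

For the real data you would pass the list of your 8 file paths instead of the demo files.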

python - Replace the last n columns with the sum across all files
