Question

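I have several CSV files (all in one folder) that share some columns but also have columns of their own. They all contain an IP column. The data looks like this: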

File_1.csv

a,IP,b,c,d,e
info,192.168.0.1,info1,info2,info3,info4

File_2.csv

a,b,IP,d,f,g
info,,192.168.0.1,info2,info5,info6

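As you can see, file 1 and file 2 disagree about the contents of column d, but I don't mind which file's value is kept. I have already tried pandas.merge, but it returns two separate entries for 192.168.0.1, with NaN in the columns that exist in file 1 but not in file 2, and vice versa. Does anyone know a way to do this?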

Edit 1:

The desired output should look like this:

Output

a,IP,b,c,d,e,f,g
info,192.168.0.1,info1,info2,info3,info4,info5,info6

I want the output to look like this for every row. Not every item in file 1 is in file 2, and vice versa.

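Edit 2:

Any IP address that is present in file 1 but not in file 2 should have a blank or "not available" value in the columns that are unique to file 2. For example, in the output file, columns f and g would be blank for IP addresses that are in file 1 but not in file 2. Likewise, for IPs that are in file 2 but not in file 1, columns c and e would be blank in the output file.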

Answer 1

This case:

Set the IP address as the index column, then use combine_first() to fill in the holes in a DataFrame that is the union of all IP addresses and all columns.

import pandas as pd

# Read in the files, using the IP address as the index column
df_1 = pd.read_csv('file1.csv', header=0, index_col='IP')
df_2 = pd.read_csv('file2.csv', header=0, index_col='IP')

# Fill in the NaNs: df_1 values take priority, df_2 fills the gaps
combined_df = df_1.combine_first(df_2)

# Write the result; 'combined.csv' is a placeholder output path
combined_df.to_csv('combined.csv', sep=',')
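As a self-contained check of this approach, the sketch below builds the two frames from in-memory strings rather than files. The 192.168.0.1 rows come from the question; the rows for 10.0.0.1 and 10.0.0.2 are made up for illustration, to show what happens to an IP that appears in only one file.

```python
import io

import pandas as pd

# The 192.168.0.1 rows are from the question; the other rows are hypothetical.
FILE_1 = """a,IP,b,c,d,e
info,192.168.0.1,info1,info2,info3,info4
x,10.0.0.1,x1,x2,x3,x4
"""
FILE_2 = """a,b,IP,d,f,g
info,,192.168.0.1,info2,info5,info6
y,y1,10.0.0.2,y2,y5,y6
"""

df_1 = pd.read_csv(io.StringIO(FILE_1), index_col='IP')
df_2 = pd.read_csv(io.StringIO(FILE_2), index_col='IP')

# Union of rows (IPs) and columns; df_1 wins wherever it has a non-null value.
combined_df = df_1.combine_first(df_2)

# The IP index is written out as the first column.
print(combined_df.to_csv())
```

An IP that appears in only one file keeps NaN in the other file's unique columns, which to_csv() writes as empty fields by default.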

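Edit: combine_first() takes the union of the indices, so we should put the IP address in the index column to ensure that the IP addresses from both files are included.

Other cases: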

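As the documentation states, you only have to be careful when the same IP address has conflicting non-null values for a column in both files (such as column d in the example above). In df_1.combine_first(df_2), df_1 has priority, so column d will be set to the value from df_1. Since you said it doesn't matter which file the value is taken from in this case, this isn't a problem.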
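To make the priority rule concrete, and to extend it to more than two files, here is a minimal sketch. The helper name combine_all is hypothetical, and the sketch assumes each file has at most one row per IP.

```python
import io
from functools import reduce

import pandas as pd


def combine_all(frames):
    """Fold a list of IP-indexed frames with combine_first (hypothetical helper).

    Earlier frames take priority on conflicting non-null values.
    The list must contain at least one frame.
    """
    return reduce(lambda left, right: left.combine_first(right), frames)


df_1 = pd.read_csv(
    io.StringIO("a,IP,b,c,d,e\ninfo,192.168.0.1,info1,info2,info3,info4\n"),
    index_col='IP',
)
df_2 = pd.read_csv(
    io.StringIO("a,b,IP,d,f,g\ninfo,,192.168.0.1,info2,info5,info6\n"),
    index_col='IP',
)

first_wins = combine_all([df_1, df_2])
second_wins = combine_all([df_2, df_1])

print(first_wins.loc['192.168.0.1', 'd'])   # info3 (value from df_1)
print(second_wins.loc['192.168.0.1', 'd'])  # info2 (value from df_2)
```

For a whole folder, the list of frames could be built by calling pd.read_csv(path, index_col='IP') on each path returned by sorted(glob.glob(...)); the sort order then decides which file wins a conflict.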

Answer 2

I think a simple dictionary should do the job. Assume you have read the contents of each file into the lists file1 and file2, so that:

file1[0] = [a,IP,b,c,d,e]
file1[1] = [info,192.168.0.1,info1,info2,info3,info4]
file2[0] = [a,b,IP,d,f,g]
file2[1] = [info,,192.168.0.1,info2,info5,info6]

(with quotes around each entry). The following should do what you want:

new_dict = {}

# Start with the values from file2...
for i in range(len(file2[0])):
    new_dict[file2[0][i]] = file2[1][i]

# ...then let file1 overwrite any columns the two files share
for i in range(len(file1[0])):
    new_dict[file1[0][i]] = file1[1][i]

output = [[], []]
output[0] = [key for key in new_dict]             # header row
output[1] = [new_dict[key] for key in output[0]]  # value row

On Python 3.7 and later, where dictionaries preserve insertion order, you should then get:

output[0] = [a,b,IP,d,f,g,c,e]
output[1] = [info,info1,192.168.0.1,info3,info5,info6,info2,info4]
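The dictionary idea handles a single data row per file. A possible extension to many rows, keyed on the IP column, is sketched below using the standard csv module. The helper name merge_rows is hypothetical, and its conflict rule is an assumption that differs slightly from the loop above: the first source wins, and later sources only fill in missing or blank values.

```python
import csv
import io


def merge_rows(*sources, key='IP'):
    """Merge rows from several CSV sources, keyed on `key` (hypothetical helper).

    Earlier sources win; later sources only fill missing or blank values.
    Returns the column names in first-seen order and the merged rows.
    """
    merged = {}   # key value -> {column: value}
    columns = []  # column names in first-seen order
    for source in sources:
        reader = csv.DictReader(source)
        for name in reader.fieldnames:
            if name not in columns:
                columns.append(name)
        for row in reader:
            record = merged.setdefault(row[key], {})
            for name, value in row.items():
                if not record.get(name):
                    record[name] = value
    return columns, list(merged.values())


file1 = io.StringIO("a,IP,b,c,d,e\ninfo,192.168.0.1,info1,info2,info3,info4\n")
file2 = io.StringIO("a,b,IP,d,f,g\ninfo,,192.168.0.1,info2,info5,info6\n")
columns, rows = merge_rows(file1, file2)

# Columns missing from a row are written as empty fields.
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=columns, restval='')
writer.writeheader()
writer.writerows(rows)
print(out.getvalue())
```

With real files, the io.StringIO objects would be replaced by file handles opened with open(path, newline='').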

Merge CSVs with some common columns and fill in NaNs
