How do I merge two fields from one CSV file with one field from another CSV file?

Time: 2017-05-15 11:07:32

Tags: pandas concat

I want to merge two CSV files, shown below.

First CSV file:

import pandas as pd

df = pd.DataFrame()
df["ticket_number"] = ['AAA', 'AAA', 'AAA', 'ABC', 'ABA','ADC','ABA','BBB']
df["train_board_station"] = ['Tokyo', 'LA', 'Paris', 'New_York', 'Delhi','Phoenix', 'London','LA']
df["train_off_station"] = ['Phoenix', 'London', 'Sydney', 'Berlin', 'Shanghai','LA', 'Paris', 'New_York']

Second CSV file:

rec = pd.DataFrame()
rec["code"] = ['Tokyo','London','Paris','New_York','Shanghai','LA','Sydney','Berlin','Phoenix','Delhi']
rec["count_A"] = ['1.2','7.8','4','8','7.8','3','8','5','2','10']
rec["count_B"] = ['12','78','4','8','78','36','88','51','25','10']

I am using the following code:

for x in ["board", "off"]:
    df["station"] = df["train_" + x + "_station"]
    df["code"] = df["train_" + x + "_station"]
    df = pd.concat([df,rec], axis=1, join_axes=[df.index])
    df[x + "_count_A"] = df["count_A"]
    df[x + "_count_B"] = df["count_B"]
    df = df.drop(["station", "code","count_A","count_B"], axis=1)

I get the following incorrect output:

ticket_number,train_board_station,train_off_station,board_count_A,board_count_B,off_count_A,off_count_B
AAA,Tokyo,Phoenix,1.2,12,1.2,12
AAA,LA,London,7.8,78,7.8,78
AAA,Paris,Sydney,4,4,4,4
ABC,New_York,Berlin,8,8,8,8
ABA,Delhi,Shanghai,7.8,78,7.8,78
ADC,Phoenix,LA,3,36,3,36
ABA,London,Paris,8,88,8,88
BBB,LA,New_York,5,51,5,51

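I noticed that count_A and count_B are not merged with the train_board_station and train_off_station of the same row. Instead, the rows of the second file are attached in their own order (first row to first row, second row to second row), and the same values appear twice, once for board and once for off.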
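The behaviour described above can be reproduced with a small sketch. The frames `left` and `lookup` are hypothetical toy data, not the ones from the question, and the `join_axes` argument is left out because it was removed in pandas 1.0. With `axis=1`, `pd.concat` pairs rows by index label and never compares the station names:

```python
import pandas as pd

# Hypothetical toy frames, smaller than the ones in the question.
left = pd.DataFrame({"station": ["LA", "Tokyo", "LA"]})
lookup = pd.DataFrame({"code": ["Tokyo", "LA"], "count_A": ["1.2", "3"]})

# axis=1 places the frames side by side and aligns them on the index
# (0 with 0, 1 with 1, ...), whatever "station" and "code" contain.
by_position = pd.concat([left, lookup], axis=1)
print(by_position)
```

In this sketch the row with station LA sits next to code Tokyo, and the third row has no partner in `lookup` at all, which is the same kind of mismatch as in the output above.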

The expected output is:

ticket_number,train_board_station,train_off_station,board_count_A,board_count_B,off_count_A,off_count_B
AAA,Tokyo,Phoenix,1.2,12,2,25
AAA,LA,London,3,36,7.8,78
AAA,Paris,Sydney,4,4,8,88
ABC,New_York,Berlin,8,8,5,51
ABA,Delhi,Shanghai,10,10,7.8,78
ADC,Phoenix,LA,2,25,3,36
ABA,London,Paris,7.8,78,4,4
BBB,LA,New_York,3,36,8,8

1 answer:

Answer 0: (score: 0)

The problem is the duplicate station names, so I use join to do a left join on code:

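for x in ["board", "off"]:
    # Temporary key columns holding the station name for this direction
    df["code"] = df["station"] = df["train_" + x + "_station"]
    # Left join: look up each station name in rec by its code
    df = df.join(rec.set_index('code'), on='code')
    df[x + "_count_A"] = df["count_A"]
    df[x + "_count_B"] = df["count_B"]
    # Drop the helper columns before the next pass
    df = df.drop(["station", "code","count_A","count_B"], axis=1)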

print (df)
  ticket_number train_board_station train_off_station board_count_A  \
0           AAA               Tokyo           Phoenix           1.2   
1           AAA                  LA            London             3   
2           AAA               Paris            Sydney             4   
3           ABC            New_York            Berlin             8   
4           ABA               Delhi          Shanghai            10   
5           ADC             Phoenix                LA             2   
6           ABA              London             Paris           7.8   
7           BBB                  LA          New_York             3   

  board_count_B off_count_A off_count_B  
0            12           2          25  
1            36         7.8          78  
2             4           8          88  
3             8           5          51  
4            10         7.8          78  
5            25           3          36  
6            78           4           4  
7            36           8           8
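As a possible alternative to the join above, the same lookup can be sketched with `Series.map`, which avoids the temporary `code` and `station` columns. This is only a sketch under two assumptions: the frames below are a trimmed copy of the question's data, and every value in `rec["code"]` is unique (mapping through a Series with a duplicated index raises an error).

```python
import pandas as pd

# Trimmed copy of the question's data (three tickets only).
df = pd.DataFrame({
    "ticket_number": ["AAA", "AAA", "ABC"],
    "train_board_station": ["Tokyo", "LA", "New_York"],
    "train_off_station": ["Phoenix", "London", "Berlin"],
})
rec = pd.DataFrame({
    "code": ["Tokyo", "London", "New_York", "LA", "Berlin", "Phoenix"],
    "count_A": ["1.2", "7.8", "8", "3", "5", "2"],
    "count_B": ["12", "78", "8", "36", "51", "25"],
})

# Index the lookup table by station code once (assumes unique codes).
lookup = rec.set_index("code")

for x in ["board", "off"]:
    stations = df["train_" + x + "_station"]
    for col in ["count_A", "count_B"]:
        # Each station name is looked up in the indexed Series;
        # names that are missing from rec become NaN.
        df[x + "_" + col] = stations.map(lookup[col])

print(df)
```

The counts stay strings here, as in the question. Both approaches match rows by station name rather than by position.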