I have three CSV files:
File 1
id,code
1,a
2,b
3,c
4,d
File 2
no,count,sum,class
3,567,55562,Y
5,673,66259,L
1,674,78256,Y
4,344,56789,Y
File 3
record,mean,median
3,5437,553
2,67233,664
1,67234,785
4,34423,556
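I want to merge file 2 into file 1 where id and no match, bringing in count and sum, and merge file 3 into file 1 where id and record match, bringing in mean and median. I tried the following code, but the final output file has many empty fields, even in rows whose id matches.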
df = pd.concat([file1, file2,file3], join_axes=[df.index])
df= df.drop["class"]
Answer 0 (score: 1)
I think you need to set the first column as the index in read_csv:
import pandas as pd
from io import StringIO  # pandas.compat.StringIO is not available in current pandas
temp=u"""id,code
1,a
2,b
3,c
4,d"""
# After testing, replace 'StringIO(temp)' with 'filename.csv'
file1 = pd.read_csv(StringIO(temp), index_col=[0])
print (file1)
temp=u"""
no,count,sum,class
3,567,55562,Y
5,673,66259,L
1,674,78256,Y
4,344,56789,Y"""
# After testing, replace 'StringIO(temp)' with 'filename.csv'
file2 = pd.read_csv(StringIO(temp), index_col=[0])
print (file2)
temp=u"""
record,mean,median
3,5437,553
2,67233,664
1,67234,785
4,34423,556"""
# After testing, replace 'StringIO(temp)' with 'filename.csv'
file3 = pd.read_csv(StringIO(temp), index_col=[0])
print (file3)
df = pd.concat([file1, file2,file3], axis=1).drop("class", axis=1)
print (df)
  code  count      sum     mean  median
1    a  674.0  78256.0  67234.0   785.0
2    b    NaN      NaN  67233.0   664.0
3    c  567.0  55562.0   5437.0   553.0
4    d  344.0  56789.0  34423.0   556.0
5  NaN  673.0  66259.0      NaN     NaN
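The NaN values in this output come from how concat with axis=1 aligns rows: it matches on index labels and, by default, keeps the union of all labels. A minimal sketch to check which ids are missing on each side, using the same sample data for file 1 and file 2 (the variable names only_in_file1, only_in_file2 and combined are made up for this illustration):

```python
import pandas as pd
from io import StringIO

# Same sample data as above, with the first column used as the index
file1 = pd.read_csv(StringIO("id,code\n1,a\n2,b\n3,c\n4,d"), index_col=0)
file2 = pd.read_csv(
    StringIO("no,count,sum,class\n3,567,55562,Y\n5,673,66259,L\n"
             "1,674,78256,Y\n4,344,56789,Y"),
    index_col=0,
)

# Labels that appear in one file but not in the other
only_in_file1 = file1.index.difference(file2.index)
only_in_file2 = file2.index.difference(file1.index)
print(only_in_file1.tolist())  # [2]
print(only_in_file2.tolist())  # [5]

# concat keeps both labels and fills the missing side with NaN
combined = pd.concat([file1, file2], axis=1)
print(combined.loc[[2, 5]])
```

So id 2 has no count or sum because file 2 has no row for it, and id 5 has no code because file 1 has no row for it.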
If you do not set the index in read_csv, you need to add set_index:
import pandas as pd
from io import StringIO  # pandas.compat.StringIO is not available in current pandas
temp=u"""id,code
1,a
2,b
3,c
4,d"""
# After testing, replace 'StringIO(temp)' with 'filename.csv'
file1 = pd.read_csv(StringIO(temp))
print (file1)
temp=u"""
no,count,sum,class
3,567,55562,Y
5,673,66259,L
1,674,78256,Y
4,344,56789,Y"""
# After testing, replace 'StringIO(temp)' with 'filename.csv'
file2 = pd.read_csv(StringIO(temp))
print (file2)
temp=u"""
record,mean,median
3,5437,553
2,67233,664
1,67234,785
4,34423,556"""
# After testing, replace 'StringIO(temp)' with 'filename.csv'
file3 = pd.read_csv(StringIO(temp))
print (file3)
df = pd.concat([file1.set_index('id'),
                file2.set_index('no'),
                file3.set_index('record')], axis=1).drop("class", axis=1)
print (df)
  code  count      sum     mean  median
1    a  674.0  78256.0  67234.0   785.0
2    b    NaN      NaN  67233.0   664.0
3    c  567.0  55562.0   5437.0   553.0
4    d  344.0  56789.0  34423.0   556.0
5  NaN  673.0  66259.0      NaN     NaN
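If the goal is to keep only the ids that exist in file 1, one alternative to concat is a chain of left joins with DataFrame.merge. This is a different technique from the one in the answer, sketched here under the assumption that the key columns (no, record) contain no duplicates; the name merged is made up for this example:

```python
import pandas as pd
from io import StringIO

# Same sample data, read without setting an index
file1 = pd.read_csv(StringIO("id,code\n1,a\n2,b\n3,c\n4,d"))
file2 = pd.read_csv(
    StringIO("no,count,sum,class\n3,567,55562,Y\n5,673,66259,L\n"
             "1,674,78256,Y\n4,344,56789,Y")
)
file3 = pd.read_csv(
    StringIO("record,mean,median\n3,5437,553\n2,67233,664\n"
             "1,67234,785\n4,34423,556")
)

merged = (
    file1
    # left join: every row of file1 is kept, matched on id == no
    .merge(file2.drop(columns="class"), how="left", left_on="id", right_on="no")
    # second left join, matched on id == record
    .merge(file3, how="left", left_on="id", right_on="record")
    # the right-hand key columns are redundant after the joins
    .drop(columns=["no", "record"])
)
print(merged)
```

With this approach id 5 from file 2 is dropped, and id 2 still gets NaN for count and sum because file 2 has no matching row.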
Or, for an inner join, add join='inner' to concat:
df = pd.concat([file1.set_index('id'),
file2.set_index('no'),
file3.set_index('record')], join='inner', axis=1).drop("class", axis=1)
print (df)
  code  count    sum   mean  median
3    c    567  55562   5437     553
1    a    674  78256  67234     785
4    d    344  56789  34423     556
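To compare the two join modes side by side, here is a small sketch on the same sample data (the names frames, outer and inner are made up for this example). It checks which labels each mode keeps, and why the inner result prints integers while the outer result prints floats:

```python
import pandas as pd
from io import StringIO

file1 = pd.read_csv(StringIO("id,code\n1,a\n2,b\n3,c\n4,d"))
file2 = pd.read_csv(
    StringIO("no,count,sum,class\n3,567,55562,Y\n5,673,66259,L\n"
             "1,674,78256,Y\n4,344,56789,Y")
)
file3 = pd.read_csv(
    StringIO("record,mean,median\n3,5437,553\n2,67233,664\n"
             "1,67234,785\n4,34423,556")
)

frames = [file1.set_index("id"), file2.set_index("no"), file3.set_index("record")]
outer = pd.concat(frames, axis=1).drop("class", axis=1)
inner = pd.concat(frames, axis=1, join="inner").drop("class", axis=1)

# Outer keeps every label seen in any file; inner keeps only shared labels
print(sorted(outer.index.tolist()))  # [1, 2, 3, 4, 5]
print(sorted(inner.index.tolist()))  # [1, 3, 4]

# NaN forces a float column in the outer result; the inner result has no
# missing values, so the integer dtype is preserved
print(outer["count"].dtype, inner["count"].dtype)
```

The row order of the inner result may differ between pandas versions, which is why the labels are sorted before printing.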