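Pandas: merging data on specific columns

Time: 2016-05-26 16:27:27

Tags: python csv pandas

I am trying to merge data to create a new DataFrame. I have two DataFrames, and I want to write the KEYS value from df2 into a third DataFrame wherever a TC_NUM in df1 equals any of the numbers in the columns below.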

df2          
                 0           1           2           3           4   \
KEYS                                                                  
FIT-3982  2024.0016   0101.0007        None        None        None   
FIT-3980  1140.0107        None        None        None        None   
FIT-3979  1907.0007   1907.0012   1907.0019   1907.0020   1907.0021   
FIT-3975  0117.0002   0117.0008        None        None        None   
FIT-3974  3004.0130        None        None        None        None   
FIT-3970  0114.0001   0114.0002   0101.0010   0114.0004   0114.0005   
FIT-3967  0113.0001    0113.009        None        None        None   
FIT-3964  1901.0017   1901.0019   0101.0005   1906.0015   1906.0028   
FIT-3963  1801.0038   0101.0002   1803.0020   1803.0021   1805.0020   
FIT-3960  0104.0001   0104.0009   0104.0014   0104.0015   0104.0016   

This is df1

                                       ID     TC_NUM
0  dialog_testcase_0101.0001_greeting.xml  0101.0001
1  dialog_testcase_0101.0002_greeting.xml  0101.0002
2  dialog_testcase_0101.0003_greeting.xml  0101.0003
3  dialog_testcase_0101.0004_greeting.xml  0101.0004
4  dialog_testcase_0101.0005_greeting.xml  0101.0005
5  dialog_testcase_0101.0006_greeting.xml  0101.0006
6  dialog_testcase_0101.0007_greeting.xml  0101.0007
7  dialog_testcase_0101.0008_greeting.xml  0101.0008
8  dialog_testcase_0101.0009_greeting.xml  0101.0009
9  dialog_testcase_0101.0010_greeting.xml  0101.0010

WHAT I WANT

df3-final          
                                       ID     TC_NUM   KEYS
0  dialog_testcase_0101.0001_greeting.xml  0101.0002  FIT-3963
1  dialog_testcase_0101.0002_greeting.xml  0101.0003
2  dialog_testcase_0101.0003_greeting.xml  0101.0004
3  dialog_testcase_0101.0004_greeting.xml  0101.0005  FIT-3964
4  dialog_testcase_0101.0005_greeting.xml  0101.0006
5  dialog_testcase_0101.0006_greeting.xml  0101.0007  FIT-3982
6  dialog_testcase_0101.0007_greeting.xml  0101.0008
7  dialog_testcase_0101.0008_greeting.xml  0101.0009
8  dialog_testcase_0101.0009_greeting.xml  0101.0010
9  dialog_testcase_0101.0010_greeting.xml  0101.0011
My code so far:

df1 = pd.read_csv('csv1.csv')


df2 = pd.read_csv('InitialQuerydataOpen.csv')

print df2.head(10)

df2.set_index('KEYS',inplace=True)

#change separator from `, ` to `,` (removed space)
#df2 = df2.TC_NUM.str[3:].str.split(',', expand=True).unstack().reset_index(drop=True, level=0).reset_index(name='TC_NUM')
df2 = df2.TC_NUM.str[3:].str.split(',', expand=True)



mergedOpen = pd.merge(df1, df2, on='df1[TC_NUM]', how='left')
print mergedOpen

2 answers:

Answer 0 (score: 3)

You can stack df2, do some quick formatting, and then use merge:

df2 = df2.stack().reset_index(level=0).rename(columns={0: 'TC_NUM'})
result = df1.merge(df2, how='left', on=['TC_NUM'])

Resulting output:

                                       ID     TC_NUM      KEYS
0  dialog_testcase_0101.0001_greeting.xml  0101.0001       NaN
1  dialog_testcase_0101.0002_greeting.xml  0101.0002  FIT-3963
2  dialog_testcase_0101.0003_greeting.xml  0101.0003       NaN
3  dialog_testcase_0101.0004_greeting.xml  0101.0004       NaN
4  dialog_testcase_0101.0005_greeting.xml  0101.0005  FIT-3964
5  dialog_testcase_0101.0006_greeting.xml  0101.0006       NaN
6  dialog_testcase_0101.0007_greeting.xml  0101.0007  FIT-3982
7  dialog_testcase_0101.0008_greeting.xml  0101.0008       NaN
8  dialog_testcase_0101.0009_greeting.xml  0101.0009       NaN
9  dialog_testcase_0101.0010_greeting.xml  0101.0010  FIT-3970
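
To make the intermediate steps visible, here is a minimal self-contained sketch on a trimmed-down copy of the data. The three-row frames and the name long_df2 are only for illustration. The extra dropna() and fillna('') calls are additions, not part of the answer above: dropna() is there because newer pandas versions may keep empty cells when stacking, and fillna('') gives the blank cells shown in the question's desired output.

```python
import pandas as pd

# Trimmed-down stand-ins for the question's frames (values copied from above).
df2 = pd.DataFrame(
    {
        0: ["2024.0016", "1140.0107", "1801.0038"],
        1: ["0101.0007", None, "0101.0002"],
    },
    index=pd.Index(["FIT-3982", "FIT-3980", "FIT-3963"], name="KEYS"),
)
df1 = pd.DataFrame({"TC_NUM": ["0101.0001", "0101.0002", "0101.0007"]})

# stack() moves the column labels into an inner index level, giving one
# value per (KEYS, column) pair. dropna() removes the empty cells; older
# pandas versions already drop them inside stack().
stacked = df2.stack().dropna()

# reset_index(level=0) turns KEYS back into a regular column. The values
# column of the unnamed Series is called 0, so rename it to TC_NUM.
long_df2 = stacked.reset_index(level=0).rename(columns={0: "TC_NUM"})

# A left merge keeps every row of df1 and fills KEYS where TC_NUM matches.
result = df1.merge(long_df2, how="left", on="TC_NUM")

# Optional: show unmatched rows as blanks instead of NaN.
result["KEYS"] = result["KEYS"].fillna("")
print(result)
```

Note that if the same TC_NUM appears under more than one KEYS value, the merge returns one row per match.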

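Using merge appears to be more efficient than the row-wise apply approach (get_key is defined in the other answer):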

%timeit df1.merge(df2.stack().reset_index(level=0).rename(columns={0:'TC_NUM'}), how='left', on=['TC_NUM'])
100 loops, best of 3: 3.84 ms per loop

%timeit df1.apply(lambda x: get_key(df2, x.TC_NUM), axis=1)
100 loops, best of 3: 10.4 ms per loop
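
%timeit is an IPython magic. Outside IPython, a rough comparison can be sketched with the standard timeit module. The small frames, the wrapper names via_merge and via_apply, and the repeat count are assumptions for illustration; absolute numbers depend on the machine, the pandas version and the data size, so treat the output as indicative only.

```python
import timeit

import pandas as pd

# Small stand-in frames (values copied from the question).
df2 = pd.DataFrame(
    {
        0: ["2024.0016", "1140.0107", "1801.0038"],
        1: ["0101.0007", None, "0101.0002"],
    },
    index=pd.Index(["FIT-3982", "FIT-3980", "FIT-3963"], name="KEYS"),
)
df1 = pd.DataFrame({"TC_NUM": ["0101.0001", "0101.0002", "0101.0007"]})


def get_key(df2, tc_num):
    """Row-wise lookup: first KEYS label whose row contains tc_num, else None."""
    hits = (df2 == tc_num).any(axis=1)
    hits = hits[hits]
    if not hits.empty:
        return hits.index[0]
    return None


def via_merge():
    # Reshape df2 to long form, then left-merge on TC_NUM.
    long_df2 = (
        df2.stack().dropna().reset_index(level=0).rename(columns={0: "TC_NUM"})
    )
    return df1.merge(long_df2, how="left", on="TC_NUM")


def via_apply():
    # Call get_key once per row of df1.
    return df1.apply(lambda x: get_key(df2, x.TC_NUM), axis=1)


n = 20  # number of repetitions per approach (arbitrary choice)
t_merge = timeit.timeit(via_merge, number=n) / n
t_apply = timeit.timeit(via_apply, number=n) / n
print("merge: %.2f ms per call" % (t_merge * 1e3))
print("apply: %.2f ms per call" % (t_apply * 1e3))
```

The two approaches are not fully interchangeable: when a TC_NUM appears under several KEYS values, the merge yields one row per match, while get_key returns only the first match.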

Answer 1 (score: 1)

Setup

import pandas as pd
from io import StringIO  # on Python 2 use: from StringIO import StringIO

text2 = """KEYS                 0           1           2           3           4 
FIT-3982  2024.0016   0101.0007        None        None        None   
FIT-3980  1140.0107        None        None        None        None   
FIT-3979  1907.0007   1907.0012   1907.0019   1907.0020   1907.0021   
FIT-3975  0117.0002   0117.0008        None        None        None   
FIT-3974  3004.0130        None        None        None        None   
FIT-3970  0114.0001   0114.0002   0101.0010   0114.0004   0114.0005   
FIT-3967  0113.0001    0113.009        None        None        None   
FIT-3964  1901.0017   1901.0019   0101.0005   1906.0015   1906.0028   
FIT-3963  1801.0038   0101.0002   1803.0020   1803.0021   1805.0020   
FIT-3960  0104.0001   0104.0009   0104.0014   0104.0015   0104.0016   """

df2 = pd.read_csv(StringIO(text2), sep=r'\s+', index_col=0, dtype=str)

df2[df2 == 'None'] = None

text1 = """                                       ID     TC_NUM
0  dialog_testcase_0101.0001_greeting.xml  0101.0001
1  dialog_testcase_0101.0002_greeting.xml  0101.0002
2  dialog_testcase_0101.0003_greeting.xml  0101.0003
3  dialog_testcase_0101.0004_greeting.xml  0101.0004
4  dialog_testcase_0101.0005_greeting.xml  0101.0005
5  dialog_testcase_0101.0006_greeting.xml  0101.0006
6  dialog_testcase_0101.0007_greeting.xml  0101.0007
7  dialog_testcase_0101.0008_greeting.xml  0101.0008
8  dialog_testcase_0101.0009_greeting.xml  0101.0009
9  dialog_testcase_0101.0010_greeting.xml  0101.0010"""

df1 = pd.read_csv(StringIO(text1), sep=r'\s+', dtype=str)

Solution

def get_key(df2, tc_num):
    df2test = (df2 == tc_num).any(axis=1)
    df2test = df2test[df2test]
    if not df2test.empty:
        return df2test.index[0]

df1['keys'] = df1.apply(lambda x: get_key(df2, x.TC_NUM), axis=1)

print(df1)

                                       ID     TC_NUM      keys
0  dialog_testcase_0101.0001_greeting.xml  0101.0001      None
1  dialog_testcase_0101.0002_greeting.xml  0101.0002  FIT-3963
2  dialog_testcase_0101.0003_greeting.xml  0101.0003      None
3  dialog_testcase_0101.0004_greeting.xml  0101.0004      None
4  dialog_testcase_0101.0005_greeting.xml  0101.0005  FIT-3964
5  dialog_testcase_0101.0006_greeting.xml  0101.0006      None
6  dialog_testcase_0101.0007_greeting.xml  0101.0007  FIT-3982
7  dialog_testcase_0101.0008_greeting.xml  0101.0008      None
8  dialog_testcase_0101.0009_greeting.xml  0101.0009      None
9  dialog_testcase_0101.0010_greeting.xml  0101.0010  FIT-3970

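Explanation

  • Make sure the dtypes are string or object. Pass dtype=str when reading the CSV, or convert afterwards with df1.astype(str).
  • Use any(axis=1) to check whether the string appears in any column of a row.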
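
Both points can be illustrated with a short sketch, written for Python 3. The inline CSV text and the variable names are made up for illustration. Without dtype=str, pandas parses values such as 0101.0007 as floats, so the leading zero is lost and a string comparison no longer matches.

```python
from io import StringIO

import pandas as pd

raw = "TC_NUM\n0101.0007\n0101.0010\n"

# Default parsing infers a float column, which drops the leading zeros.
as_default = pd.read_csv(StringIO(raw))
# dtype=str keeps the values exactly as written in the file.
as_text = pd.read_csv(StringIO(raw), dtype=str)

print(as_default["TC_NUM"].dtype)     # a float dtype
print(as_text["TC_NUM"].tolist())     # ['0101.0007', '0101.0010']

# any(axis=1): one boolean per row, True if any column equals the value.
df2 = pd.DataFrame(
    {
        0: ["2024.0016", "1140.0107", "1801.0038"],
        1: ["0101.0007", None, "0101.0002"],
    },
    index=pd.Index(["FIT-3982", "FIT-3980", "FIT-3963"], name="KEYS"),
)
mask = (df2 == "0101.0007").any(axis=1)

# Keep only the rows where the mask is True and read off their labels.
print(mask[mask].index.tolist())      # ['FIT-3982']
```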