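I am trying to merge data to create a new dataframe. I have two dataframes, and I want to write the KEYS value into a third dataframe whenever a row's TC_NUM equals any of the numbers in the following columns.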
df2
0 1 2 3 4 \
KEYS
FIT-3982 2024.0016 0101.0007 None None None
FIT-3980 1140.0107 None None None None
FIT-3979 1907.0007 1907.0012 1907.0019 1907.0020 1907.0021
FIT-3975 0117.0002 0117.0008 None None None
FIT-3974 3004.0130 None None None None
FIT-3970 0114.0001 0114.0002 0101.0010 0114.0004 0114.0005
FIT-3967 0113.0001 0113.009 None None None
FIT-3964 1901.0017 1901.0019 0101.0005 1906.0015 1906.0028
FIT-3963 1801.0038 0101.0002 1803.0020 1803.0021 1805.0020
FIT-3960 0104.0001 0104.0009 0104.0014 0104.0015 0104.0016
This is df1
ID TC_NUM
0 dialog_testcase_0101.0001_greeting.xml 0101.0001
1 dialog_testcase_0101.0002_greeting.xml 0101.0002
2 dialog_testcase_0101.0003_greeting.xml 0101.0003
3 dialog_testcase_0101.0004_greeting.xml 0101.0004
4 dialog_testcase_0101.0005_greeting.xml 0101.0005
5 dialog_testcase_0101.0006_greeting.xml 0101.0006
6 dialog_testcase_0101.0007_greeting.xml 0101.0007
7 dialog_testcase_0101.0008_greeting.xml 0101.0008
8 dialog_testcase_0101.0009_greeting.xml 0101.0009
9 dialog_testcase_0101.0010_greeting.xml 0101.0010
WHAT I WANT
df3-final
ID TC_NUM KEYS
0 dialog_testcase_0101.0001_greeting.xml 0101.0002 FIT-3963
1 dialog_testcase_0101.0002_greeting.xml 0101.0003
2 dialog_testcase_0101.0003_greeting.xml 0101.0004
3 dialog_testcase_0101.0004_greeting.xml 0101.0005 FIT-3964
4 dialog_testcase_0101.0005_greeting.xml 0101.0006
5 dialog_testcase_0101.0006_greeting.xml 0101.0007 FIT-3982
6 dialog_testcase_0101.0007_greeting.xml 0101.0008
7 dialog_testcase_0101.0008_greeting.xml 0101.0009
8 dialog_testcase_0101.0009_greeting.xml 0101.0010
9 dialog_testcase_0101.0010_greeting.xml 0101.0011
Code so far:
df1 = pd.read_csv('csv1.csv')
df2 = pd.read_csv('InitialQuerydataOpen.csv')
print df2.head(10)
df2.set_index('KEYS',inplace=True)
#change separator from `, ` to `,` (removed space)
#df2 = df2.TC_NUM.str[3:].str.split(',', expand=True).unstack().reset_index(drop=True, level=0).reset_index(name='TC_NUM')
df2 = df2.TC_NUM.str[3:].str.split(',', expand=True)
mergedOpen = pd.merge(df1, df2, on='df1[TC_NUM]', how='left')
print mergedOpen
Answer 0 (score: 3)
You can stack df2, do some quick formatting, and then use merge:
df2 = df2.stack().reset_index(level=0).rename(columns={0: 'TC_NUM'})
result = df1.merge(df2, how='left', on=['TC_NUM'])
Resulting output:
ID TC_NUM KEYS
0 dialog_testcase_0101.0001_greeting.xml 0101.0001 NaN
1 dialog_testcase_0101.0002_greeting.xml 0101.0002 FIT-3963
2 dialog_testcase_0101.0003_greeting.xml 0101.0003 NaN
3 dialog_testcase_0101.0004_greeting.xml 0101.0004 NaN
4 dialog_testcase_0101.0005_greeting.xml 0101.0005 FIT-3964
5 dialog_testcase_0101.0006_greeting.xml 0101.0006 NaN
6 dialog_testcase_0101.0007_greeting.xml 0101.0007 FIT-3982
7 dialog_testcase_0101.0008_greeting.xml 0101.0008 NaN
8 dialog_testcase_0101.0009_greeting.xml 0101.0009 NaN
9 dialog_testcase_0101.0010_greeting.xml 0101.0010 FIT-3970
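To make the intermediate steps of the stack-and-merge approach visible, here is a minimal sketch on a cut-down copy of the data. The three-row df1 and two-row df2 are hypothetical miniatures of the frames in the question, and the names stacked and long_df are only illustrative. The explicit dropna() is an assumption added so that empty cells are removed regardless of how the installed pandas version handles missing values in stack.

```python
import pandas as pd

# Hypothetical miniature versions of df1 and df2 from the question.
df1 = pd.DataFrame({
    "ID": [
        "dialog_testcase_0101.0001_greeting.xml",
        "dialog_testcase_0101.0002_greeting.xml",
        "dialog_testcase_0101.0007_greeting.xml",
    ],
    "TC_NUM": ["0101.0001", "0101.0002", "0101.0007"],
})
df2 = pd.DataFrame(
    {
        0: ["2024.0016", "1801.0038"],
        1: ["0101.0007", "0101.0002"],
        2: [None, "1803.0020"],
    },
    index=pd.Index(["FIT-3982", "FIT-3963"], name="KEYS"),
)

# stack() reshapes the wide table into one long Series indexed by (KEYS, column).
# dropna() removes the empty cells explicitly.
stacked = df2.stack().dropna()

# Move KEYS (index level 0) into a regular column. The unnamed Series values
# land in a column labelled 0, which is renamed to match df1.
long_df = stacked.reset_index(level=0).rename(columns={0: "TC_NUM"})

# A left merge keeps every row of df1 and fills KEYS only where TC_NUM matches.
result = df1.merge(long_df, how="left", on="TC_NUM")
print(result)
```

If the same TC_NUM appears under more than one key in df2, a left merge returns one row per match, so the result can have more rows than df1.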
Using merge also appears to be more efficient than the apply-based get_key approach from the other answer:
%timeit df1.merge(df2.stack().reset_index(level=0).rename(columns={0:'TC_NUM'}), how='left', on=['TC_NUM'])
100 loops, best of 3: 3.84 ms per loop
%timeit df1.apply(lambda x: get_key(df2, x.TC_NUM), axis=1)
100 loops, best of 3: 10.4 ms per loop
Answer 1 (score: 1)
import pandas as pd
from StringIO import StringIO
text2 = """KEYS 0 1 2 3 4
FIT-3982 2024.0016 0101.0007 None None None
FIT-3980 1140.0107 None None None None
FIT-3979 1907.0007 1907.0012 1907.0019 1907.0020 1907.0021
FIT-3975 0117.0002 0117.0008 None None None
FIT-3974 3004.0130 None None None None
FIT-3970 0114.0001 0114.0002 0101.0010 0114.0004 0114.0005
FIT-3967 0113.0001 0113.009 None None None
FIT-3964 1901.0017 1901.0019 0101.0005 1906.0015 1906.0028
FIT-3963 1801.0038 0101.0002 1803.0020 1803.0021 1805.0020
FIT-3960 0104.0001 0104.0009 0104.0014 0104.0015 0104.0016 """
df2 = pd.read_csv(StringIO(text2), delim_whitespace=True, index_col=0, dtype=str)
df2[df2 == 'None'] = None
text1 = """ ID TC_NUM
0 dialog_testcase_0101.0001_greeting.xml 0101.0001
1 dialog_testcase_0101.0002_greeting.xml 0101.0002
2 dialog_testcase_0101.0003_greeting.xml 0101.0003
3 dialog_testcase_0101.0004_greeting.xml 0101.0004
4 dialog_testcase_0101.0005_greeting.xml 0101.0005
5 dialog_testcase_0101.0006_greeting.xml 0101.0006
6 dialog_testcase_0101.0007_greeting.xml 0101.0007
7 dialog_testcase_0101.0008_greeting.xml 0101.0008
8 dialog_testcase_0101.0009_greeting.xml 0101.0009
9 dialog_testcase_0101.0010_greeting.xml 0101.0010"""
df1 = pd.read_csv(StringIO(text1), delim_whitespace=True, dtype=str)
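def get_key(df2, tc_num):
    # Flag each row of df2 that contains tc_num in any column
    df2test = (df2 == tc_num).any(axis=1)
    # Keep only the rows where the flag is True
    df2test = df2test[df2test]
    if not df2test.empty:
        # Return the KEYS label of the first matching row
        return df2test.index[0]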
df1['keys'] = df1.apply(lambda x: get_key(df2, x.TC_NUM), axis=1)
print df1
ID TC_NUM keys
0 dialog_testcase_0101.0001_greeting.xml 0101.0001 None
1 dialog_testcase_0101.0002_greeting.xml 0101.0002 FIT-3963
2 dialog_testcase_0101.0003_greeting.xml 0101.0003 None
3 dialog_testcase_0101.0004_greeting.xml 0101.0004 None
4 dialog_testcase_0101.0005_greeting.xml 0101.0005 FIT-3964
5 dialog_testcase_0101.0006_greeting.xml 0101.0006 None
6 dialog_testcase_0101.0007_greeting.xml 0101.0007 FIT-3982
7 dialog_testcase_0101.0008_greeting.xml 0101.0008 None
8 dialog_testcase_0101.0009_greeting.xml 0101.0009 None
9 dialog_testcase_0101.0010_greeting.xml 0101.0010 FIT-3970
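The data is read with dtype=str so that the values are compared as strings. For a dataframe that already exists you can do df1.astype(str). any(axis=1) checks whether the string is in any of the columns.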
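The following sketch illustrates both notes under some assumptions: the two-row CSV text and the column names A and B are made up for the example, and as_float, as_text and row_has_match are hypothetical names. It shows that without dtype=str the test-case numbers are parsed as floats and lose their leading zero, and how any(axis=1) reduces an element-wise comparison to one flag per row.

```python
import io
import pandas as pd

# Made-up CSV with the same kind of values as df2 in the question.
csv_text = (
    "KEYS,A,B\n"
    "FIT-3982,2024.0016,0101.0007\n"
    "FIT-3963,1801.0038,0101.0002\n"
)

# Default type inference parses the numbers as floats, so "0101.0002"
# loses its leading zero.
as_float = pd.read_csv(io.StringIO(csv_text), index_col=0)

# dtype=str keeps every cell exactly as it was written.
as_text = pd.read_csv(io.StringIO(csv_text), index_col=0, dtype=str)

# Comparing the whole frame with one value gives a boolean frame;
# any(axis=1) collapses it to a single True/False per row.
row_has_match = (as_text == "0101.0002").any(axis=1)
matching_keys = row_has_match[row_has_match].index.tolist()

print(as_float.loc["FIT-3963", "B"])
print(as_text.loc["FIT-3963", "B"])
print(matching_keys)
```

Values that have already been parsed as floats cannot recover the leading zero through astype(str), so the string conversion is most useful at read time.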