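I have a problem I have been working on for several days. I have two DataFrames, shown below. The index definition refers to a set of TEST, NAME and SEQUENCE triplets, which are unique. The goal is to get the index value from the index file where the row matches:
one of the sequence values in config, e.g. [111,222,333] (so it could be 111 or 222 or 333), and TEST and NAME.
The config file is the one that matters; the aim is to find the index value that corresponds to each of its rows. Anything that does not exist in config should not appear in the output file. I want a final output containing INDEX, TEST, NAME and SEQUENCE. So the final output will be a subset of the config file, but with only one SEQUENCE (rather than 3) and the corresponding TEST, NAME and INDEX. For example:
Sample output file: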
index TEST NAME SEQUENCE
901922 A john 111
238394 C ashley 555
930293 B sam 444
I tried writing a for loop, but had no success with the indexing, as shown below:
for x in range(0, config.shape[0]):
    find1 = eval(config.SEQUENCE[x])
    find1 = '|'.join(str(i) for i in find1)
    find1 = '(' + find1 + ')'
First DataFrame: config
SEQUENCE TEST NAME
[111,222,333] A john
[222,444,888] B sam
[111,222,333] A ashley
[999,777,555] C ashley
[111,222,333] D john
[111,222,333] A john
G kelly
Second DataFrame: index
index TEST NAME SEQUENCE
901922 A john 111
930293 B sam 444
238203 A ashley 888
238394 C ashley 555
483472 D john 777
901922 A john 111
264225 F greg 111
465126 A mary 555
554216 B peter 333
Answer 0 (score: 2)
One way to do this is to first join the two tables on test and name, then drop the rows where the sequence from index is not found in the sequence list from config:
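# Assumes lowercase column names and that config['sequence'] holds real lists (or None).
ind2 = index.set_index(['test', 'name'])
# Left-join on (test, name); the clashing sequence columns get suffixes.
out = config.join(ind2, ['test', 'name'], 'left', lsuffix='_config', rsuffix='_index')
# Flag rows whose index sequence appears in the config list.
out['sequence_config'] = out.apply(lambda x: x['sequence_index'] in x['sequence_config'] if x['sequence_config'] is not None else False, axis=1)
# Keep the flagged rows, drop duplicates and the helper column, restore the column name.
out = out[out['sequence_config']].set_index('index').drop_duplicates().drop(
    'sequence_config', axis=1).rename(columns={'sequence_index': 'sequence'})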
This gives:
name test sequence
index
901922.0 john A 111.0
930293.0 sam B 444.0
238394.0 ashley C 555.0
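The same join-then-filter idea can be sketched end to end against the question's uppercase column names. This is only a minimal sketch, not the code above verbatim: it assumes SEQUENCE in config is stored as a string such as '[111,222,333]' (empty for the kelly row), it uses merge instead of join, and the helper name parse_seq is made up for illustration.

```python
from ast import literal_eval

import pandas as pd

# Sample data copied from the question; SEQUENCE in config is assumed to be a string.
config = pd.DataFrame({
    "SEQUENCE": ["[111,222,333]", "[222,444,888]", "[111,222,333]",
                 "[999,777,555]", "[111,222,333]", "[111,222,333]", ""],
    "TEST": ["A", "B", "A", "C", "D", "A", "G"],
    "NAME": ["john", "sam", "ashley", "ashley", "john", "john", "kelly"],
})
index = pd.DataFrame({
    "index": [901922, 930293, 238203, 238394, 483472,
              901922, 264225, 465126, 554216],
    "TEST": ["A", "B", "A", "C", "D", "A", "F", "A", "B"],
    "NAME": ["john", "sam", "ashley", "ashley", "john",
             "john", "greg", "mary", "peter"],
    "SEQUENCE": [111, 444, 888, 555, 777, 111, 111, 555, 333],
})


def parse_seq(text):
    """Hypothetical helper: turn '[111,222,333]' into a list and '' into []."""
    return literal_eval(text) if text.strip() else []


config = config.assign(SEQUENCE=config["SEQUENCE"].map(parse_seq))

# Inner-join on TEST and NAME; the overlapping SEQUENCE columns get suffixes.
joined = config.merge(index, on=["TEST", "NAME"], suffixes=("_config", "_index"))

# Keep only rows whose index SEQUENCE appears in the config list.
mask = [seq in seqs
        for seq, seqs in zip(joined["SEQUENCE_index"], joined["SEQUENCE_config"])]
result = (
    joined.loc[mask, ["index", "TEST", "NAME", "SEQUENCE_index"]]
    .rename(columns={"SEQUENCE_index": "SEQUENCE"})
    .drop_duplicates()
    .reset_index(drop=True)
)
print(result)
```

With the question's data this should leave the three rows for john, sam and ashley (TEST C), matching the sample output the question asks for.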
Answer 1 (score: 0)
# Your DataFrame contains a column of strings that look like lists,
# but we want to work with a column of actual Python lists.
# Convert strings to lists with this:
import pandas as pd
from ast import literal_eval
config['SEQUENCE'] = config['SEQUENCE'].apply(literal_eval)
# Split these newly formed lists into separate columns
split = pd.concat([pd.DataFrame(config.SEQUENCE.values.tolist()),
                   config[['TEST', 'NAME']]], axis=1)
split
0 1 2 TEST NAME
0 111 222 333 A john
1 222 444 888 B sam
2 111 222 333 A ashley
3 999 777 666 C ashley
4 111 222 333 D john
5 111 222 333 A john
# Melt or "unpivot" the DF so that each row holds only one sequence
melted = split.melt(id_vars=['TEST', 'NAME'],
                    value_name='SEQUENCE').drop('variable', axis=1)
melted
TEST NAME SEQUENCE
0 A john 111
1 B sam 222
2 A ashley 111
3 C ashley 999
4 D john 111
5 A john 111
6 A john 222
7 B sam 444
8 A ashley 222
9 C ashley 777
10 D john 222
11 A john 222
12 A john 333
13 B sam 888
14 A ashley 333
15 C ashley 666
16 D john 333
17 A john 333
# Default behaviour of pd.merge gives us what we want!
# Note that the duplicate row arises from the duplicate in the DF named index.
pd.merge(melted, index)
TEST NAME SEQUENCE index
0 A john 111 901922
1 A john 111 901922
2 B sam 444 930293
pd.merge(melted, index)['index'].unique()
array([901922, 930293])
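On pandas 0.25 or later, the split and melt steps can probably be collapsed into a single DataFrame.explode call. This is a sketch of that swapped-in technique rather than the exact method shown above; it assumes SEQUENCE holds strings like '[111,222,333]' (possibly empty) and uses a reduced subset of the question's rows to stay short.

```python
from ast import literal_eval

import pandas as pd

# Reduced subset of the question's data, including the row with an empty SEQUENCE.
config = pd.DataFrame({
    "SEQUENCE": ["[111,222,333]", "[222,444,888]", "[999,777,555]", ""],
    "TEST": ["A", "B", "C", "G"],
    "NAME": ["john", "sam", "ashley", "kelly"],
})
index = pd.DataFrame({
    "index": [901922, 930293, 238394, 264225],
    "TEST": ["A", "B", "C", "F"],
    "NAME": ["john", "sam", "ashley", "greg"],
    "SEQUENCE": [111, 444, 555, 111],
})

# Parse the strings into lists; empty cells become empty lists.
config["SEQUENCE"] = config["SEQUENCE"].map(
    lambda s: literal_eval(s) if s.strip() else []
)

# explode() yields one row per list element; an empty list becomes NaN, so drop it,
# then cast back to integers so the values compare equal to index['SEQUENCE'].
long_config = (
    config.explode("SEQUENCE")
    .dropna(subset=["SEQUENCE"])
    .assign(SEQUENCE=lambda d: d["SEQUENCE"].astype("int64"))
)

# Merge on all three shared columns, as the default pd.merge above does.
matched = long_config.merge(index, on=["TEST", "NAME", "SEQUENCE"]).drop_duplicates()
print(matched[["index", "TEST", "NAME", "SEQUENCE"]])
```

Compared with building one column per list position, explode does not require every list to have the same length, which may matter if some config rows carry more or fewer than three sequence values.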