Question

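I have a problem that I have been working on for several days. I have 2 dataframes, shown below. The index dataframe defines a set of TEST, NAME and SEQUENCE triplets, which are unique. The goal is to get the index values from the index file that match:

one of the SEQUENCE values in config, e.g. [111,222,333] (so it can be 111 or 222 or 333), together with TEST and NAME.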

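The config file is the one that matters: the purpose is to find the index values that correspond to it. Anything that does not exist in config should not appear in the output file. I would like a final output containing INDEX, TEST, NAME and SEQUENCE. So the final output would be a subset of the config file, but with only one SEQUENCE (instead of 3) and the corresponding TEST, NAME and INDEX. For example: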

Example output file:

index   TEST    NAME    SEQUENCE
901922  A       john       111
238394  C       ashley     555
930293  B       sam        444

I tried writing a for loop, but had no success with the indexing, as shown below:

for x in range(0, config.shape[0]):
    find1=eval(config.SEQUENCE[x])
    find1='|'.join(str(i) for i in find1)  
    find1 = '(' + find1 + ')'

First dataframe: config

SEQUENCE    TEST    NAME
[111,222,333]   A   john
[222,444,888]   B   sam
[111,222,333]   A   ashley
[999,777,555]   C   ashley
[111,222,333]   D   john
[111,222,333]   A   john
                G   kelly

Second dataframe: index

index   TEST    NAME    SEQUENCE
901922  A       john       111
930293  B       sam        444
238203  A       ashley     888
238394  C       ashley     555
483472  D       john       777
901922  A       john       111
264225  F       greg       111
465126  A       mary       555
554216  B       peter      333

Answer 1

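One way to do this is to first join the two tables on test and name, and then drop the rows where the sequence from index is not found in the sequence list from config: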

ind2 = index.set_index(['test', 'name'])
out = config.join(ind2, ['test', 'name'], 'left', lsuffix='_config', rsuffix='_index')
out['sequence_config'] = out.apply(lambda x: x['sequence_index'] in x['sequence_config'] if x['sequence_config'] is not None else False, axis=1)

out = out[out['sequence_config']].set_index('index').drop_duplicates().drop(
    'sequence_config', axis=1).rename(columns={'sequence_index': 'sequence'})

This gives:

            name test  sequence
index                          
901922.0    john    A     111.0
930293.0     sam    B     444.0
238394.0  ashley    C     555.0
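
For anyone who wants to run the join-then-filter idea end to end, here is a minimal self-contained sketch. It is not the exact code above: it assumes the uppercase column names from the question, assumes SEQUENCE in config already holds Python lists (with None for the empty kelly row), and uses an inner merge instead of a left join, so rows with no TEST/NAME partner drop out immediately. The names joined, mask and result are only illustrative.

```python
import pandas as pd

# Sample data from the question; SEQUENCE is assumed to be real lists already.
config = pd.DataFrame({
    "SEQUENCE": [[111, 222, 333], [222, 444, 888], [111, 222, 333],
                 [999, 777, 555], [111, 222, 333], [111, 222, 333], None],
    "TEST": ["A", "B", "A", "C", "D", "A", "G"],
    "NAME": ["john", "sam", "ashley", "ashley", "john", "john", "kelly"],
})
index = pd.DataFrame({
    "index": [901922, 930293, 238203, 238394, 483472,
              901922, 264225, 465126, 554216],
    "TEST": ["A", "B", "A", "C", "D", "A", "F", "A", "B"],
    "NAME": ["john", "sam", "ashley", "ashley", "john",
             "john", "greg", "mary", "peter"],
    "SEQUENCE": [111, 444, 888, 555, 777, 111, 111, 555, 333],
})

# Inner join on TEST and NAME; the two SEQUENCE columns get suffixes.
joined = config.merge(index, on=["TEST", "NAME"],
                      suffixes=("_config", "_index"))

# Keep only rows whose index SEQUENCE appears in the config list.
mask = [isinstance(seq_list, list) and seq in seq_list
        for seq_list, seq in zip(joined["SEQUENCE_config"],
                                 joined["SEQUENCE_index"])]

result = (joined[mask]
          .drop(columns="SEQUENCE_config")
          .rename(columns={"SEQUENCE_index": "SEQUENCE"})
          .drop_duplicates()
          .reset_index(drop=True)[["index", "TEST", "NAME", "SEQUENCE"]])
print(result)
```

With the question's sample data this should leave three rows: john (901922, 111), sam (930293, 444) and ashley (238394, 555).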

Answer 2

# Your DataFrame contains a column of strings that look like lists,
# but we want to work with a column of actual Python lists.
# Convert strings to lists with this:
import pandas as pd
from ast import literal_eval
config['SEQUENCE'] = config['SEQUENCE'].apply(literal_eval)

# Split these newly formed lists into separate columns
split = pd.concat([pd.DataFrame(config.SEQUENCE.values.tolist()), 
                   config[['TEST', 'NAME']]], axis=1)
split
     0    1    2 TEST    NAME
0  111  222  333    A    john
1  222  444  888    B     sam
2  111  222  333    A  ashley
3  999  777  666    C  ashley
4  111  222  333    D    john
5  111  222  333    A   john


# Melt or "unpivot" the DF so that each row holds only one sequence
melted = split.melt(id_vars=['TEST', 'NAME'], 
                    value_name='SEQUENCE').drop('variable', axis=1)
melted
   TEST    NAME  SEQUENCE
0     A    john       111
1     B     sam       222
2     A  ashley       111
3     C  ashley       999
4     D    john       111
5     A   john       111
6     A    john       222
7     B     sam       444
8     A  ashley       222
9     C  ashley       777
10    D    john       222
11    A   john       222
12    A    john       333
13    B     sam       888
14    A  ashley       333
15    C  ashley       666
16    D    john       333
17    A   john       333


# Default behaviour of pd.merge gives us what we want!
# Note that the duplicate row arises from the duplicate in the DF named index.
pd.merge(melted, index)
  TEST  NAME  SEQUENCE   index
0    A  john       111  901922
1    A  john       111  901922
2    B   sam       444  930293


pd.merge(melted, index)['index'].unique()
array([901922, 930293])
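
On newer pandas versions (0.25 and later), DataFrame.explode can probably stand in for the split-and-melt steps. The following is only a sketch under some assumptions: SEQUENCE in config already holds Python lists (for example after literal_eval), rows with no SEQUENCE are dropped first, and the names exploded and matches are just illustrative.

```python
import pandas as pd

# Sample data from the question; SEQUENCE is assumed to be real lists already.
config = pd.DataFrame({
    "SEQUENCE": [[111, 222, 333], [222, 444, 888], [111, 222, 333],
                 [999, 777, 555], [111, 222, 333], [111, 222, 333], None],
    "TEST": ["A", "B", "A", "C", "D", "A", "G"],
    "NAME": ["john", "sam", "ashley", "ashley", "john", "john", "kelly"],
})
index = pd.DataFrame({
    "index": [901922, 930293, 238203, 238394, 483472,
              901922, 264225, 465126, 554216],
    "TEST": ["A", "B", "A", "C", "D", "A", "F", "A", "B"],
    "NAME": ["john", "sam", "ashley", "ashley", "john",
             "john", "greg", "mary", "peter"],
    "SEQUENCE": [111, 444, 888, 555, 777, 111, 111, 555, 333],
})

# One row per (TEST, NAME, single SEQUENCE value).
exploded = (config.dropna(subset=["SEQUENCE"])   # drop the empty kelly row
                  .explode("SEQUENCE")
                  .astype({"SEQUENCE": "int64"}))  # match index's dtype

# Inner merge on all three keys, then drop the repeated john rows.
matches = (exploded.merge(index, on=["TEST", "NAME", "SEQUENCE"])
                   .drop_duplicates()
                   .reset_index(drop=True))
print(matches[["index", "TEST", "NAME", "SEQUENCE"]])
```

With the question's data, where the C/ashley list contains 555, this should return the rows for 901922, 930293 and 238394.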

Matching dataframe values in Python
