Finding matching substrings in two dataframes

Time: 2017-08-02 19:49:41

Tags: python pandas string-matching

I have two dataframes like this:

[in]print(training_df.head(n=10))

[out]
                          product_id
transaction_id                      
0000001                   [P06, P09]
0000002         [P01, P05, P06, P09]
0000003                   [P01, P06]
0000004                   [P01, P09]
0000005                   [P06, P09]
0000006                   [P02, P09]
0000007         [P01, P06, P09, P10]
0000008                   [P03, P05]
0000009                   [P03, P09]
0000010         [P03, P05, P06, P09]

[in]print(testing_df.head(n=10))

[out]
                     product_id
transaction_id                 
001                       [P01]
002                  [P01, P02]
003             [P01, P02, P09]
004                  [P01, P03]
005             [P01, P03, P05]
006             [P01, P03, P07]
007             [P01, P03, P08]
008                  [P01, P04]
009             [P01, P04, P05]
010             [P01, P04, P08]

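Each row in testing_df is a possible "substring" of a row in training_df. I want to find all matches and return, for each list in testing_df, the possible lists from training_df. Ideally I would return a dictionary whose keys are the transaction_id values from testing_df and whose values are all the possible "matches" in training_df. (Each matching list in training_df should be one value longer than the corresponding list in testing_df.)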

I tried:

# Find the substrings that match
matches = []

for string in training_df:
    results = []
    for substring in testing_df:
        if substring in string:
            results.append(substring)
    if results:
        matches.append(results)  

But this doesn't work; it only returns the column name 'product_id'.

I also tried:

# Initialize a list to store the matches between incomplete testing_df and training_df
matches = {}

# Compare the "incomplete" testing lists to the training set
for line in testing_df.product_id:
    for line in training_df.product_id:
        if line in testing_df.product_id in line in training_df.product_id:
            matches[line] = training_df[training_df.product_id.str.contains(line)]

However, this raises the error TypeError: unhashable type: 'list'

1 Answer:

Answer 0: (score: 1)

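I think the problem is the brackets. The issue is that in checks whether an element is in a list, not whether one list is a subset of another. You can convert both lists to sets and then check whether one is a subset of the other. You can also use advanced indexing to retain the transaction_id.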
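As a quick illustration of that difference, here is a minimal sketch using one pair of lists taken from the data above. The variable names are made up for the example. It also shows why a list cannot be used directly as a dictionary key, which is one place an "unhashable type" error can come from.

```python
# One training list and a shorter candidate list (names are illustrative only)
training_items = ["P01", "P05", "P06", "P09"]
testing_items = ["P01", "P06"]

# `in` asks whether the whole testing list is a single element of the
# training list, which it is not
as_element = testing_items in training_items

# A subset test asks whether every product in the testing list also
# appears in the training list
as_subset = set(testing_items).issubset(training_items)

# Lists are mutable and therefore unhashable, so they cannot be dict keys
try:
    {}[testing_items] = "anything"
    error_message = ""
except TypeError as exc:
    error_message = str(exc)

# A string or tuple version of the list is hashable and works as a key
lookup = {tuple(testing_items): as_subset}
```

Note that a set comparison ignores order and duplicates, so this treats "substring" as "subset" rather than as a contiguous run.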

import pandas as pd

training_df = pd.DataFrame([
    ['0000001', ['P06', 'P09']],
    ['0000002', ['P01', 'P05', 'P06', 'P09']],
    ['0000003', ['P01', 'P06']],
    ['0000004', ['P01', 'P09']],
    ['0000005', ['P06', 'P09']],
    ['0000006', ['P02', 'P09']],
    ['0000007', ['P01', 'P06', 'P09', 'P10']],
    ['0000008', ['P03', 'P05']],
    ['0000009', ['P03', 'P09']],
    ['0000010', ['P03', 'P05', 'P06', 'P09']],
], columns=['transaction_id', 'product_id'])

testing_df = pd.DataFrame([
    ['001', ['P01']],
    ['002', ['P01', 'P02']],
    ['003', ['P01', 'P02', 'P09']],
    ['004', ['P01', 'P03']],
    ['005', ['P01', 'P03', 'P05']],
    ['006', ['P01', 'P03', 'P07']],
    ['007', ['P01', 'P03', 'P08']],
    ['008', ['P01', 'P04']],
    ['009', ['P01', 'P04', 'P05']],
    ['010', ['P01', 'P04', 'P08']],
], columns=['transaction_id', 'product_id'])

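# Map each testing list (as a string key, because lists are unhashable) to a
# boolean Series marking the training rows that contain all of its products
matches = {}
for testing_id in testing_df.product_id:
    testing_id_set = set(testing_id)
    contains_id = training_df.product_id.apply(lambda ids: testing_id_set.issubset(set(ids)))
    matches[str(testing_id)] = contains_id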
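
If the goal is the dictionary described in the question (testing transaction_id mapped to the matching training transaction_id values), something along these lines might work. This is only a sketch: find_matches and its exactly_one_longer flag are hypothetical names introduced here, and it assumes transaction_id is the index of both frames, as in the printed output in the question, and that order and duplicates within a list do not matter.

```python
import pandas as pd

# Training data from the question, indexed by transaction_id
training_df = pd.DataFrame(
    {"product_id": [
        ["P06", "P09"], ["P01", "P05", "P06", "P09"], ["P01", "P06"],
        ["P01", "P09"], ["P06", "P09"], ["P02", "P09"],
        ["P01", "P06", "P09", "P10"], ["P03", "P05"], ["P03", "P09"],
        ["P03", "P05", "P06", "P09"],
    ]},
    index=pd.Index([f"{n:07d}" for n in range(1, 11)], name="transaction_id"),
)

# First three testing rows from the question
testing_df = pd.DataFrame(
    {"product_id": [["P01"], ["P01", "P02"], ["P01", "P02", "P09"]]},
    index=pd.Index(["001", "002", "003"], name="transaction_id"),
)


def find_matches(testing, training, exactly_one_longer=False):
    """Map each testing transaction_id to the training ids that contain it.

    Hypothetical helper: with exactly_one_longer=True, only training lists
    holding exactly one more distinct product than the testing list are kept.
    """
    # Convert each training list to a set once, keyed by transaction_id
    training_sets = {tid: set(items) for tid, items in training["product_id"].items()}
    matches = {}
    for test_id, items in testing["product_id"].items():
        wanted = set(items)
        matches[test_id] = [
            tid
            for tid, candidate in training_sets.items()
            if wanted.issubset(candidate)
            and (not exactly_one_longer or len(candidate) == len(wanted) + 1)
        ]
    return matches


matches = find_matches(testing_df, training_df)
strict_matches = find_matches(testing_df, training_df, exactly_one_longer=True)
```

Here the keys are the hashable transaction_id strings rather than the lists themselves, and a testing row with no match simply maps to an empty list.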