I have two DataFrames like this:
[in]print(training_df.head(n=10))
[out]
                product_id
transaction_id
0000001         [P06, P09]
0000002         [P01, P05, P06, P09]
0000003         [P01, P06]
0000004         [P01, P09]
0000005         [P06, P09]
0000006         [P02, P09]
0000007         [P01, P06, P09, P10]
0000008         [P03, P05]
0000009         [P03, P09]
0000010         [P03, P05, P06, P09]
[in]print(testing_df.head(n=10))
[out]
                product_id
transaction_id
001             [P01]
002             [P01, P02]
003             [P01, P02, P09]
004             [P01, P03]
005             [P01, P03, P05]
006             [P01, P03, P07]
007             [P01, P03, P08]
008             [P01, P04]
009             [P01, P04, P05]
010             [P01, P04, P08]
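Each row in testing_df is a possible "substring" of a row in training_df. I want to find all matches and return, for each list in testing_df, the possible training_df lists. Ideally I would return a dictionary whose keys are the transaction_id values from testing_df and whose values are all the possible "matches" in training_df. (Each matching list in training_df should be exactly one value longer than the corresponding list in testing_df.)
I tried: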
# Find the substrings that match
matches = []
for string in training_df:
    results = []
    for substring in testing_df:
        if substring in string:
            results.append(substring)
    if results:
        matches.append(results)
But this doesn't work; it only returns the column name 'product_id'.
I also tried:
# Initialize a dict to store the matches between incomplete testing_df and training_df
matches = {}
# Compare the "incomplete" testing lists to the training set
for line in testing_df.product_id:
    for line in training_df.product_id:
        if line in testing_df.product_id in line in training_df.product_id:
            matches[line] = training_df[training_df.product_id.str.contains(line)]
However, this raises the error TypeError: unhashable type: 'list'
Answer 0 (score: 1)
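I think the problem is the brackets. The in operator checks whether an element is in a list, not whether one list is a subset of another. You can convert both lists to sets and then check whether one is a subset of the other. You can also use advanced indexing to retain the transaction_id: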
import pandas as pd

training_df = pd.DataFrame([
    ['0000001', ['P06', 'P09']],
    ['0000002', ['P01', 'P05', 'P06', 'P09']],
    ['0000003', ['P01', 'P06']],
    ['0000004', ['P01', 'P09']],
    ['0000005', ['P06', 'P09']],
    ['0000006', ['P02', 'P09']],
    ['0000007', ['P01', 'P06', 'P09', 'P10']],
    ['0000008', ['P03', 'P05']],
    ['0000009', ['P03', 'P09']],
    ['0000010', ['P03', 'P05', 'P06', 'P09']],
], columns=['transaction_id', 'product_id'])
testing_df = pd.DataFrame([
    ['001', ['P01']],
    ['002', ['P01', 'P02']],
    ['003', ['P01', 'P02', 'P09']],
    ['004', ['P01', 'P03']],
    ['005', ['P01', 'P03', 'P05']],
    ['006', ['P01', 'P03', 'P07']],
    ['007', ['P01', 'P03', 'P08']],
    ['008', ['P01', 'P04']],
    ['009', ['P01', 'P04', 'P05']],
    ['010', ['P01', 'P04', 'P08']],
], columns=['transaction_id', 'product_id'])

matches = {}
for testing_id in testing_df.product_id:
    # Lists are unhashable, so compare them as sets
    testing_id_set = set(testing_id)
    # Boolean Series, aligned with training_df's index: True where the
    # training list contains every product in the testing list
    contains_id = training_df.product_id.apply(lambda products: testing_id_set.issubset(set(products)))
    # A list cannot be a dict key, so its string form is used instead
    matches[str(testing_id)] = contains_id
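The loop above stores a boolean mask per testing list. If you want the dictionary described in the question (keyed by the testing transaction_id, with the matching training transaction_id values), something like the following might work. This is only a sketch under a few assumptions: transaction_id is an ordinary column as in the frames above, duplicate products within one list can be ignored, and find_matches is a hypothetical helper name, not part of pandas. It uses a small subset of the data to stay short.

```python
import pandas as pd

# Small subset of the frames above, with transaction_id as a regular column
training_df = pd.DataFrame({
    "transaction_id": ["0000001", "0000002", "0000003", "0000004"],
    "product_id": [
        ["P06", "P09"],
        ["P01", "P05", "P06", "P09"],
        ["P01", "P06"],
        ["P01", "P09"],
    ],
})
testing_df = pd.DataFrame({
    "transaction_id": ["001", "002"],
    "product_id": [["P01"], ["P01", "P02"]],
})


def find_matches(testing_df, training_df):
    """Map each testing transaction_id to the training transaction_ids whose
    product list contains the testing list plus exactly one extra product."""
    # Convert each training list to a set once, paired with its id
    training_sets = [
        (train_id, set(products))
        for train_id, products in zip(training_df["transaction_id"],
                                      training_df["product_id"])
    ]
    matches = {}
    for test_id, products in zip(testing_df["transaction_id"],
                                 testing_df["product_id"]):
        test_set = set(products)
        matches[test_id] = [
            train_id
            for train_id, train_set in training_sets
            # subset check plus the "one value longer" condition
            if test_set <= train_set and len(train_set) == len(test_set) + 1
        ]
    return matches


matches = find_matches(testing_df, training_df)
print(matches)  # {'001': ['0000003', '0000004'], '002': []}
```

If transaction_id is the index instead (as in the printed output in the question), calling reset_index() on both frames first should give the same layout. Drop the length condition if any superset should count as a match.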