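I need some help with a DataFrame problem using Python and pandas.
I have two columns, "data" and "full_data". If any subset of "data" occurs in "full_data", I need the matching subset values in a new column named "new_finding".
The output I need, with the new column "new_finding":
data | full_data | new_finding |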
---|---|---|
123456 | 123456789 | [123456] |
345643 | 456432345876 | [456,345,43] |
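For anyone who wants to reproduce this, the two sample rows can be built roughly like so. This is only a sketch: the question doesn't say whether the columns hold integers or strings, so the integer dtype here is an assumption.

```python
import pandas as pd

# Assumption: both columns are integers; the question does not state the dtype.
df = pd.DataFrame({
    "data": [123456, 345643],
    "full_data": [123456789, 456432345876],
})
print(df)
```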
Answer 0 (score: 1)
See if this works for you:
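import re
from itertools import permutations

def combs(letters):
    # Yield every ordering of every non-empty selection of the characters.
    for n in range(1, len(letters) + 1):
        yield from map(''.join, permutations(letters, n))

# df is the question's DataFrame with 'data' and 'full_data' columns.
# For each candidate string, collect its occurrences in full_data.
df['new_finding'] = df.apply(lambda x: [re.findall(comb, str(x['full_data'])) for comb in combs(str(x['data']))], axis=1)
# Drop candidates that never matched.
df['new_finding'] = df['new_finding'].apply(lambda row: [x for x in row if x != []])
# De-duplicate; going through a set means the result order is arbitrary.
df['new_finding'] = df['new_finding'].apply(lambda row: [list(x) for x in set(tuple(x) for x in row)])
# Keep a single copy of each matched string.
df['new_finding'] = df['new_finding'].apply(lambda row: [item[0] for item in row])
df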
Output
data | full_data | new_finding |
---|---|---|
123456 | 123456789 | [45, 1234, 6, 23, 123456, 4, 123, 3456, 12, 5, 3, 12345, 23456, 1, 56, 2345, 234, 345, 2, 34, 456] |
345643 | 456432345876 | [345, 5, 564, 45, 45643, 6, 4, 34, 643, 43, 56, 4564, 5643, 456, 3, 64] |
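One thing to be aware of: going through `permutations` tries every reordering of the digits, which is 1956 candidates for a 6-digit value and grows factorially from there. If only contiguous runs of "data" matter, a sketch like the one below may be enough, since a 6-digit value has just 21 contiguous slices. Note this is a different technique from the one above: it never reports a reordered match. `common_substrings` and `maximal_only` are hypothetical helper names, the integer sample frame is an assumption, and the "maximal" filter is just one guess at what the question is after (it does not reproduce the `[456,345,43]` shown in the question).

```python
import pandas as pd

def common_substrings(data, full_data):
    """Distinct contiguous slices of `data` that occur in `full_data`, longest first."""
    data, full_data = str(data), str(full_data)
    found = {
        data[i:j]
        for i in range(len(data))
        for j in range(i + 1, len(data) + 1)
        if data[i:j] in full_data
    }
    # Sort for a stable, readable order (a bare set has no guaranteed order).
    return sorted(found, key=lambda s: (-len(s), s))

def maximal_only(matches):
    """Drop any match that is contained in another, longer match."""
    return [m for m in matches
            if not any(m != other and m in other for other in matches)]

# Assumption: sample rows taken from the question, stored as integers.
df = pd.DataFrame({"data": [123456, 345643],
                   "full_data": [123456789, 456432345876]})
df["new_finding"] = [common_substrings(d, f)
                     for d, f in zip(df["data"], df["full_data"])]
df["maximal"] = df["new_finding"].apply(maximal_only)
print(df)
```

On the two sample rows, `new_finding` holds the same values as the output above (as strings, sorted longest first), and `maximal` keeps only the matches that are not part of a longer match.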