Question

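I have a pandas DataFrame, and I need to extract a substring from each row of a column based on the following conditions:

We have a start_list ('one', 'once I', 'he') and an end_list ('fine', 'one', 'well').
The substring should be preceded by an element of start_list.
The substring may be followed by an element of end_list.
Whenever any element of start_list is present, the substring that follows it should be extracted, whether or not an element of end_list is present.

Example problem: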

import pandas as pd

df = pd.DataFrame({'a' : ['one was fine today', 'we had to drive', ' ',
                          'I think once I was fine eating ham ',
                          'he studies really well and is polite ',
                          'one had to live well and prosper',
                          '43948785943one by onej89044809', '827364hjdfvbfv',
                          '&^%$&*+++===========one kfnv dkfjn uuoiu fine',
                          'they is one who makes me crazy'],
                   'b' : ['11', '22', '33', '44', '55', '66', '77', '', '88',
                          '99']})

Expected result:

df = pd.DataFrame({'a' : ['was', '','','was ','studies really','had to live',
                       'by','','kfnv dkfjn uuoiu','who makes me crazy'],
                   'b' : ['11', '22', '33', '44', '55', '66', '77', '', 
                       '88','99']})

Answer 1

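I think this will work for you. This solution of course requires pandas, plus the built-in library functools.

Function: remove_preceders

This function takes a collection of words, start_list, and a str, string, as input. It checks whether any item in start_list occurs in string and, if so, returns only the part of string that comes after that item. Otherwise, it returns the original string.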

def remove_preceders(start_list, string):
    for word in start_list:
        if word in string:
            # Keep only the text after the first occurrence of word.
            string = string[string.find(word) + len(word):]
    return string
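
To make the behaviour concrete, here is a small self-contained check that repeats the function and calls it on two sentences from the question. The third sentence is made up for illustration (it is not in the question) and shows that the loop applies every matching word in turn, so more than one cut can happen.

```python
def remove_preceders(start_list, string):
    for word in start_list:
        if word in string:
            string = string[string.find(word) + len(word):]
    return string

start_list = ('one', 'once I', 'he')

# A start word is present: everything up to and including it is dropped.
trimmed = remove_preceders(start_list, 'one was fine today')
print(repr(trimmed))    # ' was fine today'

# No start word is present: the string comes back unchanged.
untouched = remove_preceders(start_list, 'we had to drive')
print(repr(untouched))  # 'we had to drive'

# Hypothetical input: 'one' cuts first, then 'he' cuts the remainder again.
cascaded = remove_preceders(start_list, 'one day he left')
print(repr(cascaded))   # ' left'
```

Note that the leading space is kept; stripping whitespace is a separate step if it is wanted.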

Function: remove_succeeders

This function is very similar to the first, except that it returns only the part of string that comes before an item in end_list.

def remove_succeeders(end_list, string):
    for word in end_list:
        if word in string:
            # Keep only the text before the first occurrence of word.
            string = string[:string.find(word)]
    return string
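
As a quick illustration, the sketch below repeats the function and runs it on one already-trimmed sentence and on the raw sentence. It also suggests why the order of the two steps matters here: 'one' appears in both lists, so running this function on the raw text can cut away everything.

```python
def remove_succeeders(end_list, string):
    for word in end_list:
        if word in string:
            string = string[:string.find(word)]
    return string

end_list = ('fine', 'one', 'well')

# After the start word has been stripped, only the text before 'fine' remains.
after_start = remove_succeeders(end_list, ' was fine today')
print(repr(after_start))  # ' was '

# On the raw sentence, 'one' is also an end word and sits at index 0,
# so the second cut leaves an empty string.
on_raw = remove_succeeders(end_list, 'one was fine today')
print(repr(on_raw))       # ''
```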

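Function: to_apply

How do you actually run the functions above? The apply method lets you run complex functions on a DataFrame or a Series, but it calls the function with either a full row or a single value as input, depending on whether you run it on a DataFrame or a Series, respectively.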
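
A minimal sketch of that difference, using a tiny made-up DataFrame rather than the question's data. One detail worth stating: DataFrame.apply passes rows only when axis=1 is given; with the default axis=0 it passes columns.

```python
import pandas as pd

# Tiny illustrative frame (not the question's data).
df = pd.DataFrame({'a': ['one was fine', 'we'], 'b': ['11', '22']})

# Series.apply passes one cell value at a time to the function.
lengths = df['a'].apply(len)
print(lengths.tolist())  # [12, 2]

# DataFrame.apply with axis=1 passes one whole row (as a Series) at a time.
joined = df.apply(lambda row: row['a'] + row['b'], axis=1)
print(joined.tolist())   # ['one was fine11', 'we22']
```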

to_apply takes the function to run and the collection of words to check as input, and we can use it to run the two functions above:

import functools

def to_apply(func, words_to_check):
    # Fix the word list as the first argument; the result takes only the string.
    return functools.partial(func, words_to_check)
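
The sketch below shows what the returned object looks like in use: functools.partial fixes the word list as the first argument, leaving a callable that takes only the string, which is the shape Series.apply expects. The name strip_start is chosen here for illustration, and the lambda is just an equivalent spelling for comparison.

```python
import functools

def remove_preceders(start_list, string):
    for word in start_list:
        if word in string:
            string = string[string.find(word) + len(word):]
    return string

def to_apply(func, words_to_check):
    return functools.partial(func, words_to_check)

start_list = ('one', 'once I', 'he')

# strip_start is a one-argument callable; the word list is already bound.
strip_start = to_apply(remove_preceders, start_list)
via_partial = strip_start('one was fine today')
print(repr(via_partial))  # ' was fine today'

# An equivalent lambda, shown only for comparison.
strip_start_lambda = lambda s: remove_preceders(start_list, s)
via_lambda = strip_start_lambda('one was fine today')
print(via_partial == via_lambda)  # True
```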

How to run

df['no_preceders'] = df.a.apply(
    to_apply(remove_preceders, ('one', 'once I', 'he')))
df['no_succeders'] = df.a.apply(
    to_apply(remove_succeeders, ('fine', 'one', 'well')))
df['substring'] = df.no_preceders.apply(
    to_apply(remove_succeeders, ('fine', 'one', 'well')))

Then, the last step is to blank out the entries in the substring column that were not affected by the filtering:

def final_cleanup(row):
    if len(row['a']) == len(row['substring']):
        return ''
    else:
        return row['substring']

df['substring'] = df.apply(final_cleanup, axis=1)

Result
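
As a rough check, the steps above can be combined into one script and run on the question's data. This is a sketch rather than the answer's original output. The final str.strip() call is an addition made here, on the assumption that surrounding whitespace should be dropped so the values can be compared with the expected result in the question.

```python
import functools
import pandas as pd

def remove_preceders(start_list, string):
    for word in start_list:
        if word in string:
            string = string[string.find(word) + len(word):]
    return string

def remove_succeeders(end_list, string):
    for word in end_list:
        if word in string:
            string = string[:string.find(word)]
    return string

def to_apply(func, words_to_check):
    return functools.partial(func, words_to_check)

def final_cleanup(row):
    # An unchanged length means neither list matched, so return ''.
    if len(row['a']) == len(row['substring']):
        return ''
    return row['substring']

df = pd.DataFrame({
    'a': ['one was fine today', 'we had to drive', ' ',
          'I think once I was fine eating ham ',
          'he studies really well and is polite ',
          'one had to live well and prosper',
          '43948785943one by onej89044809', '827364hjdfvbfv',
          '&^%$&*+++===========one kfnv dkfjn uuoiu fine',
          'they is one who makes me crazy'],
    'b': ['11', '22', '33', '44', '55', '66', '77', '', '88', '99'],
})

start_list = ('one', 'once I', 'he')
end_list = ('fine', 'one', 'well')

df['no_preceders'] = df['a'].apply(to_apply(remove_preceders, start_list))
df['substring'] = df['no_preceders'].apply(to_apply(remove_succeeders, end_list))
df['substring'] = df.apply(final_cleanup, axis=1)

# Assumption added here: strip surrounding whitespace before comparing.
df['substring'] = df['substring'].str.strip()
substrings = df['substring'].tolist()
print(substrings)
# ['was', '', '', 'was', 'studies really', 'had to live', 'by', '',
#  'kfnv dkfjn uuoiu', 'who makes me crazy']
```

With the strip applied, the list agrees with the expected column in the question except for the fourth entry, which the question writes as 'was ' with a trailing space.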

Hope this works.

Extracting a substring between multiple words in a pandas DataFrame
