Question

这是此stackoverflow问题的后续操作

Pandas: How to return rows where a column has a line breaks/new line ( \n ) with one of several case-sensitive words coming directly after?

哪种方法可以返回包含几个区分大小写的单词之一的行，并在换行符'\ n'之后。

我现在想返回的行包含换行后包含最少数量的这些区分大小写的单词。

在下面的最小示例中，我试图从特定集合中获取至少包含三个字符串的行。

testdf = pd.DataFrame([
    [ ' generates the final summary. \nRESULTS \nMethods We evaluate the performance of ', ], 
                       [ 'the cat and bat \n\n\nRESULTS\n BACKGROUND teamed up to find some food'], 
                       ['anthropology with RESULTS \n\n\nMETHODS\n pharmacology and biology'],
    [ ' generates the final summary. \nMethods \nBACKGROUND We evaluate the performance of ', ], 
                       [ 'the cat and bat \n\n\nMETHODS\n teamed up to find some food'], 
                       ['anthropology with METHODS pharmacology and biology'],
        [ ' generates the final summary. \nBACKGROUND We evaluate the performance of ', ], 
                       [ 'the cat and bat \n\n\nBackground\n teamed up to find some food'], 
                       ['anthropology with \nBACKGROUND with \nRESULTS pharmacology and biology'],
    [ ' generates the final summary. \nBACKGROUND We \nRESULTS  evaluate \nCONCLUSIONS the performance of ', ]  
])
testdf.columns = ['A']
testdf.head(10)

返回

A
0   generates the final summary. \nRESULTS \nMethods We evaluate the performance of
1   the cat and bat \n\n\nRESULTS\n BACKGROUND teamed up to find some food
2   anthropology with RESULTS \n\n\nMETHODS\n pharmacology and biology
3   generates the final summary. \nMethods \nBACKGROUND We evaluate the performance of
4   the cat and bat \n\n\nMETHODS\n teamed up to find some food
5   anthropology with METHODS pharmacology and biology
6   generates the final summary. \nBACKGROUND We evaluate the performance of
7   the cat and bat \n\n\nBackground\n teamed up to find some food
8   anthropology with \nBACKGROUND with \nRESULTS pharmacology and biology
9   generates the final summary. \nBACKGROUND We \nRESULTS evaluate \nCONCLUSIONS the performance of

然后

listStrings = { '\nRESULTS',  '\nMETHODS' ,  '\nBACKGROUND' , '\nCONCLUSIONS', '\nEXPERIMENT'}
testdf.loc[testdf.A.apply(lambda x: len(listStrings.intersection(x.split())) >= 3)]

将不返回任何内容。

所需结果将仅返回最后一行。

9   generates the final summary. \nBACKGROUND We \nRESULTS evaluate \nCONCLUSIONS the performance of

因为这是唯一一行，其中至少包含3个指定的区分大小写的单词，并紧随新行。

Answer 1

使用str.findall

进行检查

testdf[testdf.A.str.findall('|'.join(listStrings)).str.len()>=3]
                                                   A
9   generates the final summary. \nBACKGROUND We ...

Answer 2

使用str.findall：

>>> testdf[testdf['A'].str.findall('|'.join(listStrings)).map(len)>=3]
                                                   A
9   generates the final summary. \nBACKGROUND We ...
>>>

熊猫：返回的行包含最少数量的区分大小写的单词，并且每个单词都以换行符（'\ n'）

2 个答案: