Question

I have a data frame A with a column called text, which holds long strings. I want to keep the rows of A whose text contains any of the strings in the list author_id.

A data frame:
Dialogue Index  author_id   text
10190       0    573660    How is that even possible?
10190       1    23442     @573660 I do apologize. 
10190       2    573661    @AAA do you still have the program for free checked bags? 

author_id list:
[573660, 573678, 5736987]

Since 573660 is in the author_id list and also appears in the text column of A, my expected result is to keep only the second row of data frame A:

 Dialogue   Index   author_id   text
 10190        1       23442     @573660 I do apologize.

The simplest solution I can think of is:

 new_A=pd.DataFrame()
 for id in author_id:
      new_A.append(A[A['text'].str.contains(id, na=False)])

But this takes a long time.

So I came up with this solution:

[id in text for id in author_id for text in df['text'] ]

But this does not work for subsetting the data frame, because for each author ID I get a True/False value for every string in df['text'].

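So I created a new column in the data frame that combines Dialogue and Index, so that I could return that column from the list comprehension, but it raises an error that I don't know how to interpret.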

A["DialogueIndex"]= df["Dialogue"].map(str) + df["Index"]

newA = [did for did in df["DialogueIndex"]  for id in author_id if df['text'].str.contains(id)  ]

error: ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

Please help.

Answer 1

Just use str.contains to check whether text contains any of the authors in your list (by joining all of the authors together with |).

import pandas as pd
df = pd.DataFrame({
    'Dialogue': [10190, 10190, 10190],
    'Index': [0,1,2],
    'author_id': [573660,23442,573661],
    'text': ['How is that even possible?', 
             '@573660 I do apologize.',
            '@AAA do you still have the program for free checked bags?']
})
author_id_list = [573660, 573678, 5736987]

df.text.str.contains('|'.join(list(map(str, author_id_list))))
#0    False
#1     True
#2    False
#Name: text, dtype: bool

You can then use that result to mask the original DataFrame:

df[df.text.str.contains('|'.join(list(map(str, author_id_list))))]
#   Dialogue  Index  author_id                     text
#1     10190      1      23442  @573660 I do apologize.

If your author_id_list already contains strings, you can drop the list(map(...)) and join the original list directly.
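
As a minimal sketch of that case, assuming the IDs are already stored as strings (the name `author_id_strs` is made up for illustration), the pattern can be built with a plain join. Wrapping each ID in `re.escape` and passing `na=False` are optional precautions added here, not part of the answer above; escaping only matters if an ID could contain regex metacharacters.

```python
import re
import pandas as pd

df = pd.DataFrame({
    'Dialogue': [10190, 10190, 10190],
    'Index': [0, 1, 2],
    'author_id': [573660, 23442, 573661],
    'text': ['How is that even possible?',
             '@573660 I do apologize.',
             '@AAA do you still have the program for free checked bags?'],
})

# Hypothetical: the IDs are already strings, so no map(str, ...) is needed.
author_id_strs = ['573660', '573678', '5736987']

# Build one alternation pattern: '573660|573678|5736987'.
pattern = '|'.join(re.escape(s) for s in author_id_strs)

# na=False treats missing text as "no match" instead of propagating NaN.
mask = df['text'].str.contains(pattern, na=False)
result = df[mask]
print(result)
```

Keep in mind that this is substring matching, so an ID such as 573660 would also match inside a longer number like 5736601.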

Answer 2

You can use apply and check whether any item in author_id_list is in the text:

df[df.text.apply(lambda x: any(str(e) in x for e in author_id_list))]


Dialogue    Index   author_id   text
1   10190   1   23442   @573660 I do apologize.

There may be a faster way, but I believe this gets you the answer you want.
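
If you need this in more than one place, the same check can be wrapped in a small function. This is only a sketch: `rows_mentioning` is a name made up here, and the `isinstance` guard is an assumption added so that rows with missing text are skipped rather than raising. The fourth row in the sample data is also added purely to illustrate that case.

```python
import pandas as pd

def rows_mentioning(df, ids, column='text'):
    """Return the rows of df whose `column` contains any of `ids` as a substring."""
    needles = [str(i) for i in ids]  # convert the IDs to strings once, not per row
    mask = df[column].apply(
        # Non-string values (e.g. None/NaN) never match.
        lambda t: isinstance(t, str) and any(n in t for n in needles)
    )
    return df[mask]

df = pd.DataFrame({
    'Dialogue': [10190, 10190, 10190, 10190],
    'Index': [0, 1, 2, 3],
    'author_id': [573660, 23442, 573661, 573662],
    'text': ['How is that even possible?',
             '@573660 I do apologize.',
             '@AAA do you still have the program for free checked bags?',
             None],  # illustrative row with missing text
})
author_id_list = [573660, 573678, 5736987]

kept = rows_mentioning(df, author_id_list)
print(kept)
```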

Subset a pandas data frame using a list comprehension
