Question

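I have a large amount of data, on the order of 100k rows. I am trying to drop a row from a dataframe if that row, which holds a list, contains a value from another dataframe. Here is a small example.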

import pandas as pd

has = [['@a'], ['@b'], ['#c, #d, #e, #f'], ['@g']]
use = [1,2,3,5]
z = ['#d','@a']
df = pd.DataFrame({'user': use, 'tweet': has})
df2 = pd.DataFrame({'z': z})

              tweet  user
0              [@a]     1
1              [@b]     2
2  [#c, #d, #e, #f]     3
3              [@g]     5
    z
0  #d
1  @a

The desired result would be

              tweet  user
0              [@b]     2
1              [@g]     5

Things I have tried

#this seems to work for dropping @a but not #d
for a in range(df.tweet.size):
    for search in df2.z:
        if search in df.loc[a].tweet:
            df.drop(a)

#this works for my small scale example but throws an error on my big data
df['tweet'] = df.tweet.apply(', '.join)
test = df[~df.tweet.str.contains('|'.join(df2['z'].astype(str)))]

#the error being "unterminated character set at position 1343770" 
#i went to check what was on that line and it returned this  
basket.iloc[1343770]

user_id                                 17060480
tweet      [#IfTheyWereBlackOrBrownPeople, #WTF]
Name: 4612505, dtype: object

Any help is greatly appreciated.
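On the "unterminated character set" error: as far as I can tell, the position in that message is an offset into the regex pattern built by '|'.join(...), not a row number, so looking up basket.iloc[1343770] inspects an unrelated row. The usual cause is a search term containing a regex metacharacter such as '['. A minimal sketch, assuming the terms should be matched literally (the toy data below is made up for illustration):

```python
import re
import pandas as pd

# Hypothetical toy data: one search term contains '[', a regex metacharacter.
df = pd.DataFrame({
    "user": [1, 2, 3],
    "tweet": [["@a"], ["#x[y"], ["@g"]],
})
df2 = pd.DataFrame({"z": ["#x[y", "@a"]})

# Escape every term so it is matched literally, then OR the terms together.
pattern = "|".join(re.escape(str(term)) for term in df2["z"])

# Join each list into one string and keep rows that match none of the terms.
joined = df["tweet"].apply(", ".join)
result = df[~joined.str.contains(pattern)]
```

Note that this is still substring matching, so a term like '#d' would also match '#dog'. Comparing whole list elements avoids that.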

Answer 1

Is ['#c, #d, #e, #f'] a single string, or a list like ['#c', '#d', '#e', '#f']?

has = [['@a'], ['@b'], ['#c', '#d', '#e', '#f'], ['@g']]
use = [1,2,3,5]
z = ['#d','@a']
df = pd.DataFrame({'user': use, 'tweet': has})
df2 = pd.DataFrame({'z': z})

A simple solution is

screen = set(df2.z.tolist())
to_delete = list()  # this will speed things up doing only 1 delete
for id, row in df.iterrows():
    if set(row.tweet).intersection(screen):
        to_delete.append(id)
df.drop(to_delete, inplace=True)
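The same set-intersection idea can also be written as a boolean mask, which avoids iterrows and the explicit drop. This is only a sketch on the small example; the names keep and result are mine:

```python
import pandas as pd

df = pd.DataFrame({
    "user": [1, 2, 3, 5],
    "tweet": [["@a"], ["@b"], ["#c", "#d", "#e", "#f"], ["@g"]],
})
df2 = pd.DataFrame({"z": ["#d", "@a"]})

screen = set(df2["z"])

# True for rows whose tweet list shares no element with the screen set.
keep = df["tweet"].apply(screen.isdisjoint)

# Keep only those rows and renumber the index from 0.
result = df[keep].reset_index(drop=True)
```

Unlike the loop, this leaves df untouched and returns a new dataframe.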

Speed comparison (10,000 rows), times in seconds:

import time

st = time.time()
screen = set(df2.z.tolist())
to_delete = list()
for id, row in df.iterrows():
    if set(row.tweet).intersection(screen):
        to_delete.append(id)
df.drop(to_delete, inplace=True)
print(time.time()-st)
2.142000198364258

st = time.time()
for a in df.tweet.index:
    for search in df2.z:
        if search in df.loc[a].tweet:
            df.drop(a, inplace=True)
            break
print(time.time()-st)
43.99799990653992

Answer 2

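Your code works for me with a few adjustments.

First, range(df.tweet.size) only matches the row labels when the index is the default 0..n-1. With any other index it hits the wrong labels or misses rows, so it is more robust to iterate over df.tweet.index.

Second, you never apply your drop: df.drop(a) returns a new dataframe, so use inplace=True.

Third, your #d sits inside a string. The following is not a list: '#c, #d, #e, #f'. You have to turn it into a list for the lookup to work.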
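The third point can be checked directly in plain Python. A small sketch (variable names are mine):

```python
# A list holding ONE string that merely looks like several tags.
one_string = ["#c, #d, #e, #f"]
# A list holding four separate tags.
real_list = ["#c", "#d", "#e", "#f"]

# `in` on a list compares whole elements, so this is False.
found_in_string_list = "#d" in one_string
# Here '#d' is an element of its own, so this is True.
found_in_real_list = "#d" in real_list
# `in` on a string does substring search, so this is True.
found_as_substring = "#d" in one_string[0]
```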

So, with that change, the following code works:

has = [['@a'], ['@b'], ['#c', '#d', '#e', '#f'], ['@g']]
use = [1,2,3,5]
z = ['#d','@a']
df = pd.DataFrame({'user': use, 'tweet': has})
df2 = pd.DataFrame({'z': z})

for a in df.tweet.index:
    for search in df2.z:
        if search in df.loc[a].tweet:
            df.drop(a, inplace=True)
            break  # so if we already dropped it we no longer look whether we should drop this line

This gives the desired result. Note that, because it is not vectorized, it may not be optimal.
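For reference, a more vectorized variant might look like the sketch below. It assumes pandas 0.25 or later (for Series.explode) and that each tweet cell is a real list of separate strings:

```python
import pandas as pd

df = pd.DataFrame({
    "user": [1, 2, 3, 5],
    "tweet": [["@a"], ["@b"], ["#c", "#d", "#e", "#f"], ["@g"]],
})
df2 = pd.DataFrame({"z": ["#d", "@a"]})

# One row per tag; the original row label is repeated in the index.
exploded = df["tweet"].explode()

# True for every tag that appears in df2.z.
hit = exploded.isin(df2["z"])

# Collapse back to one boolean per original row: does any tag match?
has_hit = hit.groupby(level=0).any()

# Keep the rows without a match and renumber the index.
result = df[~has_hit].reset_index(drop=True)
```

I have not timed this against the loops on the real data, so treat any speed benefit as something to measure.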

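Edit:

You can turn the strings into lists with the following:

from itertools import chain
# split every element on commas, flatten, and strip the spaces left after each comma
df.tweet = df.tweet.apply(lambda l: [part.strip() for part in chain.from_iterable(lelem.split(",") for lelem in l)])

This applies a function to each row (assuming each row holds a list with one or more elements): it splits each element (which should be a string) on commas, "flattens" the resulting lists into a single list per row (in case the row had several elements), and strips surrounding whitespace. Without the strip, '#c, #d' would split into '#c' and ' #d', and ' #d' would not compare equal to '#d'.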
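To make the split-and-flatten step concrete, here is a small sketch on a standalone Series. The helper name split_tags and the sample rows are mine. Splitting on ',' alone leaves a leading space on every piece after the first, so the sketch strips each piece:

```python
from itertools import chain
import pandas as pd

# Hypothetical rows: a clean list, one comma-joined string, and a mix.
tweets = pd.Series([["@a"], ["#c, #d, #e, #f"], ["@g", "#h,#i"]])

def split_tags(items):
    """Split every string on commas, flatten, and strip whitespace."""
    pieces = chain.from_iterable(item.split(",") for item in items)
    return [piece.strip() for piece in pieces]

cleaned = tweets.apply(split_tags)
# cleaned.tolist() ->
# [['@a'], ['#c', '#d', '#e', '#f'], ['@g', '#h', '#i']]
```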

EDIT2：

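Yes, this is not really efficient, but it basically does what was asked. Keep that in mind and, once it works, try to improve the code (fewer iterations, or the trick of collecting the indexes first and then dropping them all at once).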

pandas - drop a row holding a list of values if it contains a value from another list
