Using pandas: drop a row if the word in one column does not appear in the string of another column

Asked: 2015-08-20 10:28:26

Tags: python pandas dataframe

Suppose we have this DataFrame:

import pandas as pd

d = {'one': pd.Series(["word", "other-word", "banana", "hello"]),
     'two': pd.Series(["I like that word", "Have you seen other-word", "do you like bananas", "hello-kitty doll"])}

df = pd.DataFrame(d)

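How can I drop the rows where the word in column one does not appear as a word in column two? For example, in the third row banana does not match bananas: drop the row. In the fourth row hello does not match hello-kitty: drop it. This last case is important: a compound built with a hyphen (-) counts as a single word, so one part of the compound must not match.

Expected output: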

          one                       two
0        word          I like that word
1  other-word  Have you seen other-word
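The matching rule described in the question can be checked on plain strings before touching the DataFrame. The sketch below is only an illustration; word_matches is a hypothetical helper name, and it assumes that words in column two are separated by whitespace only.

```python
def word_matches(word, sentence):
    # True only if `word` equals one of the whitespace-separated tokens.
    return word in sentence.split()

# Plain substring test: too permissive for this question.
print("hello" in "hello-kitty doll")        # True
print("banana" in "do you like bananas")    # True

# Token test: hyphenated compounds and plurals stay distinct words.
print(word_matches("hello", "hello-kitty doll"))                # False
print(word_matches("banana", "do you like bananas"))            # False
print(word_matches("other-word", "Have you seen other-word"))   # True
```

Because str.split() with no arguments splits on whitespace only, hello-kitty stays a single token, which is the behavior the question asks for.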

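2 Answers:

Answer 0 (score: 2)

Edit

Another approach is to compute the index labels of the rows to remove, collect them in a list, and call DataFrame.drop() once at the end. Example/demo: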

In [45]: dropseries = []

In [46]: for i, row in df.iterrows():
   ....:     if row['one'] not in row['two'].split():
   ....:         dropseries.append(i)
   ....:

In [47]: df.drop(dropseries)
Out[47]:
          one                       two
0        word          I like that word
1  other-word  Have you seen other-word
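The same drop-by-index idea can also be written with a boolean mask instead of an explicit loop. This is a minimal sketch, not part of the original answer; the variable names no_match and kept are illustrative, and the DataFrame is rebuilt here so the block runs on its own.

```python
import pandas as pd

df = pd.DataFrame({
    "one": ["word", "other-word", "banana", "hello"],
    "two": ["I like that word", "Have you seen other-word",
            "do you like bananas", "hello-kitty doll"],
})

# True for rows whose `one` value is NOT a whole token of `two`.
no_match = df.apply(lambda row: row["one"] not in row["two"].split(), axis=1)

# Select the non-matching rows, take their index labels, and drop them in one call.
kept = df.drop(df[no_match].index)
print(kept)
```

DataFrame.drop() returns a new DataFrame here, so df itself is left unchanged.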

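I am not sure whether there is a better way to do this, but you can iterate over each row, split the string in column two, check whether the string in column one is among the resulting words, and append the matching rows to a new DataFrame.

Example: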

import pandas as pd

newdf = pd.DataFrame()

for i, row in df.iterrows():
    if row['one'] in row['two'].split():
        # requires pandas < 2.0, where DataFrame.append still exists
        newdf = newdf.append(row)

Example/demo:

In [38]: newdf = pd.DataFrame()

In [39]: for i, row in df.iterrows():
   ....:     if row['one'] in row['two'].split():
   ....:         newdf = newdf.append(row)
   ....:

In [40]: newdf
Out[40]:
          one                       two
0        word          I like that word
1  other-word  Have you seen other-word
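DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so the loop above fails on current versions. One possible adaptation, sketched here under the assumption that the row-by-row approach should be kept, is to collect the matching rows in a list and build the DataFrame once at the end. The name matching_rows is illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "one": ["word", "other-word", "banana", "hello"],
    "two": ["I like that word", "Have you seen other-word",
            "do you like bananas", "hello-kitty doll"],
})

matching_rows = []
for i, row in df.iterrows():
    # Keep the row only if `one` is a whole whitespace-separated token of `two`.
    if row["one"] in row["two"].split():
        matching_rows.append(row)

# Build the result once; each collected Series becomes one row and
# keeps its original index label. `columns` fixes the column order.
newdf = pd.DataFrame(matching_rows, columns=df.columns)
print(newdf)
```

Building the frame once also avoids copying the accumulated data on every iteration, which is what repeated appends did.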

Answer 1 (score: 2)

You can do it like this:

result = []
for x, y in zip(df.one, df.two):
    if x in y.split():
        result.append(True)
        continue
    result.append(False)

print(df[result])

A better way:

df[[x in y.split() for x, y in zip(df.one, df.two)]]
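For reuse, the list comprehension can be wrapped in a small function. This is a sketch only: keep_rows_with_word is a hypothetical helper, not a pandas API, and it assumes both columns contain strings with no missing values.

```python
import pandas as pd

def keep_rows_with_word(frame, word_col, text_col):
    """Return the rows where frame[word_col] is a whole token of frame[text_col]."""
    # One boolean per row; tokens are split on whitespace only.
    mask = [w in t.split() for w, t in zip(frame[word_col], frame[text_col])]
    return frame[mask]

df = pd.DataFrame({
    "one": ["word", "other-word", "banana", "hello"],
    "two": ["I like that word", "Have you seen other-word",
            "do you like bananas", "hello-kitty doll"],
})

result = keep_rows_with_word(df, "one", "two")
print(result)
```

Bracket access (frame[word_col]) is used instead of attribute access so the function also works for column names that are not valid Python identifiers. Since the split is on whitespace only, trailing punctuation stays attached to a token: "word" would not match a sentence ending in "word.".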