Question

Problem: I have two dataframes and want to remove any duplicates or partial duplicates between them.

 DF1                 DF2

 **Phrases**         **Phrases**  
 Little Red          Little Red Corvette
 Grow Your           Grow Your Beans
 James Bond          James Dean
 Tom Brady

I want to remove the "Little Red" and "Grow Your" phrases from DF1, then combine the two DFs so that the final product looks like:

 DF3
 Little Red Corvette
 Grow Your Beans
 James Bond
 James Dean
 Tom Brady

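Just a note: I only want to remove a phrase from DF1 if all of its words appear in a phrase in DF2 (e.g. Little Red vs. Little Red Corvette). I don't want to remove "James Bond" from DF1 just because "James Dean" appears in DF2.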

Answer 1

I came up with the solution below. It isn't very elegant, but it works.

import pandas as pd

df1 = pd.DataFrame(['Little Red', 'Grow Your', 'James Bond', 'Tom Brady'])
df2 = pd.DataFrame(['Little Red Corvette', 'Grow Your Beans', 'James Dean'])

# For each phrase in df1, look for a phrase in df2 that starts with it
# (the equivalent of left(df2, len(df1)) == df1). If one is found,
# replace the df1 phrase with the longer df2 phrase.
# Note that the column name is 0 (the default for an unnamed column).
for i in range(len(df1)):
    for j in range(len(df2)):
        if df2.loc[j, 0].startswith(df1.loc[i, 0]):
            df1.loc[i, 0] = df2.loc[j, 0]

# Finally, merge df1 and df2 on the union of the keys.
# Here, too, the column name is 0.
df3 = df2.merge(df1, how='outer', on=0, sort=True)

The DataFrame df3 is what you need.
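Note that the loop above matches on prefixes, while the question asks to drop a DF1 phrase only when all of its words appear in a single DF2 phrase. A minimal sketch of that word-based check follows. It assumes case-sensitive, whitespace-separated words; the column name `Phrases` and the helper `is_covered` are hypothetical names chosen for illustration, not taken from the original post.

```python
import pandas as pd

df1 = pd.DataFrame({"Phrases": ["Little Red", "Grow Your", "James Bond", "Tom Brady"]})
df2 = pd.DataFrame({"Phrases": ["Little Red Corvette", "Grow Your Beans", "James Dean"]})

# Pre-split every DF2 phrase into a set of words.
df2_word_sets = [set(phrase.split()) for phrase in df2["Phrases"]]


def is_covered(phrase):
    """Return True if every word of `phrase` occurs in one single DF2 phrase."""
    words = set(phrase.split())
    return any(words <= other for other in df2_word_sets)


# Keep only the DF1 phrases that no DF2 phrase covers, then stack the frames.
kept = df1[~df1["Phrases"].map(is_covered)]
df3 = pd.concat([df2, kept], ignore_index=True)
print(df3)
```

The rows come out with the DF2 phrases first, so the order differs from the DF3 listed in the question; sort the result if the order matters.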

Answer 2

After sorting, you can bisect the values:

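import pandas as pd
from bisect import bisect_left

df1 = pd.DataFrame(['Little Red', 'Grow Your', 'James Bond', 'Tom Brady'])
df2 = pd.DataFrame(['Little Red Corvette', 'Grow Your Beans', 'James Dean'])

def find_common(df1, df2):
    # Sort the df2 phrases so that bisect can binary-search them.
    vals = sorted(df2[0])
    for val in df1[0]:
        # Leftmost insertion point for val; hi keeps the index in range.
        ind = bisect_left(vals, val, hi=len(vals) - 1)
        # Keep the df1 phrase unless the df2 phrase found there contains it.
        if val not in vals[ind]:
            yield val


# Sort df2 explicitly (the original version sorted its values in place),
# then add the surviving df1 phrases. DataFrame.append was removed in
# pandas 2.0, so pd.concat is used instead.
df2_sorted = df2.sort_values(0, ignore_index=True)
df3 = pd.concat([df2_sorted, pd.DataFrame(list(find_common(df1, df2)))], ignore_index=True)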
print(df3)

Output:

                     0
0      Grow Your Beans
1           James Dean
2  Little Red Corvette
3           James Bond
4            Tom Brady

Sorting gives you an O(N log N) solution, rather than the O(N^2) of iterating over every string in df1 for each string you check from df2.
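Why a single bisect lookup is enough: in a sorted list, every string that starts with a given prefix sits in one contiguous block beginning at the leftmost insertion point for that prefix. The sketch below illustrates this with plain lists. The helper name `has_extension` is hypothetical, and it uses `startswith` (a strict prefix test) rather than the substring test in the answer's code.

```python
from bisect import bisect_left


def has_extension(sorted_phrases, prefix):
    """Return True if some phrase in the sorted list starts with `prefix`."""
    i = bisect_left(sorted_phrases, prefix)
    # If any phrase starts with `prefix`, the one at position i does.
    return i < len(sorted_phrases) and sorted_phrases[i].startswith(prefix)


phrases = sorted(["Little Red Corvette", "Grow Your Beans", "James Dean"])
for candidate in ["Little Red", "Grow Your", "James Bond", "Tom Brady"]:
    print(candidate, has_extension(phrases, candidate))
```

Each lookup costs O(log N) string comparisons after the one-off sort, which is where the O(N log N) total comes from.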

Answer 3

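I would start with an outer merge of the dataframes. I'm not sure whether DF1 in your post refers to a column name or to the dataframe variable name, but for simplicity I'll assume you have two dataframes, each with a column of strings: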

df1 
#        words
#0  little red
#1   grow your
#2  james bond
#3  tom brandy

df2 
#                 words
#0  little red corvette
#1      grow your beans
#2           james dean
#3               little

Next, create a new dataframe that merges the two (using an outer merge). This takes care of exact duplicates.

import pandas

df3 = pandas.merge(df1, df2, on='words', how='outer')
#                 words
#0           little red
#1            grow your
#2           james bond
#3           tom brandy
#4  little red corvette
#5      grow your beans
#6           james dean
#7               little

Next, use the Series.str.get_dummies method:

dummies = df3.words.str.get_dummies(sep='')
#   grow your  grow your beans  james bond  james dean  little  little red  \
#0          0                0           0           0       1           1   
#1          1                0           0           0       0           0   
#2          0                0           1           0       0           0   
#3          0                0           0           0       0           0   
#4          0                0           0           0       1           1   
#5          1                1           0           0       0           0   
#6          0                0           0           1       0           0   
#7          0                0           0           0       1           0   

#   little red corvette  tom brandy  
#0                    0           0  
#1                    0           0  
#2                    0           0  
#3                    0           1  
#4                    1           0  
#5                    0           0  
#6                    0           0  
#7                    0           0

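Note that a phrase's column sums to 1 when the phrase is not a substring of any other phrase in the words column (that is, it either has nothing to do with the others or is itself a superstring of one or more of them). Otherwise the column sums to a number greater than 1. You can now use this dummies dataframe to find the row indices of the substrings and drop them: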

from numpy import where

# Row position of every phrase whose dummy column sums to more than 1.
bad_rows = [where(df3.words == word)[0][0]
            for word in list(dummies)
            if dummies[word].sum() > 1]  # only substrings will sum to greater than 1
#[1, 7, 0]

df3.drop( df3.index[bad_rows] , inplace=True)
#                 words
#2           james bond
#3           tom brandy
#4  little red corvette
#5      grow your beans
#6           james dean

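Note: this also handles the case where a superstring has more than one substring. For example, 'little' and 'little red' are both substrings of the superstring 'little red corvette', so I assume you only want to keep the superstring.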
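The get_dummies(sep='') step may not behave the same way on every pandas and Python version (an assumption, not something verified across releases). As a hedged alternative, here is a minimal sketch that runs the substring test directly on the same example data. The helper name `is_substring_of_another` is hypothetical, and the check is a plain quadratic scan, so it is meant for small inputs.

```python
import pandas as pd

df1 = pd.DataFrame({"words": ["little red", "grow your", "james bond", "tom brandy"]})
df2 = pd.DataFrame({"words": ["little red corvette", "grow your beans", "james dean", "little"]})

# Stack both columns and drop exact duplicates.
words = pd.concat([df1["words"], df2["words"]], ignore_index=True).drop_duplicates()


def is_substring_of_another(word):
    """Return True if `word` occurs inside a different phrase in `words`."""
    return any(word != other and word in other for other in words)


# Keep only the phrases that are not contained in any other phrase.
mask = words.map(is_substring_of_another)
df3 = words[~mask].reset_index(drop=True).to_frame(name="words")
print(df3)
```

On this data the sketch keeps the same five phrases as the drop step above.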

Check whether words in one dataframe appear in another dataframe (Python 3, pandas)
