Question

Extension of: Removing list of words from a string

I have the following data frame, and I want to remove frequently occurring words from the df.name column:

df:

name
Bill Hayden
Rock Clinton
Bill Gates
Vishal James
James Cameroon
Micky James
Michael Clark
Tony Waugh  
Tom Clark
Tom Bill
Avinash Clinton
Shreyas Clinton
Ramesh Clinton
Adam Clark

I use the following code to create a new data frame containing the words and their frequencies:

df = pd.DataFrame(data.name.str.split(expand=True).stack().value_counts())
df.reset_index(level=0, inplace=True)
df.columns = ['word', 'freq']
df = df[df['freq'] >= 3]

which results in

df2:

word    freq
Clinton 4
Bill    3
James   3
Clark   3

I then convert it to a dictionary with the following snippet:

    d = dict(zip(df['word'], df['freq']))

Now, to remove from df.name the words that appear in d (the dictionary, word: freq), I use the following snippet:

def check_thresh_word(merc,d):
    m = merc.split(' ')
    for i in range(len(m)):
        if m[i] in d.keys():
            return False
    else:
        return True

def rm_freq_occurences(merc,d):
    if check_thresh_word(merc,d) == False:
        nwords = merc.split(' ')
        rwords = [word for word in nwords if word not in d.keys()]
        m = ' '.join(rwords)
    else:
        m=merc
    return m

df['new_name'] = df['name'].apply(lambda x: rm_freq_occurences(x,d))

But my actual data frame (df) contains nearly 240k rows, and I have to use a threshold greater than 100 (thresh = 3 in the sample above). Because of the repeated lookups, the code above takes a long time to run. Is there an efficient way to make it faster?

Following is the desired output:

name
Hayden
Rock
Gates
Vishal
Cameroon
Micky
Michael
Tony Waugh
Tom
Tom
Avinash
Shreyas
Ramesh
Adam

Thanks in advance!

Answer 1

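Use replace with a regular expression built by joining all values of the word column with |, then strip to remove the leftover leading and trailing whitespace: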

data.name = data.name.replace('|'.join(df['word']), '', regex=True).str.strip()
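One caveat with a plain alternation is that it also matches inside longer words (for example, "Bill" inside "Billy"). The following is a minimal sketch of a stricter variant, not part of the original answer: it assumes whole-word matching is wanted, adds `\b` word boundaries, and escapes each word with `re.escape`. The names `frequent_words` and `cleaned`, and the extra row "Billy Clarkson", are made up for illustration.

```python
import re

import pandas as pd

# Small sample; "Billy Clarkson" is a made-up row that shows the substring problem.
data = pd.DataFrame(
    {"name": ["Bill Hayden", "Rock Clinton", "Tony Waugh", "Tom Bill", "Billy Clarkson"]}
)

# Stand-in for df['word'] from the question.
frequent_words = ["Clinton", "Bill", "James", "Clark"]

# \b restricts matches to whole words; re.escape protects regex metacharacters.
pattern = r"\b(?:" + "|".join(map(re.escape, frequent_words)) + r")\b"

cleaned = (
    data["name"]
    .str.replace(pattern, "", regex=True)   # drop the frequent words
    .str.replace(r"\s+", " ", regex=True)   # collapse any doubled spaces left behind
    .str.strip()                            # trim leading/trailing whitespace
)

print(cleaned.tolist())
# ['Hayden', 'Rock', 'Tony Waugh', 'Tom', 'Billy Clarkson']
```

Without the `\b` anchors, the last row would be reduced to "y son".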

Another solution is to add \s* around each word so that zero or more surrounding whitespace characters are removed as well:

pat = '|'.join([r'\s*{}\s*'.format(x) for x in df['word']])
print (pat)
\s*Clinton\s*|\s*James\s*|\s*Bill\s*|\s*Clark\s*

data.name = data.name.replace(pat, '', regex=True)

print (data)
          name
0       Hayden
1         Rock
2        Gates
3       Vishal
4     Cameroon
5        Micky
6      Michael
7   Tony Waugh
8          Tom
9          Tom
10     Avinash
11     Shreyas
12      Ramesh
13        Adam
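For comparison, here is a sketch of a regex-free alternative that is not from the answer above: split each name into tokens once, count the tokens, and filter them against a `set` of frequent words. The helper name `remove_frequent_words` and its `min_count` parameter are hypothetical, the sketch assumes the column has no missing values, and it has not been timed on the 240k-row data set mentioned in the question.

```python
import pandas as pd


def remove_frequent_words(names: pd.Series, min_count: int) -> pd.Series:
    """Drop every whitespace-separated token that occurs at least min_count times."""
    tokens = names.str.split()               # one list of words per row
    counts = tokens.explode().value_counts() # word frequencies over all rows
    frequent = set(counts[counts >= min_count].index)  # set gives O(1) membership tests
    return tokens.apply(lambda words: " ".join(w for w in words if w not in frequent))


data = pd.DataFrame({"name": [
    "Bill Hayden", "Rock Clinton", "Bill Gates", "Vishal James",
    "James Cameroon", "Micky James", "Michael Clark", "Tony Waugh  ",
    "Tom Clark", "Tom Bill", "Avinash Clinton", "Shreyas Clinton",
    "Ramesh Clinton", "Adam Clark",
]})

data["new_name"] = remove_frequent_words(data["name"], min_count=3)
print(data["new_name"].tolist())
```

Because tokens are compared as whole strings, this variant never removes part of a longer word, and splitting on whitespace also discards the trailing spaces in "Tony Waugh  ".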

Removing words from data frame column from dictionary in Pandas
