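Pandas strings: replacing multiple words without a for loop

Time: 2018-07-09 14:17:31

Tags: python pandas

I have about 1.3 million strings in a pandas df (representing what users ask for when they email the IT help desk). I also have a Series of 29,813 names that I want to remove from these strings, so that only the words describing the problem remain. Here is a mini example of the data. It works, but it takes far too long. I am looking for a more efficient way to get this result:

Input: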

import pandas as pd

List1 = ["George Lucas has a problem logging in",
         "George Clooney is trying to download data into a spreadsheet", 
         "Bart Graham needs to logon to CRM urgently", 
         "Lucy Anne George needs to pull management reports"]
List2 = ["Access Team", "Microsoft Team", "Access Team", "Reporting Team"]

df = pd.DataFrame({"Team":List2, "Text":List1})

xwords = pd.Series(["George", "Lucas", "Clooney", "Lucy", "Anne", "Bart", "Graham"])

for word in xwords:
    df["Text"] = df["Text"].str.replace(word, "! ")

# Just using ! in the example so one can clearly see the result

Output:

Team                Text
0   Access Team     ! ! has a problem logging in
1   Microsoft Team  ! ! is trying to download data into a spreadsheet
2   Access Team     ! ! needs to logon to CRM urgently
3   Reporting Team  ! ! ! needs to pull management reports

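I have been looking for an answer for a while now. If I have missed it somewhere through lack of experience, please be gentle and let me know!

Many thanks :)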

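3 answers:

Answer 0 (score: 1)

Thanks to CiprianTomoiagă for pointing me to the post "Speed up millions of regex replacements in Python 3". The option provided by Eric Duminil (see "Use this method (with set lookup) if you want the fastest solution") works just as well in a pandas setting, with a Series instead of a list. The sample code for this question is repeated below. With my dataset, the whole thing finished in 2.54 seconds!

Input: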

import re

banned_words = set(word.strip().lower() for word in xwords)

def delete_banned_words(matchobj):
    word = matchobj.group(0)
    if word.lower() in banned_words:
        return ""
    else:
        return word

sentences = df["Text"]

word_pattern = re.compile(r'\w+')

df["Text"] = [word_pattern.sub(delete_banned_words, sentence) for sentence in sentences]
print(df)

Output:

Team              Text
Access Team       has a problem logging in
Microsoft Team    is trying to download data into a spreadsheet
Access Team       needs to logon to CRM urgently
Reporting Team    needs to pull management reports
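
One detail the table above hides: deleting a name leaves its neighbouring spaces behind, so the cleaned strings start with stray whitespace. Below is a minimal sketch of one way to tidy that up. It uses a plain Python list so it runs without pandas, and both the helper name `remove_banned_words` and the whitespace-collapsing step are my own additions, not part of Eric Duminil's answer.

```python
import re

# Lower-cased names to drop; a set gives constant-time membership tests.
banned_words = {"george", "lucas", "clooney", "lucy", "anne", "bart", "graham"}
word_pattern = re.compile(r"\w+")
space_pattern = re.compile(r"\s+")

def remove_banned_words(sentence):
    # Replace each banned word with an empty string (case-insensitive lookup).
    kept = word_pattern.sub(
        lambda m: "" if m.group(0).lower() in banned_words else m.group(0),
        sentence,
    )
    # Collapse the runs of whitespace left behind and trim both ends.
    return space_pattern.sub(" ", kept).strip()

sentences = [
    "George Lucas has a problem logging in",
    "Lucy Anne George needs to pull management reports",
]
cleaned = [remove_banned_words(s) for s in sentences]
print(cleaned)
```

With a DataFrame, the same function could be used as `df["Text"] = [remove_banned_words(s) for s in df["Text"]]`.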

Answer 1 (score: 0)

I would suggest tokenizing the text and using a set of names:

xwords = set(["George", "Lucas", ...])
# Filter each string separately: split it, drop the names, then rejoin
df["Text"] = df["Text"].apply(lambda s: ' '.join(w for w in s.split(' ') if w not in xwords))

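Depending on the strings, tokenization may need to be more sophisticated than simply splitting on spaces.

There may be a pandas-specific way to do this, but I have no experience with that ;)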
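
To illustrate the tokenization caveat: splitting on spaces keeps punctuation attached to a word, so "George," would not be found in the set. Here is a rough sketch of a regex-based tokenizer that separates word characters from everything else. The function name `drop_names` and the sample sentence are made up for illustration, and the names are written out in full only so the snippet runs on its own.

```python
import re

xwords = {"George", "Lucas", "Clooney", "Lucy", "Anne", "Bart", "Graham"}

# Match either a run of word characters or a run of non-word characters,
# so every character of the input ends up in exactly one token.
token_pattern = re.compile(r"\w+|\W+")

def drop_names(text):
    tokens = token_pattern.findall(text)
    # Punctuation and spaces are separate tokens, so they are kept as-is.
    return "".join(t for t in tokens if t not in xwords)

text = "George, can you help Bart? He cannot log on."
print(text.split(" "))   # the first token still carries its comma
print(drop_names(text))
```

The punctuation that followed a removed name stays in the result, so a further clean-up pass may still be wanted depending on the data.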

Answer 2 (score: 0)

pandas.Series.str.replace can take a compiled regular expression as the pattern:

import re
patt = re.compile(r'|'.join(xwords))
# pandas 2.0+ defaults to regex=False, which rejects a compiled pattern
df["Text"] = df["Text"].str.replace(patt, "! ", regex=True)

Maybe this helps? I have no experience with regexes this long, though.
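
One caveat with a bare alternation is that it also matches inside longer words, and any regex metacharacter in a name would be read as pattern syntax. A hedged sketch of a safer pattern, shown with the standard `re` module only; the sample sentence is invented for illustration:

```python
import re

xwords = ["George", "Lucas", "Clooney", "Lucy", "Anne", "Bart", "Graham"]

# Plain alternation, as in the answer above.
naive = re.compile("|".join(xwords))

# Escape each name so it is matched literally, and wrap the alternation
# in \b word boundaries so only whole words match.
patt = re.compile(r"\b(?:" + "|".join(re.escape(w) for w in xwords) + r")\b")

text = "Anne needs the Annex report from Bart"
print(naive.sub("!", text))  # also rewrites the start of "Annex"
print(patt.sub("!", text))   # leaves "Annex" untouched
```

The compiled `patt` could then be passed to `df["Text"].str.replace(patt, "! ", regex=True)` in the same way as above. Whether an alternation of 29,813 names performs acceptably is something I have not measured.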