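SMS language text expander - pandas

Time: 2018-03-25 04:00:02

Tags: python pandas text nltk str-replace

The goal is to replace the SMS abbreviations in a text with their expansions. I do this by comparing each word against the column values of a pandas DataFrame, which is read into Python from an xlsx file.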

word    expansion
fyi     for your information
gtg     got to go
brb     be right back
gtg2    got to go too
fyii    sample test
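
For readers who want to reproduce the problem without the spreadsheet, here is a minimal sketch that builds the same table in memory. It assumes the columns are named `word` and `expansion` as in the table above; the DataFrame is only a stand-in for `pd.read_excel('expansion.xlsx')`, since the file itself is not part of the post.

```python
import pandas as pd

# Same rows as the table above; a stand-in for pd.read_excel('expansion.xlsx').
sdf = pd.DataFrame({
    "word": ["fyi", "gtg", "brb", "gtg2", "fyii"],
    "expansion": [
        "for your information",
        "got to go",
        "be right back",
        "got to go too",
        "sample test",
    ],
})

# Map each abbreviation to its expansion.
rep = dict(zip(sdf["word"], sdf["expansion"]))
print(rep["gtg2"])
```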

Effort so far:

Courtesy of:

Replace words by checking from pandas dataframe

import re
import pandas as pd
sdf = pd.read_excel('expansion.xlsx')
rep = dict(zip(sdf.word, sdf.expansion))  # convert into dictionary
words = "fyi gtg gtg2 fyii really "
rep = dict((re.escape(k), v) for k, v in rep.items())  # escape regex metacharacters in the keys
pattern = re.compile("|".join(rep.keys()))  # alternation of all abbreviations
rep = pattern.sub(lambda m: rep[re.escape(m.group(0))], words)
print(rep)

Output:

for your information got to go got to go2 for your informationi really 

Expected output:

for your information got to go got to go too sample test really 

How can I make it match whole words only?

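1 answer:

Answer 0 (score: 1)

I don't know whether this matches your requirements exactly, but you could try putting a word boundary (\b) at the end of each word in the pattern, so that only whole words are considered: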

import re
import pandas as pd
sdf = pd.read_excel('expansion.xlsx')
rep = dict(zip(sdf.word, sdf.expansion))  # convert into dictionary
words = "fyi gtg gtg2 fyii really "
rep = dict((re.escape(k), v) for k, v in rep.items())
pattern = re.compile(r"\b|".join(rep.keys()) + r"\b")  # This line changes: \b after every key
rep = pattern.sub(lambda m: rep[re.escape(m.group(0))], words)
print(rep)

Output:

for your information got to go got to go too sample test really
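
The pattern above only anchors the end of each key, so an abbreviation at the tail of a longer word (for example `xfyi`) would presumably still be expanded. A possible variant, sketched here under a few assumptions, puts `\b` on both sides of a single group and tries longer keys first. The `expansions` dict and the `expand_sms` helper are hypothetical names for illustration; the dict is a hand-written copy of the table from the question rather than one read from `expansion.xlsx`.

```python
import re

# Hypothetical in-memory copy of the word -> expansion table from the question;
# in the code above this dict is built with dict(zip(sdf.word, sdf.expansion)).
expansions = {
    "fyi": "for your information",
    "gtg": "got to go",
    "brb": "be right back",
    "gtg2": "got to go too",
    "fyii": "sample test",
}

def expand_sms(text, mapping):
    """Replace each whole-word key of `mapping` found in `text` with its value."""
    # Try longer keys first so a longer abbreviation is never shadowed by its prefix.
    keys = sorted(mapping, key=len, reverse=True)
    # \b on both sides of the group: only whole words can match.
    pattern = re.compile(r"\b(?:" + "|".join(re.escape(k) for k in keys) + r")\b")
    # m.group(0) is the literal matched text, so it indexes the unescaped mapping directly.
    return pattern.sub(lambda m: mapping[m.group(0)], text)

print(expand_sms("fyi gtg gtg2 fyii really ", expansions))
```

Because the lookup uses the matched text itself, the keys of the dictionary do not need to be escaped in place. Note that `\b` only behaves as intended when every abbreviation starts and ends with a word character (letters, digits or underscore); keys such as `:)` would need a different delimiter check.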