Question

I'm new to topic modeling. After tokenizing with NLTK, I end up with tokens such as the following:

'1-in', '1-joerg', '1-justine', '1-lleyton', '1-million', '1-nil', '1of', '00pm-ish', '01.41', '01.57', '0-40', '0-40f'

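I believe these are meaningless and won't help me in the rest of my pipeline. Is that right? If so, does anyone have an idea for a regular expression (or some other approach) that I could use to remove such tokens from my token list? They vary so much that I can't come up with a regex myself.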

Answer 1

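The easiest way I've found to get rid of unwanted words in a string is to replace them with an empty string, using a lookup dictionary of the words to drop.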

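import re

def word_replace(text, replace_dict):
    # Match whole words that start with a letter or underscore.
    # Tokens that start with a digit (e.g. '1-in') are NOT matched by this pattern.
    rc = re.compile(r"[A-Za-z_]\w*")

    def translate(match):
        word = match.group(0)
        # Look the word up case-insensitively; leave it unchanged if it is not listed.
        return replace_dict.get(word.lower(), word)

    return rc.sub(translate, text)

path = 'C:/the_file_with_this_string'
with open(path) as f:
    old_text = f.read()

# Map each unwanted word (in lowercase) to its replacement, here an empty string.
replace_dict = {
    "unwanted_string1": '',
    "unwanted_string2": '',
    "unwanted_string3": '',
    "unwanted_string4": '',
    "unwanted_string5": '',
    "unwanted_string6": '',
    "unwanted_string7": '',
    "unwanted_string8": '',
    "unwanted_string9": '',
    "unwanted_string10": '',
}

output = word_replace(old_text, replace_dict)

# Overwrite the original file with the cleaned text.
with open(path, 'w') as f:
    f.write(output)
print(output)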

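Replace the file path 'C:/the_file_with_this_string' with the path to the file that contains your string.

Replace the unwanted_string entries (#) with the strings you want to get rid of.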
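Note that the pattern [A-Za-z_]\w* in word_replace only matches words that begin with a letter or underscore, so tokens like '1-in' or '01.41' from the question would pass through untouched. Since the question already has a token list, one alternative is to filter that list directly. The following is only a rough sketch under an assumed rule (keep a token only if it is made of letters, optionally joined by single hyphens or apostrophes); the names WORD_RE, keep_token, filter_tokens and min_len are hypothetical, and the last three sample tokens are made up for illustration. Whether digit-bearing tokens are really noise depends on your corpus, so treat the rule as a starting point.

```python
import re

# Assumed rule (not from the original answer): a "meaningful" token is made
# only of letters, optionally joined by single hyphens or apostrophes.
WORD_RE = re.compile(r"[A-Za-z]+(?:[-'][A-Za-z]+)*")

def keep_token(token, min_len=2):
    """Return True if the token looks like an ordinary word."""
    return len(token) >= min_len and WORD_RE.fullmatch(token) is not None

def filter_tokens(tokens, min_len=2):
    """Drop tokens containing digits or other non-letter characters."""
    return [t for t in tokens if keep_token(t, min_len)]

# First seven tokens come from the question; the last three are illustrative.
tokens = ['1-in', '1-joerg', '1-million', '1of', '00pm-ish', '01.41',
          '0-40f', 'tennis', 'well-known', 'match']
print(filter_tokens(tokens))  # ['tennis', 'well-known', 'match']
```

This drops a whole token as soon as any part of it fails the rule. If you would rather salvage the alphabetic part (for example 'joerg' from '1-joerg'), you would need to split on hyphens before filtering.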

Removing other meaningless tokens from text in Python
