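I have a problem: I am using Python to analyze data. As a first step, I use a lemma list (lemmas.txt) to preprocess my data. The list of lemmas looks like this, for example: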
A-bomb -> A-bombs
abacus -> abacuses
abandon -> abandons,abandoning,abandoned
abase -> abases,abasing,abased
abate -> abates,abating,abated
abbess -> abbesses
abbey -> abbeys
abbot -> abbots
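..... Can you help me clean my data in Python using this list? Thanks.
Answer 0 (score: 1)
This code parses your lemmas file and puts the entries into a dict, where each key is a word to be replaced and its value is the lemma that replaces it.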
def parse_lemmas(lemma_lines):
    # Each line looks like "abandon -> abandons,abandoning,abandoned"
    for line in lemma_lines:
        if ' -> ' not in line:
            continue  # skip blank or malformed lines
        target, from_words_str = line.split(' -> ')
        from_words = from_words_str.split(',')
        for word in from_words:
            # key: inflected form, value: the lemma it maps to
            yield (word, target)

with open('lemmas.txt', 'r') as lemmas_file:
    lemmas = dict(parse_lemmas(lemma_line.strip() for lemma_line in lemmas_file))
# The dictionary lemmas now has all the lemmas in the lemmas file
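To check the mapping without a file on disk, the same parsing idea can be run on a few in-memory lines. This is only a minimal sketch: `build_lemma_map`, `sample_lines`, and `sample_map` are names made up here for illustration, and it assumes every entry uses the exact ` -> ` separator and comma-separated forms shown in the question.

```python
def build_lemma_map(lines):
    """Map each inflected form to its lemma (hypothetical helper)."""
    mapping = {}
    for line in lines:
        line = line.strip()
        if ' -> ' not in line:
            continue  # assumption: anything without the separator is ignored
        target, forms = line.split(' -> ', 1)
        for form in forms.split(','):
            mapping[form.strip()] = target
    return mapping

# Sample lines copied from the question, plus an empty line
sample_lines = [
    "abandon -> abandons,abandoning,abandoned",
    "abbey -> abbeys",
    "",
]
sample_map = build_lemma_map(sample_lines)
print(sample_map["abandoning"])          # abandon
print(sample_map.get("abbot", "abbot"))  # abbot (not in the map, returned unchanged)
```

Note that the lemma itself (for example `abandon`) is not a key in the dict; it does not need to be, since `get(word, word)` returns unknown words as they are.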
Once you have split your data into a list of words, you can run the following code.
# if your data isn't too large
new_data = [lemmas.get(word, word) for word in data]
# if it's so large you don't want to make another copy,
# you can do it in-place
for idx, word in enumerate(data):
    data[idx] = lemmas.get(word, word)
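Note that the data does not have to consist only of words; for example, you could split "This is your data. This, here, is your data with punctuation; see?"
into ['This', 'is', 'your', 'data', '.', 'This', ',', 'here', ',', 'is', 'your', 'data', 'with', 'punctuation', ';', 'see', '?'].
In that case the punctuation tokens are passed through unchanged, because they are not keys in the dict. The best approach depends on your actual data and on what information you need to preserve when splitting and rejoining.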
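One possible way to get such a token list is a regular expression that keeps words and punctuation marks as separate tokens. The sketch below is an assumption about how you might tokenize, not the only option: `lemmatize_text` and `demo_map` are hypothetical names, the pattern is deliberately simple, and the lookup is case-sensitive, so a capitalized form would need lowercasing first if your list only holds lowercase entries.

```python
import re

def lemmatize_text(text, lemma_map):
    """Split text into word and punctuation tokens, then replace known forms."""
    # \w+(?:-\w+)* keeps hyphenated words such as "A-bombs" in one token;
    # [^\w\s] matches each punctuation mark as its own token.
    tokens = re.findall(r"\w+(?:-\w+)*|[^\w\s]", text)
    # Tokens missing from the map (including punctuation) pass through unchanged
    return [lemma_map.get(tok, tok) for tok in tokens]

# Small hand-written map standing in for the dict built from lemmas.txt
demo_map = {"abbeys": "abbey", "abandoned": "abandon", "A-bombs": "A-bomb"}
demo_tokens = lemmatize_text("They abandoned the abbeys; see?", demo_map)
print(demo_tokens)
# ['They', 'abandon', 'the', 'abbey', ';', 'see', '?']
```

Rejoining with `' '.join(demo_tokens)` would put a space before each punctuation mark, so if the original spacing matters you would need to record it while splitting.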