How to use a lemmas list in Python

Time: 2015-08-10 07:32:40

Tags: python preprocessor

Question: I am using Python to analyze data. As a first step, I want to preprocess my data with a lemmas list (lemas.txt). The list looks like this, for example:

A-bomb -> A-bombs
abacus -> abacuses
abandon -> abandons,abandoning,abandoned
abase -> abases,abasing,abased
abate -> abates,abating,abated
abbess -> abbesses
abbey -> abbeys
abbot -> abbots

..... Could you help me use this list to clean my data with Python? Thanks.

1 answer:

Answer 0 (score: 1)

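This code parses your lemmas file into a dict whose keys are the words to be replaced (the inflected forms) and whose values are the lemmas that replace them.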

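def parse_lemmas(lemma_lines):
    # Each line looks like: "abandon -> abandons,abandoning,abandoned"
    for line in lemma_lines:
        if ' -> ' not in line:
            continue  # skip blank or malformed lines such as "....."
        target, from_words_str = line.split(' -> ', 1)
        from_words = from_words_str.split(',')
        for word in from_words:
            # map each inflected form to its lemma
            yield (word.strip(), target.strip())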


with open('lemmas.txt', 'r') as lemmas_file:
    lemmas = dict(parse_lemmas(lemma_line.strip() for lemma_line in lemmas_file))

# The dictionary lemmas now maps every inflected form in the lemmas file
# to its lemma, e.g. 'abandoned' -> 'abandon'
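
To make the shape of the resulting dictionary concrete, here is a minimal sketch that runs the same parsing logic on a few lines copied from the question. The in-memory list sample_lines is an assumption used only for illustration, in place of reading lemmas.txt from disk.

```python
def parse_lemmas(lemma_lines):
    """Yield (inflected_form, lemma) pairs from 'lemma -> form1,form2' lines."""
    for line in lemma_lines:
        if ' -> ' not in line:
            continue  # ignore blank or malformed lines
        target, from_words_str = line.split(' -> ', 1)
        for word in from_words_str.split(','):
            yield (word.strip(), target.strip())


# Hypothetical sample: three lines taken from the question's list
sample_lines = [
    "A-bomb -> A-bombs",
    "abandon -> abandons,abandoning,abandoned",
    "abbey -> abbeys",
]

lemmas = dict(parse_lemmas(sample_lines))
print(lemmas)
```

Note that the lemma itself (for example 'abandon') is not a key in the dictionary, which is why the lookup later uses lemmas.get(word, word): any word that is not a known inflected form is left unchanged.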

Once you have split your data into a list of words, you can run the following code.

# if your data isn't too large
new_data = [lemmas.get(word, word) for word in data]

# if it's so large you don't want to make another copy,
# you can do it in-place
for idx, word in enumerate(data):
    data[idx] = lemmas.get(word, word)

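Note that the data does not have to consist only of words; for example, you could split "This is your data. This, here, is your data with punctuation; see?" into ['This', 'is', 'your', 'data', '.', 'This', ',', 'here', ',', 'is', 'your', 'data', 'with', 'punctuation', ';', 'see', '?']. In that case, the punctuation tokens are passed through unchanged, because they are not keys in the dictionary. The best approach depends on your actual data and on what information you need to preserve when splitting and rejoining.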
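
One possible way to produce such a token list is a regular expression that keeps words and punctuation as separate tokens. This is only a sketch: the tokenize helper, the TOKEN_RE pattern, and the small lemmas dict below are assumptions for illustration, not part of the original answer. The pattern keeps hyphenated words such as "A-bombs" together so they can still match entries in the lemmas list.

```python
import re

# A word (optionally with internal hyphens), or a single punctuation character
TOKEN_RE = re.compile(r"\w+(?:-\w+)*|[^\w\s]")


def tokenize(text):
    """Split text into word tokens and single-character punctuation tokens."""
    return TOKEN_RE.findall(text)


# Hypothetical lemma mapping, in the same form the parser above produces
lemmas = {"abandoned": "abandon", "abbeys": "abbey", "A-bombs": "A-bomb"}

text = "They abandoned the abbeys; A-bombs, too?"
tokens = tokenize(text)

# Replace known inflected forms; everything else passes through unchanged
cleaned = [lemmas.get(token, token) for token in tokens]
print(cleaned)
```

The lookup here is case-sensitive, so a capitalized form such as "Abandoned" at the start of a sentence would not be replaced; whether to lowercase tokens before the lookup depends on your data.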