How to read n-grams from a file and then match them against tokens

Time: 2017-12-19 19:03:34

Tags: python python-3.x n-gram

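I want to read n-grams that are saved in a file, then match each word of those n-grams against the individual tokens in my corpus and, where they match, replace the tokens with the n-gram. Let's say I have these bigrams: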

painful punishment
worldly life
straight path
Last Day
great reward
severe punishment
clear evidence

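What I want to do is read the first bigram, split it, and compare its first word, "painful", with the tokens in the corpus. Where it matches a token, move to the next token and compare that with the bigram's second word; if it is "punishment", replace the two tokens with a single token, "painful punishment". I don't know how to do this. I want to turn this logic into code. I would be very grateful if someone could help me.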

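1 Answer:

Answer 0: (score: 0)

First of all, this is not really a question for StackOverflow (it sounds like a homework problem). You can easily find various ways to do this with a Google search. Still, I'll give you a solution, because I needed a warm-up: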

# -*- coding: utf-8 -*-

import traceback, sys, re

'''
Open the bigrams file and load into an array.
Assuming bigrams are cleaned (else, you can do this using method below).
'''
try:
    with open('bigrams.txt') as bigrams_file:
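        # Lowercase and skip blank lines so entries such as "Last Day" can match the cleaned input.
        bigrams = [line.strip().lower() for line in bigrams_file if line.strip()]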
except Exception:
    print('BIGRAMS LOAD ERROR: '+str(traceback.format_exc()))
    sys.exit(1)

test_input = 'There is clear good evidence a great reward is in store.'

'''
Clean input method.
'''
def clean_input(text_input):
    text_input = text_input.lower()
    text_input = text_input.strip(' \t\n\r')
    alpha_num_underscore_only = re.compile(r'([^\s\w_])+', re.UNICODE)
    text_input = alpha_num_underscore_only.sub(' ', text_input)
    text_input = re.sub(' +', ' ', text_input)
    return text_input.strip()

test_input_words = test_input.split()
test_input_clean = clean_input(test_input)
test_input_clean_words = test_input_clean.split()

'''
Loop through the test_input bigram by bigram.
If we match one, then increment the index to move onto the next bigram.
This is a quick implementation --- you can modify for efficiency, and higher-order n-grams.
'''
output_text = []
skip_index = 0
for i in range(len(test_input_clean_words)-1):
    if i >= skip_index:
        if ' '.join([test_input_clean_words[i], test_input_clean_words[i+1]]) in bigrams:
            print(test_input_clean_words[i], test_input_clean_words[i+1])
            skip_index = i+2
            output_text.append('TOKEN_'+'_'.join([test_input_words[i], test_input_words[i+1]]).upper())
        else:
            skip_index = i+1
            output_text.append(test_input_words[i])
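# Append the final word unless it was already consumed by a bigram match.
if skip_index < len(test_input_clean_words):
    output_text.append(test_input_words[len(test_input_clean_words)-1])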

print(' '.join(output_text))

Input:

There is clear good evidence a great reward is in store.

Output:

There is clear good evidence a TOKEN_GREAT_REWARD is in store.
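
If you need higher-order n-grams or faster lookups, the same idea could be generalised roughly as below. This is only a sketch under a few assumptions: the names load_ngrams and merge_ngrams are made up for illustration, the n-grams are kept in a set of word tuples, matching ignores case and surrounding punctuation, and the longest n-gram starting at a position wins.

```python
def load_ngrams(lines):
    """Return a set of lowercase n-grams, each stored as a tuple of words."""
    return {tuple(line.lower().split()) for line in lines if line.strip()}


def merge_ngrams(tokens, ngrams, joiner='_'):
    """Replace every run of tokens that forms a known n-gram with one token.

    Longer n-grams are tried first. Matching ignores case and
    leading/trailing punctuation, but the output keeps the original tokens.
    """
    if not ngrams:
        return list(tokens)
    max_n = max(len(ngram) for ngram in ngrams)
    # Normalised copy of the tokens, used only for matching.
    keys = [token.lower().strip('.,;:!?"\'()') for token in tokens]
    merged = []
    i = 0
    while i < len(tokens):
        # Try the longest possible n-gram first, down to bigrams.
        for n in range(min(max_n, len(tokens) - i), 1, -1):
            if tuple(keys[i:i + n]) in ngrams:
                merged.append(joiner.join(tokens[i:i + n]))
                i += n
                break
        else:
            # No n-gram starts here: keep the token unchanged.
            merged.append(tokens[i])
            i += 1
    return merged


# Hypothetical in-memory list standing in for the lines of bigrams.txt.
ngrams = load_ngrams(['great reward', 'Last Day', 'severe punishment'])
tokens = 'A great reward awaits on the Last Day.'.split()
print(' '.join(merge_ngrams(tokens, ngrams)))
```

With a real file you would pass the open file object to load_ngrams instead of the list. Using a set makes each lookup constant time on average, whereas the list in the code above is scanned linearly for every candidate bigram.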