Question

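I have a function that scores words. I have a lot of text, ranging from single sentences to documents several pages long. I've been struggling with how to score the words and return the text in something close to its original form.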

Here is an example sentence:

"My body lies over the ocean, my body lies over the sea."

What I want to produce is the following:

"My body (2) lies over the ocean (3), my body (2) lies over the sea."

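Below is a dummy version of my scoring algorithm. I have figured out how to take the text, tear it apart, and score it.

However, I'm stuck on how to put it back together in the format I need.

Here is a dummy version of my function: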

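from textblob import TextBlob
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()  # assumption: the lemmatizer is NLTK's WordNetLemmatizer

def word_score(text):
    words_to_work_with = []
    words_to_return = []
    passed_text = TextBlob(text)
    # Normalize each word: singular form, lowercase, then lemma
    for word in passed_text.words:
        word = word.singularize().lower()
        word = str(word)
        e_word_lemma = lemmatizer.lemmatize(word)
        words_to_work_with.append(e_word_lemma)
    # Pair each normalized word with its score (None if it has no score)
    for word in words_to_work_with:
        if word == 'body':
            score = 2
        elif word == 'ocean':
            score = 3
        else:
            score = None
        words_to_return.append((word, score))
    return words_to_return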

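I'm relatively new to this, so I have two questions:

How do I put the text back together?
Should that logic go inside the function or outside it?

I would really like to be able to feed entire segments (i.e. sentences, documents) into the function and have it return them.

Thanks for your help!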

Answer 1

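So basically, you want to assign a score to each word. You can improve the function you provided by using a dictionary instead of several if statements. You also have to return all the scores, not just the score of the first word in words_to_work_with, which is the function's current behavior, since it returns an integer on the first iteration. So the new function would be: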

# Assumes TextBlob is imported and lemmatizer is defined, as in the question
def word_score(text):
    words_to_work_with = []
    passed_text = TextBlob(text)
    for word in passed_text.words:
        word = word.singularize().lower()
        word = str(word) # Is this line really useful ?
        e_word_lemma = lemmatizer.lemmatize(word)
        words_to_work_with.append(e_word_lemma)

    dict_scores = {'body': 2, 'ocean': 3}  # etc ...
    # if a word is not recognized, its score is None
    return [dict_scores.get(word, None) for word in words_to_work_with]
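
To show the dictionary lookup in isolation, here is a minimal sketch that leaves out TextBlob and the lemmatizer and only lowercases words that have already been split. The names `SCORES` and `score_words` are made up for illustration; only the 'body' and 'ocean' scores come from the question.

```python
# Hypothetical score table; only 'body' and 'ocean' come from the question.
SCORES = {'body': 2, 'ocean': 3}

def score_words(words):
    """Return one score per word, or None when the word has no score."""
    # dict.get returns None by default when the key is missing
    return [SCORES.get(word.lower()) for word in words]

print(score_words(['My', 'body', 'lies', 'over', 'the', 'ocean']))
# prints [None, 2, None, None, None, 3]
```

Adding a new scored word then means adding one dictionary entry rather than another if branch.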

As for the second part, rebuilding the string, I would actually do it in the same function (so this answers your second question):

def word_score_and_reconstruct(text):
    original_words = []
    words_to_work_with = []
    passed_text = TextBlob(text)

    for word in passed_text.words:  # note: .words leaves out punctuation
        original_words.append(str(word))
        word = word.singularize().lower()
        word = str(word)  # Is this line really useful ?
        e_word_lemma = lemmatizer.lemmatize(word)
        words_to_work_with.append(e_word_lemma)

    dict_scores = {'body': 2, 'ocean': 3}
    dict_strings = {'body': ' (2)', 'ocean': ' (3)'}

    word_scores = []
    pieces = []

    for original, word in zip(original_words, words_to_work_with):
        word_scores.append(dict_scores.get(word, None)) # we still construct the scores list here

        # we add 'word'+'(word's score)', only if the word has a score
        # if not, we add the default value '' meaning we don't add anything
        pieces.append(original + dict_strings.get(word, ''))

    # join with spaces so the words do not run together
    reconstructed_text = ' '.join(pieces)
    return reconstructed_text, word_scores

I can't guarantee this code will work on the first try, as I couldn't test it, but it should give you the main idea.
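
On the question of whether the logic belongs inside or outside the function, one alternative is to keep scoring and rebuilding as two small functions. The sketch below is only an illustration under simplifying assumptions: it uses the standard `re` module instead of TextBlob, matches words by lowercasing only (no lemmatization), and the names `SCORES`, `score_tokens` and `render` are hypothetical.

```python
import re

SCORES = {'body': 2, 'ocean': 3}  # hypothetical score table

def score_tokens(text):
    """Split text into word and non-word tokens and pair each with a score."""
    # \w+ matches words; \W+ matches the punctuation and spaces between them,
    # so joining all tokens reproduces the input exactly
    tokens = re.findall(r"\w+|\W+", text)
    return [(token, SCORES.get(token.lower())) for token in tokens]

def render(pairs):
    """Rebuild the text, appending ' (score)' after every scored token."""
    return ''.join(
        token if score is None else f"{token} ({score})"
        for token, score in pairs
    )

text = "My body lies over the ocean, my body lies over the sea."
print(render(score_tokens(text)))
# prints: My body (2) lies over the ocean (3), my body (2) lies over the sea.
```

Keeping the two steps apart means the (token, score) pairs can be reused elsewhere, while a single call such as `render(score_tokens(text))` still takes a whole sentence or document in and returns it annotated.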

Answer 2

Hope this helps. Based on your question, this works for me.

Best regards!

"""
Python 3.7.2

Input:
Saved text in the file named as "original_text.txt"
My body lies over the ocean, my body lies over the sea. 
"""
output_text = []

# Read from the input file and write the processed text to the output file;
# the with-block closes both files automatically
with open('original_text.txt', 'r') as input_file, \
        open('processed_text.txt', 'w') as output_file:
    for line in input_file:
        words = line.split()
        for word in words:
            if word == 'body':
                output_text.append('body (2)')
                output_file.write('body (2) ')
            elif word == 'body,':
                output_text.append('body (2),')
                output_file.write('body (2), ')
            elif word == 'ocean':
                output_text.append('ocean (3)')
                output_file.write('ocean (3) ')
            elif word == 'ocean,':
                output_text.append('ocean (3),')
                output_file.write('ocean (3), ')
            else:
                output_text.append(word)
                output_file.write(word + ' ')

print(output_text)
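
The separate branches for 'body' and 'body,' can be avoided by peeling trailing punctuation off each word before looking it up. The following is a rough sketch of that idea working on strings rather than files; `SCORES`, `annotate_word` and `annotate_line` are hypothetical names, and matching is by lowercase only.

```python
import string

SCORES = {'body': 2, 'ocean': 3}  # hypothetical score table

def annotate_word(word):
    """Append ' (score)' to a word, keeping trailing punctuation after it."""
    core = word.rstrip(string.punctuation)   # e.g. 'ocean,' -> 'ocean'
    trailing = word[len(core):]              # e.g. ','
    score = SCORES.get(core.lower())
    if score is None:
        return word
    return f"{core} ({score}){trailing}"

def annotate_line(line):
    """Annotate every whitespace-separated word in a line."""
    return ' '.join(annotate_word(word) for word in line.split())

print(annotate_line("My body lies over the ocean, my body lies over the sea."))
# prints: My body (2) lies over the ocean (3), my body (2) lies over the sea.
```

Because the line is split on whitespace and rejoined with single spaces, runs of spaces or tabs in the input are collapsed to one space.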

Answer 3

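Here is a working implementation. The function first parses the input text into a list, so that each list element is either a word or a run of punctuation (for example, a comma followed by a space). Once the words in the list have been processed, it joins the list back into a string and returns it.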

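import re
import inflection
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()  # assumption: NLTK's WordNet lemmatizer

def word_score(text):
    # Split into alternating word tokens and punctuation/whitespace tokens
    words_to_work_with = re.findall(r"\b\w+|\b\W+",text)
    for i,word in enumerate(words_to_work_with):
        if word.isalpha():
            # note: this value is overwritten by the lemmatized original word on the next line
            words_to_work_with[i] = inflection.singularize(word).lower()
            words_to_work_with[i] = lemmatizer.lemmatize(word)
            if word == 'body':
                words_to_work_with[i] = 'body (2)'
            elif word == 'ocean':
                words_to_work_with[i] = 'ocean (3)'
    # Join the tokens back into a single string
    return ''.join(words_to_work_with)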

txt = "My body lies over the ocean, my body lies over the sea."
output = word_score(txt)
print(output)

Output:

My body (2) lie over the ocean (3), my body (2) lie over the sea.

If you are going to score more than two words, using a dictionary instead of if conditions really is a good idea.
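
As a sketch of that dictionary-based variant, the lookup can be done inside a replacement function passed to `re.sub`, so every word is annotated in place and the punctuation and spacing are never touched. This is an illustration only: it matches words by lowercase without singularizing or lemmatizing, and `SCORES` and `annotate` are hypothetical names.

```python
import re

SCORES = {'body': 2, 'ocean': 3}  # hypothetical score table

def annotate(text, scores=SCORES):
    """Append ' (score)' after every word that has an entry in scores."""
    def add_score(match):
        word = match.group(0)
        score = scores.get(word.lower())
        # Leave unscored words unchanged
        return word if score is None else f"{word} ({score})"
    # re.sub calls add_score once for every run of word characters
    return re.sub(r"\w+", add_score, text)

txt = "My body lies over the ocean, my body lies over the sea."
print(annotate(txt))
# prints: My body (2) lies over the ocean (3), my body (2) lies over the sea.
```

Unlike the implementation above, this keeps each word in its original form ("lies" stays "lies"), at the cost of not matching plurals such as "bodies".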

Split a sentence, process the words, and put the sentence back together?
