Split a sentence, process the words, and put the sentence back together?

Time: 2019-03-30 16:31:07

Tags: python text split nltk sentence

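I have a function that scores words. I have a lot of text, ranging from single sentences to documents several pages long. I have been trying to work out how to score the words and return the text in something close to its original form.

Here is an example sentence: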

"My body lies over the ocean, my body lies over the sea."

What I want to produce is the following:

"My body (2) lies over the ocean (3), my body (2) lies over the sea."

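Below is a dummy version of my scoring algorithm. I have already figured out how to take the text, tear it apart, and score it.

However, I am stuck on how to put it back together in the format I need.

Here is a dummy version of my function: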

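from textblob import TextBlob
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()  # assumed to be NLTK's WordNet lemmatizer

def word_score(text):
    words_to_work_with = []
    words_to_return = []
    passed_text = TextBlob(text)
    # Normalize each word: singular form, lowercase, then lemma
    for word in passed_text.words:
        word = word.singularize().lower()
        word = str(word)
        e_word_lemma = lemmatizer.lemmatize(word)
        words_to_work_with.append(e_word_lemma)
    # Pair every normalized word with its score (None if it has no score)
    for word in words_to_work_with:
        if word == 'body':
            score = 2
        elif word == 'ocean':
            score = 3
        else:
            score = None
        words_to_return.append((word, score))
    return words_to_return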

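I am relatively new to this, so I have two questions:

  1. How can I put the text back together?
  2. Should that logic go inside the function or outside it?

I would really like to be able to feed whole segments (i.e. sentences, documents) into the function and have it return them.

Thanks for your help!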

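3 Answers:

Answer 0: (score: 1)

So basically, you want to assign a score to each word. You can improve the function you provided by using a dictionary instead of several if statements. Also, you have to return all the scores, not just the score of the first word in words_to_work_with, which is the function's current behavior, since it returns an integer on the first iteration. So the new function would be: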

def word_score(text):
    words_to_work_with = []
    passed_text = TextBlob(text)
    for word in passed_text.words:
        word = word.singularize().lower()
        word = str(word) # Is this line really useful ?
        e_word_lemma = lemmatizer.lemmatize(word)
        words_to_work_with.append(e_word_lemma)

    dict_scores = {'body' : 2, 'ocean' : 3}  # etc ...
    # one score per word; if a word is not recognized, its score is None
    return [dict_scores.get(word, None) for word in words_to_work_with]

For the second part, which is rebuilding the string, I would actually do it in the same function (so this answers your second question):

def word_score_and_reconstruct(text):
    words_to_work_with = []
    passed_text = TextBlob(text)

    for word in passed_text.words:
        word = word.singularize().lower()
        word = str(word)  # Is this line really useful ?
        e_word_lemma = lemmatizer.lemmatize(word)
        words_to_work_with.append(e_word_lemma)

    dict_scores = {'body': 2, 'ocean': 3}
    dict_strings = {'body': ' (2)', 'ocean': ' (3)'}

    word_scores = []
    reconstructed_words = []

    for word in words_to_work_with:
        word_scores.append(dict_scores.get(word, None)) # we still construct the scores list here

        # we add 'word'+'(word's score)', only if the word has a score
        # if not, we add the default value '' meaning we don't add anything
        reconstructed_words.append(word + dict_strings.get(word, ''))

    # join with single spaces; punctuation is not restored, because
    # TextBlob's .words excludes punctuation tokens
    return ' '.join(reconstructed_words), word_scores

I can't guarantee this code will work on the first try, since I couldn't test it, but it should give you the main idea.
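If the goal is to get the original wording and punctuation back, one alternative is to never take the text apart at all and instead substitute each word in place. The following is only a minimal sketch under some assumptions: `SCORES` is a hypothetical score table, `annotate` is a made-up name, and the `normalize` hook uses plain lowercasing where the real code would presumably singularize and lemmatize.

```python
import re

# Hypothetical score table; real scores would come from the scoring algorithm.
SCORES = {"body": 2, "ocean": 3}

def annotate(text, scores=SCORES, normalize=str.lower):
    """Append ' (score)' after every word whose normalized form has a score."""
    def add_score(match):
        word = match.group(0)
        score = scores.get(normalize(word))
        # Words without a score are put back exactly as they were found.
        return word if score is None else f"{word} ({score})"
    # \w+ matches each run of word characters; spaces and punctuation
    # are never matched, so they stay where they are.
    return re.sub(r"\w+", add_score, text)

print(annotate("My body lies over the ocean, my body lies over the sea."))
```

Because only the matched words are replaced, capitalization, commas, and spacing survive untouched, and the same function accepts a sentence or a whole document.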

Answer 1: (score: 0)

Hope this helps. Based on your question, this worked for me.

Best regards!

"""
Python 3.7.2

Input:
Saved text in the file named as "original_text.txt"
My body lies over the ocean, my body lies over the sea. 
"""
output_text = []

# Context managers close both files even if an error occurs part-way through
with open('original_text.txt', 'r') as input_file, \
        open('processed_text.txt', 'w') as output_file:  # read input, save output
    for line in input_file:
        words = line.split()
        for word in words:
            if word == 'body':
                output_text.append('body (2)')
                output_file.write('body (2) ')
            elif word == 'body,':
                output_text.append('body (2),')
                output_file.write('body (2), ')
            elif word == 'ocean':
                output_text.append('ocean (3)')
                output_file.write('ocean (3) ')
            elif word == 'ocean,':
                output_text.append('ocean (3),')
                output_file.write('ocean (3), ')
            else:
                output_text.append(word)
                output_file.write(word + ' ')

print(output_text)
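The separate branches for 'body' and 'body,' grow quickly as more words and punctuation marks appear. A possible generalization, sketched here with a hypothetical `SCORES` table and made-up helper names (`annotate_token`, `annotate_line`), is to strip trailing punctuation from each whitespace-separated token, look the bare word up in a dictionary, and then reattach the punctuation. Lowercasing stands in for whatever normalization the real scorer uses.

```python
import string

# Hypothetical score table, used for illustration only.
SCORES = {"body": 2, "ocean": 3}

def annotate_token(token, scores=SCORES):
    """Return the token with ' (score)' inserted before any trailing punctuation."""
    core = token.rstrip(string.punctuation)   # e.g. 'ocean,' -> 'ocean'
    trailing = token[len(core):]              # e.g. ','
    score = scores.get(core.lower())
    if score is None:
        return token
    return f"{core} ({score}){trailing}"

def annotate_line(line, scores=SCORES):
    """Annotate every whitespace-separated token and rejoin with single spaces."""
    return " ".join(annotate_token(token, scores) for token in line.split())

print(annotate_line("My body lies over the ocean, my body lies over the sea."))
```

Note that splitting on whitespace and rejoining with single spaces normalizes runs of spaces and drops line breaks within a line, so this fits line-by-line file processing better than text where the exact spacing matters.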

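Answer 2: (score: 0)

Here is a working implementation. The function first parses the input text into a list so that each element is either a word or a run of punctuation (for example, a comma followed by a space). Once the words in the list have been processed, it joins the list back into a string and returns it.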

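import re
import inflection
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()  # assumed to be NLTK's WordNet lemmatizer

def word_score(text):
    # Split into alternating runs of word characters and non-word characters
    words_to_work_with = re.findall(r"\b\w+|\b\W+",text)
    for i,word in enumerate(words_to_work_with):
        if word.isalpha():
            words_to_work_with[i] = inflection.singularize(word).lower()
            # lemmatize() runs on the original token, so its result replaces the line above
            words_to_work_with[i] = lemmatizer.lemmatize(word)
            if word == 'body':
                words_to_work_with[i] = 'body (2)'
            elif word == 'ocean':
                words_to_work_with[i] = 'ocean (3)'
    return ''.join(words_to_work_with)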

txt = "My body lies over the ocean, my body lies over the sea."
output = word_score(txt)
print(output)

Output:

My body (2) lie over the ocean (3), my body (2) lie over the sea.

If you want to score more than two words, using a dictionary instead of if conditions is indeed a good idea.
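As a rough sketch of that dictionary variant: the version below keeps the tokenize-then-join idea but looks scores up in a table and also returns the (word, score) pairs, so the caller gets both the annotated text and the raw scores from one call. The `SCORES` table is hypothetical, the pattern `\w+|\W+` is a simplification chosen here so that every character lands in some token, and lowercasing stands in for singularizing and lemmatizing.

```python
import re

# Hypothetical score table; extend it instead of adding more if/elif branches.
SCORES = {"body": 2, "ocean": 3}

def word_score(text, scores=SCORES):
    """Return (annotated_text, [(word, score), ...]) for the given text."""
    # Every character belongs to exactly one token, so joining the
    # tokens again reproduces the input exactly.
    tokens = re.findall(r"\w+|\W+", text)
    scored = []
    for i, token in enumerate(tokens):
        if token.isalpha():
            score = scores.get(token.lower())
            scored.append((token, score))
            if score is not None:
                tokens[i] = f"{token} ({score})"
    return "".join(tokens), scored

annotated, pairs = word_score("My body lies over the ocean.")
print(annotated)
print(pairs)
```

Keeping the reconstruction inside the function, as here, means a sentence or a whole document can be passed in and comes back in the same shape, which is what the question asks for.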