Question

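I have three text blocks (in reality there are many more), each holding part of a full text. However, the original text was not split cleanly: some sentences are divided across two text blocks.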

text1 = {"We will talk about data about model specification parameter \
estimation and model application and the context where we will apply \
the simple example.Is an application where we would like to analyze \
the market for electric cars because"};

text2 = {"we are interested in the market of electric cars.The choice \
that we are interested in is the choice of each individual to \
purchase an electric car or not And we will see how"};

text3 = {"to address this question. Furthermore, it needs to be noted that this is only a model text and there is no content associated with it. "};

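For example, text2 starts with "we are interested in the market of electric cars". This is an incomplete first sentence that actually begins in text block 1 (see the last sentence there).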

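I want to make sure that every text block ends with a complete sentence, so I would like to move the incomplete first sentence of each block to the end of the preceding block. For example, the result here would be: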

 text1corr = {"We will talk about data about model specification parameter \
    estimation and model application and the context where we will apply \
    the simple example.Is an application where we would like to analyze \
    the market for electric cars because we are interested in the market of electric cars."};

text2corr = {"The choice that we are interested in is the choice of each individual to purchase an electric car or not And we will see how to address this question."};

text3corr = {"Furthermore, it needs to be noted that this is only a model text and there is no content associated with it. "};

How can this be done in Python? Is it even possible?

Answer 1

text1 = "We will talk about data about model specification parameter \
estimation and model application and the context where we will apply \
the simple example.Is an application where we would like to analyze \
the market for electric cars because"

text2 = "we are interested in the market of electric cars.The choice \
that we are interested in is the choice of each individual to \
purchase an electric car or not And we will see how"

text3 = "to address this question. Furthermore, it needs to be noted that this is only a model text and there is no content associated with it. "

textList = [text1,text2,text3]

corrected_list = []
prev_incomplete_sentence = ''
for index, text in enumerate(textList):
    if prev_incomplete_sentence:
        # drop the leading fragment (and its '.') already appended to the previous block
        corrected_text = text[len(prev_incomplete_sentence) + 1:]
    else:
        corrected_text = text
    if index + 1 < len(textList):
        # the text before the first '.' of the next block completes this block's last sentence
        prev_incomplete_sentence = textList[index + 1].split('.')[0]
        corrected_text += ' ' + prev_incomplete_sentence
    corrected_list.append(corrected_text)

Output:

['We will talk about data about model specification parameter estimation and model application and the context where we will apply the simple example.Is an application where we would like to analyze the market for electric cars because we are interested in the market of electric cars',
 'The choice that we are interested in is the choice of each individual to purchase an electric car or not And we will see how to address this question',
 ' Furthermore, it needs to be noted that this is only a model text and there is no content associated with it. ']
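Note that the output above loses the period at the end of each moved fragment and keeps a leading space in the last block. A minimal sketch of a variant that keeps the terminator, using `str.partition` instead of `split`, could look like the following. The function name `rebalance_chunks` and the sample strings are made up for illustration, and like the code above it assumes that every block after the first starts with a fragment ending in a period.

```python
def rebalance_chunks(chunks):
    """Move the leading fragment of each chunk to the end of the previous chunk.

    Assumption: every chunk after the first begins with a sentence fragment
    that is terminated by '.', as in the question's example.
    """
    result = []
    carry_len = 0  # leading characters of the current chunk already moved backwards
    for index, chunk in enumerate(chunks):
        body = chunk[carry_len:].strip()
        carry_len = 0
        if index + 1 < len(chunks):
            # head is the fragment, sep is '.' (or '' if the next chunk has no period)
            head, sep, _ = chunks[index + 1].partition('.')
            if sep:
                body = f"{body} {head.strip()}{sep}"
                carry_len = len(head) + len(sep)
        result.append(body)
    return result


sample = ["First part of", "a sentence. Second one starts", "here. Done."]
balanced = rebalance_chunks(sample)
print(balanced)
```

If the next block contains no period at all, nothing is moved and the current block is left unchanged.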

Answer 2

You can use the function zip_longest() to iterate over pairs of consecutive strings:

from itertools import zip_longest
import re

l = [text1, text2, text3]
new_l = []

for i, j in zip_longest(l, l[1:], fillvalue=''):
    # remove leading and trailing spaces
    i, j = i.strip(), j.strip()
    # remove the leading half sentence (it was already appended to the previous string)
    if i and i[0].islower():
        i = re.split(r'[.?!]', i, maxsplit=1)[-1].lstrip()
    # append the half sentence from the next string
    if i and i[-1].isalpha():
        j = re.split(r'[.?!]', j, maxsplit=1)[0]
        i = f"{i} {j}."
    new_l.append(i)

for i in new_l:
    print(i)

Output:

We will talk about data about model specification parameter estimation and model application and the context where we will apply the simple example.Is an application where we would like to analyze the market for electric cars because we are interested in the market of electric cars.
The choice that we are interested in is the choice of each individual to purchase an electric car or not And we will see how to address this question.
Furthermore, it needs to be noted that this is only a model text and there is no content associated with it.
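To see why the loop visits every string together with its successor, here is a small sketch of the pairing that `zip_longest(l, l[1:], fillvalue='')` produces. The list `chunks` is a made-up stand-in for the real text blocks.

```python
from itertools import zip_longest

chunks = ["a", "b", "c"]  # hypothetical stand-in for [text1, text2, text3]

# Pair each element with the one after it; the last element has no
# successor, so zip_longest pads it with the fillvalue ''.
pairs = list(zip_longest(chunks, chunks[1:], fillvalue=''))
print(pairs)  # [('a', 'b'), ('b', 'c'), ('c', '')]
```

A plain `zip` would stop at the shorter sequence and silently skip the last string, which is why `zip_longest` with an empty fill value is used here.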

Rearrange text blocks so that each text block ends with a complete sentence
