Extracting one utterance per line in Python

Time: 2018-08-24 09:24:49

Tags: python nltk python-textprocessing

I have text data with one utterance per line. I want to extract the utterances into a list that has exactly as many items as the file has lines.

Here is a sample of my data, input.txt:

I am very happy today.
Are you angry with me...? No?
Oh my dear, you look so beautiful.
Let's take a rest, I am so tired. 
Excuse me. This is my fault.

Currently, I use the following Python code:

import numpy as np
from nltk import tokenize

utterances = []
with open('input.txt', 'r') as myfile:
    for line in myfile.readlines():
        utterance = tokenize.sent_tokenize(line)
        utterances = np.append(utterances, utterance)
utterances = list(utterances)
len(utterances)

This gives a total of 7 utterances, whereas it should be 5 for the input data.

I expect the following output (a list of 5 utterances):

['I am very happy today.', 'Are you angry with me...? No?', 'Oh my dear, you look so beautiful.', "Let's take a rest, I am so tired.", 'Excuse me. This is my fault.']

However, the current Python code above produces the following output (7 sentences):

['I am very happy today.', 'Are you angry with me...?', 'No?', 'Oh my dear, you look so beautiful.', "Let's take a rest, I am so tired.", 'Excuse me.', 'This is my fault.']

Is there anything better than NLTK's tokenize.sent_tokenize? I think that is the reason I get the wrong result.

2 Answers:

Answer 0: (score: 1)

Just append each line to the list, without np.append() and without sent_tokenize:

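utterances = []
with open('input.txt', 'r') as myfile:
    # Each line of the file is one utterance, so no sentence tokenizer is needed.
    for line in myfile:
        # Remove only the trailing newline; the rest of the line is kept as-is.
        utterance = line.strip('\n')
        utterances.append(utterance)
print(utterances)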
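
As a slightly more defensive sketch of the same idea, the lines can be read through a small helper. The name `read_utterances` is hypothetical, and both the UTF-8 encoding and the decision to skip blank lines are assumptions that the question does not state. The `demo` function only writes the sample data to a temporary file so the example runs on its own.

```python
import os
import tempfile


def read_utterances(path):
    """Return one utterance per non-empty line of the file at `path`."""
    utterances = []
    # Assumption: the file is UTF-8 encoded.
    with open(path, 'r', encoding='utf-8') as f:
        for line in f:
            utterance = line.strip()  # drop the newline and stray spaces
            if utterance:  # assumption: blank lines are not utterances
                utterances.append(utterance)
    return utterances


def demo():
    """Write the sample data to a temporary file and read it back."""
    sample = (
        "I am very happy today.\n"
        "Are you angry with me...? No?\n"
        "Oh my dear, you look so beautiful.\n"
        "Let's take a rest, I am so tired. \n"
        "Excuse me. This is my fault.\n"
    )
    with tempfile.NamedTemporaryFile(
        'w', suffix='.txt', delete=False, encoding='utf-8'
    ) as tmp:
        tmp.write(sample)
        path = tmp.name
    try:
        return read_utterances(path)
    finally:
        os.remove(path)  # clean up the temporary file


if __name__ == '__main__':
    print(len(demo()))
```

Unlike `line.strip('\n')`, the plain `strip()` here also removes the trailing space on the fourth sample line; use `rstrip('\n')` instead if that whitespace matters.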

Answer 1: (score: 0)

In this line

utterance = tokenize.sent_tokenize(line)

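you are asking nltk to tokenize the data into sentences, not utterances. This function treats ? and . as sentence terminators. Two of your lines contain more than one sentence terminator, so the tokenizer splits each of them into two sentences. That is why your result contains 7 sentences: lines 2 and 5 are each split in two.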
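
The difference between the two counts can be reproduced without NLTK. In the sketch below, `naive_sentences` is a hypothetical helper that splits on whitespace following `.`, `?` or `!`. It is only a crude stand-in for a sentence tokenizer and is not the algorithm that `sent_tokenize` uses, although on this sample it splits the same two lines.

```python
import re

lines = [
    "I am very happy today.\n",
    "Are you angry with me...? No?\n",
    "Oh my dear, you look so beautiful.\n",
    "Let's take a rest, I am so tired. \n",
    "Excuse me. This is my fault.\n",
]


def naive_sentences(line):
    """Split a line after '.', '?' or '!' when whitespace follows.

    Assumption: a crude illustration only, not NLTK's tokenizer.
    """
    return re.split(r'(?<=[.?!])\s+', line.strip())


# One item per line: the result the question asks for.
per_line = [line.strip() for line in lines]

# One item per sentence: lines 2 and 5 are each split in two.
per_sentence = [s for line in lines for s in naive_sentences(line)]

print(len(per_line), len(per_sentence))
```

Treating each line as a unit keeps the count at 5, while any splitting on sentence terminators raises it to 7 for this data.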