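I have text data with one utterance per line. I want to extract it into a list that contains every utterance, so that the list has the same length as the number of lines.
This is my data, input.txt: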
I am very happy today.
Are you angry with me...? No?
Oh my dear, you look so beautiful.
Let's take a rest, I am so tired.
Excuse me. This is my fault.
Currently, I use the following Python code:
from nltk import tokenize
import numpy as np

utterances = []
with open('input.txt', 'r') as myfile:
    for line in myfile.readlines():
        # sent_tokenize splits each line into sentences
        utterance = tokenize.sent_tokenize(line)
        utterances = np.append(utterances, utterance)
utterances = list(utterances)
len(utterances)
This gives a total of 7 utterances, whereas for this input data it should be 5.
I want to get the following output (a list of 5 utterances):
['I am very happy today.', 'Are you angry with me...? No?', 'Oh my dear, you look so beautiful.', "Let's take a rest, I am so tired.", 'Excuse me. This is my fault.']
Instead, the current Python code above produces the following output (7 sentences):
['I am very happy today.', 'Are you angry with me...?', 'No?', 'Oh my dear, you look so beautiful.', "Let's take a rest, I am so tired.", 'Excuse me.', 'This is my fault.']
Is there anything better than NLTK's tokenize.sent_tokenize? I think it is the reason I am getting the wrong result.
Answer 0 (score: 1)
Just append each line to the list, without np.append() and without sent_tokenize:
utterances = []
with open('input.txt', 'r') as myfile:
    for line in myfile.readlines():
        # Keep the whole line as one utterance; only drop the trailing newline
        utterance = line.strip('\n')
        utterances.append(utterance)
print(utterances)
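A slightly more compact variant of the same idea, as a minimal sketch: it iterates over the file object directly and builds the list with a comprehension. The helper name `read_utterances` is hypothetical, and unlike the snippet above it strips all surrounding whitespace and skips blank lines, which is an assumption about what is wanted. The sample data is written to a temporary file only so that the sketch runs on its own.

```python
import os
import tempfile

# Sample data from the question, one utterance per line.
SAMPLE = """I am very happy today.
Are you angry with me...? No?
Oh my dear, you look so beautiful.
Let's take a rest, I am so tired.
Excuse me. This is my fault.
"""


def read_utterances(path):
    """Return one list entry per non-empty line of the file (hypothetical helper)."""
    with open(path, "r", encoding="utf-8") as f:
        # Iterating over the file yields one line at a time;
        # strip surrounding whitespace and skip lines that are blank.
        return [line.strip() for line in f if line.strip()]


# Write the sample to a temporary input.txt so the sketch is self-contained.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "input.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write(SAMPLE)
    utterances = read_utterances(path)

print(len(utterances))  # 5
print(utterances)
```

With a real file, calling `read_utterances('input.txt')` should be enough; no tokenizer is involved, so a line is never split.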
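Answer 1 (score: 0)
In this line
utterance = tokenize.sent_tokenize(line)
you are asking nltk to tokenize the data into sentences, not utterances. This function treats ? and . as marking the end of a sentence. Two lines of your data contain more than one sentence terminator, so the tokenizer treats each of them as two sentences. That is why your result contains 7 sentences: lines 2 and 5 are each split into two sentences.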
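To make the arithmetic above concrete, here is a rough sketch that counts sentences per line. The function `naive_sentences` is a hypothetical stand-in that breaks at whitespace following `.`, `?` or `!`; it is not NLTK's actual algorithm and only mimics the splits reported in the question for this particular data.

```python
import re

# The five lines of input.txt from the question.
lines = [
    "I am very happy today.",
    "Are you angry with me...? No?",
    "Oh my dear, you look so beautiful.",
    "Let's take a rest, I am so tired.",
    "Excuse me. This is my fault.",
]


def naive_sentences(line):
    """Split at whitespace that follows a sentence terminator.

    Hypothetical stand-in for a sentence tokenizer, NOT NLTK's implementation.
    """
    return re.split(r"(?<=[.?!])\s+", line.strip())


per_line = [naive_sentences(line) for line in lines]
counts = [len(sentences) for sentences in per_line]
flattened = [s for sentences in per_line for s in sentences]

print(counts)          # [1, 2, 1, 1, 2]
print(len(lines))      # 5 utterances (one per line)
print(len(flattened))  # 7 sentences once every line is split
```

Keeping the nested `per_line` structure instead of flattening it is one way to retain sentence boundaries while still having exactly one entry per utterance.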