I'm trying to isolate the first word of each sentence in a series of sentences using Python / NLTK.
I created an unimportant series of sentences (the_text), and while I'm able to break it into tokenized sentences, I can't successfully isolate the first word of each sentence into a list (first_words).
[['Here', 'is', 'some', 'text', '.'], ['There', 'is', 'a', 'a', 'person', 'on', 'the', 'lawn', '.'], ['I', 'am', 'confused', '.'], ['There', 'is', 'more', '.'], ['Here', 'is', 'some', 'more', '.'], ['I', 'do', "n't", 'know', 'anything', '.'], ['I', 'should', 'add', 'more', '.'], ['Look', ',', 'here', 'is', 'more', 'text', '.'], ['How', 'great', 'is', 'that', '?']]
the_text="Here is some text. There is a a person on the lawn. I am confused. "
the_text= (the_text + "There is more. Here is some more. I don't know anything. ")
the_text= (the_text + "I should add more. Look, here is more text. How great is that?")
import nltk

sents_tok=nltk.sent_tokenize(the_text)
sents_words=[nltk.word_tokenize(sent) for sent in sents_tok]
number_sents=len(sents_words)
print (number_sents)
print(sents_words)
for i in sents_words:
    first_words=[]
    first_words.append(sents_words (i,0))
print(first_words)
Thanks for your help!
Answer 0 (score: 0)
There are three problems with your code, and you have to fix all three of them to make it work:
for i in sents_words:
    first_words=[]
    first_words.append(sents_words (i,0))
First, you're emptying first_words each time through the loop: move first_words=[] outside the loop.
Second, you're mixing up function-calling syntax (parentheses) with indexing syntax (square brackets): you want sents_words[i][0].
Third, for i in sents_words: iterates over the elements of sents_words, not the indices. So you just want i[0]. (Or, alternatively, for i in range(len(sents_words)), but there's no reason to do that.)
So, putting it all together:
first_words=[]
for i in sents_words:
    first_words.append(i[0])
If you know anything about comprehensions, you may recognize this pattern (start with an empty list, iterate over something, appending some expression to the list) as exactly what a list comprehension does:
first_words = [i[0] for i in sents_words]
If you don't, this is a good time to learn about comprehensions... or to just not worry about this part. :)
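As an aside beyond the original answer: if you ever do need the index as well as the element, enumerate is the idiomatic alternative to range(len(...)). The two-sentence sample below is a hypothetical stand-in for the question's data:

```python
# Hypothetical stand-in for sents_words from the question.
sents_words = [['Here', 'is', 'some', 'text', '.'],
               ['I', 'am', 'confused', '.']]

# enumerate yields (index, element) pairs, so range(len(...)) is unnecessary.
indexed_firsts = []
for i, sent in enumerate(sents_words):
    indexed_firsts.append((i, sent[0]))
```

This gives you both the sentence position and its first word in one pass.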
Answer 1 (score: 0)
>>> sents_words = [['Here', 'is', 'some', 'text', '.'], ['There', 'is', 'a', 'a', 'person', 'on', 'the', 'lawn', '.'], ['I', 'am', 'confused', '.'], ['There', 'is', 'more', '.'], ['Here', 'is', 'some', 'more', '.'], ['I', 'do', "n't", 'know', 'anything', '.'], ['I', 'should', 'add', 'more', '.'], ['Look', ',', 'here', 'is', 'more', 'text', '.'], ['How', 'great', 'is', 'that', '?']]
You can use a loop to append to a previously initialized list:
>>> first_words = []
>>> for i in sents_words:
... first_words.append(i[0])
...
>>> print(*first_words)
Here There I There Here I I Look How
Or a comprehension (replace those square brackets with parentheses to create a generator instead):
>>> first_words = [i[0] for i in sents_words]
>>> print(*first_words)
Here There I There Here I I Look How
Or, if you don't need to save it for later use, you can print the items directly:
>>> print(*(i[0] for i in sents_words))
Here There I There Here I I Look How
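A side note not in the original answer: str.join is another common way to produce the same space-separated line, as a single string you can reuse (a small hypothetical sample is used here):

```python
# Hypothetical sample of tokenized sentences.
sents_words = [['Here', 'is', 'some', 'text', '.'],
               ['There', 'is', 'more', '.']]

# Build one space-separated string from the first word of each sentence.
line = ' '.join(sent[0] for sent in sents_words)
print(line)  # Here There
```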
Answer 2 (score: 0)
Here is an example of how to access items in a list and in a list of lists:
>>> fruits = ['apple','orange', 'banana']
>>> fruits[0]
'apple'
>>> fruits[1]
'orange'
>>> cars = ['audi', 'ford', 'toyota']
>>> cars[0]
'audi'
>>> cars[1]
'ford'
>>> things = [fruits, cars]
>>> things[0]
['apple', 'orange', 'banana']
>>> things[1]
['audi', 'ford', 'toyota']
>>> things[0][0]
'apple'
>>> things[0][1]
'orange'
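One extra detail beyond the original answer: the same bracket syntax also accepts negative indices, which count from the end of the list:

```python
fruits = ['apple', 'orange', 'banana']

# -1 is the last item, -2 the second-to-last.
last = fruits[-1]          # 'banana'
second_last = fruits[-2]   # 'orange'
```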
Now, for your problem:
>>> from nltk import sent_tokenize, word_tokenize
>>>
>>> the_text="Here is some text. There is a a person on the lawn. I am confused. There is more. Here is some more. I don't know anything. I should add more. Look, here is more text. How great is that?"
>>>
>>> tokenized_text = [word_tokenize(s) for s in sent_tokenize(the_text)]
>>>
>>> first_words = []
>>> # Iterates through the sentences.
... for sent in tokenized_text:
...     print(sent)
...
['Here', 'is', 'some', 'text', '.']
['There', 'is', 'a', 'a', 'person', 'on', 'the', 'lawn', '.']
['I', 'am', 'confused', '.']
['There', 'is', 'more', '.']
['Here', 'is', 'some', 'more', '.']
['I', 'do', "n't", 'know', 'anything', '.']
['I', 'should', 'add', 'more', '.']
['Look', ',', 'here', 'is', 'more', 'text', '.']
['How', 'great', 'is', 'that', '?']
>>> # First word in each sentence.
... for sent in tokenized_text:
...     word0 = sent[0]
...     first_words.append(word0)
...     print(word0)
...
...
Here
There
I
There
Here
I
I
Look
How
>>> print(first_words)
['Here', 'There', 'I', 'There', 'Here', 'I', 'I', 'Look', 'How']
Or in a single line with a list comprehension:
# From the_text, you extract the first word directly
first_words = [word_tokenize(s)[0] for s in sent_tokenize(the_text)]
# From tokenized_text
tokenized_text= [word_tokenize(s) for s in sent_tokenize(the_text)]
first_words = [s[0] for s in tokenized_text]
Answer 3 (score: 0)
Another option, though it's very similar to abarnert's suggestion:
first_words = []
for i in range(number_sents):
    first_words.append(sents_words[i][0])
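One caveat none of the answers mention: if any tokenized sentence comes back empty, indexing with [0] raises an IndexError. A defensive sketch (with hypothetical sample data) might guard against that:

```python
# Hypothetical data that includes an empty sentence.
sents_words = [['Here', 'is', 'text', '.'], [], ['How', 'great', '?']]

first_words = []
for sent in sents_words:
    if sent:  # skip empty token lists instead of raising IndexError
        first_words.append(sent[0])
```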