Split a list of sentences into a list of individual words

Time: 2015-03-16 01:41:14

Tags: python

I have a list made up of lines:

lines =  ['The query complexity of estimating weighted averages.',
     'New bounds for the query complexity of an algorithm that learns',
     'DFAs with correction equivalence queries.',
     'general procedure to check conjunctive query containment.']

I need to store it in a list as individual words:

lines =  ['The','query', 'complexity' ,'of' ,'estimating', 'weighted','averages.'
     ,'New' ......]

How can I get this as a list of individual words?

4 answers:

Answer 0 (score: 3)

You can use a list comprehension:

>>> lines =  [
...     'The query complexity of estimating weighted averages.',
...     'New bounds for the query complexity of an algorithm that learns',
... ]
>>> [word for line in lines for word in line.split()]
['The', 'query', 'complexity', 'of', 'estimating', 'weighted', 'averages.', 'New', 'bounds', 'for', 'the', 'query', 'complexity', 'of', 'an', 'algorithm', 'that', 'learns']
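
For readers less used to nested comprehensions, the one-liner above can be unrolled into explicit loops. This is only a sketch of the same logic; the name `words` is introduced here for illustration and does not appear in the answer.

```python
lines = [
    'The query complexity of estimating weighted averages.',
    'New bounds for the query complexity of an algorithm that learns',
]

# Unrolled form of: [word for line in lines for word in line.split()]
words = []
for line in lines:               # the outer loop is written first in the comprehension
    for word in line.split():    # split() with no argument splits on runs of whitespace
        words.append(word)

print(words)
```

The order of the `for` clauses in the comprehension matches the order of the nested loops, read from left to right.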

Answer 1 (score: 2)

You can join all the lines and then use split:

" ".join(lines).split()

Or you can split each line and chain the results together:

from itertools import chain
list(chain(*map(str.split, lines)))
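
As a possible variation on the second approach, `chain.from_iterable` accepts the iterable of lists directly, so the `*` unpacking is not needed. The sketch below assumes the four-line `lines` list from the question; the names `joined` and `chained` are made up here to compare the two approaches.

```python
from itertools import chain

lines = [
    'The query complexity of estimating weighted averages.',
    'New bounds for the query complexity of an algorithm that learns',
    'DFAs with correction equivalence queries.',
    'general procedure to check conjunctive query containment.',
]

# Approach 1: build one big string, then split it on runs of whitespace.
joined = " ".join(lines).split()

# Approach 2: split each line, then flatten the resulting lists.
# chain.from_iterable consumes the map object lazily instead of unpacking it.
chained = list(chain.from_iterable(map(str.split, lines)))

print(joined == chained)
```

For this input both approaches should give the same list; the second avoids building the intermediate joined string.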

Answer 2 (score: 0)

You can do it the following way:

import nltk
nltk.download('punkt')
from nltk.tokenize import word_tokenize

lines =  ['The query complexity of estimating weighted averages.',
 'New bounds for the query complexity of an algorithm that learns',
 'DFAs with correction equivalence queries.',
 'general procedure to check conjunctive query containment.']

joint_words = ' '.join(lines)

separated_words = word_tokenize(joint_words)

print(separated_words)

The output will be:

['The', 'query', 'complexity', 'of', 'estimating', 'weighted', 'averages', '.', 'New', 'bounds', 'for', 'the', 'query', 'complexity', 'of', 'an', 'algorithm', 'that', 'learns', 'DFAs', 'with', 'correction', 'equivalence', 'queries', '.', 'general', 'procedure', 'to', 'check', 'conjunctive', 'query', 'containment', '.']

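In addition, if you want to merge each period (which appears as a separate string in the list) with the preceding string, run the following code: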

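merged = []
for token in separated_words:
    if token == '.' and merged:
        merged[-1] += token    # attach the period to the previous word
    else:
        merged.append(token)   # keep every other token unchanged
separated_words = merged       # build a new list rather than deleting while iterating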

print(separated_words)

The output will be:

['The', 'query', 'complexity', 'of', 'estimating', 'weighted', 'averages.', 'New', 'bounds', 'for', 'the', 'query', 'complexity', 'of', 'an', 'algorithm', 'that', 'learns', 'DFAs', 'with', 'correction', 'equivalence', 'queries.', 'general', 'procedure', 'to', 'check', 'conjunctive', 'query', 'containment.']

Answer 3 (score: -1)

It sounds like you want something like this, where the string is split on whitespace:

lines[0].split()

The above splits only the first string in your list of lines (lines[0]) on the whitespace in that string.
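
Since `lines[0].split()` handles a single string, covering the whole list means repeating the split for every element. A minimal sketch, assuming the four-line `lines` list from the question (the names `first_only` and `all_words` are hypothetical):

```python
lines = [
    'The query complexity of estimating weighted averages.',
    'New bounds for the query complexity of an algorithm that learns',
    'DFAs with correction equivalence queries.',
    'general procedure to check conjunctive query containment.',
]

# Splitting only the first element, as in the snippet above.
first_only = lines[0].split()

# Repeating the split for every element and collecting the results.
all_words = []
for line in lines:
    all_words.extend(line.split())   # extend adds each word, not the sub-list itself

print(first_only)
print(len(all_words))
```

Using `extend` rather than `append` keeps the result flat, which is the shape the question asks for.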