Question

The input is list1 = ['water vapor', 'evaporation', 'carbon dioxide', 'sunlight', 'green plants']

The output should be

list1=['evaporation','sunlight']
for i in list1:
    " " not in i
    print i

False - water vapor
True - evaporation
False - carbon dioxide
True - sunlight
False - green plants

Answer 1

If you need to remove elements from a list based on a condition, you can use filter() or a list comprehension.

You can test whether a word is not a unigram with: " " in word.
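
As a quick illustration of that check, here is a minimal sketch that prints the result for each entry of the question's list. It assumes Python 3 syntax, and the names `words` and `results` are made up for this example:

```python
words = ['water vapor', 'evaporation', 'carbon dioxide', 'sunlight', 'green plants']

# For each entry, record whether it is a unigram (contains no space).
results = [(" " not in word, word) for word in words]

for is_unigram, word in results:
    print("{} - {}".format(is_unigram, word))
```

This should print the same True/False table the question shows, with True next to 'evaporation' and 'sunlight'.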

Basically, if you want to build a new list with a for loop, you can write something like this:

new_list = []
for word in words:
    if " " in word:  # This is not an unigram word
        new_list.append(word)

Thanks to Python's syntax, this can be written more simply:

new_list = [word for word in words if " " in word]

Alternatively:

new_list = list(filter(lambda word: " " in word, words))

Both return the list of words that are not unigrams, as stated in the question title (even though your example returns the unigrams...)

Answer 2

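Aren't unigrams strings that contain a single word, such as "evaporation" and "sunlight"? It looks to me like you want to keep the unigrams, not remove them.

You can do that with a list comprehension: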

list1 = ['water vapor','evaporation','carbon dioxide','sunlight','green plants']
unigrams = [word for word in list1 if ' ' not in word]

>>> print(unigrams)
['evaporation', 'sunlight']

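This assumes that words are separated by one or more spaces. That may be an oversimplification of what constitutes an n-gram for n > 1, because other whitespace characters can also delimit words, e.g. tabs, newlines, various Unicode whitespace code points, etc. You can use a regular expression: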

import re

list1 = ['water vapor','evaporation','carbon dioxide','sunlight','green plants', 'word with\ttab', 'word\nword', 'abcd\refg']
unigram_pattern = re.compile(r'^\S+$')    # string contains only non-whitespace chars
unigrams = [word for word in list1 if unigram_pattern.match(word)]

>>> print(unigrams)
['evaporation', 'sunlight']

The pattern ^\S+$ matches one or more non-whitespace characters from the start of the string through to its end.
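
One caveat with that pattern: `$` also matches just before a single trailing newline, so a string such as 'evaporation\n' still passes. A possible stricter variant is sketched below. It assumes Python 3.4 or later, where `re.fullmatch` is available, and the helper name `is_unigram` is hypothetical:

```python
import re

NON_WHITESPACE = re.compile(r'\S+')

def is_unigram(text):
    """Return True if text is non-empty and contains no whitespace at all."""
    # fullmatch requires the whole string to match, including any trailing newline.
    return NON_WHITESPACE.fullmatch(text) is not None

samples = ['evaporation', 'evaporation\n', 'water vapor', 'word\twith tab', '']

strict = [s for s in samples if is_unigram(s)]
lenient = [s for s in samples if re.match(r'^\S+$', s)]

print(strict)   # only 'evaporation'
print(lenient)  # also lets 'evaporation\n' through
```

Whether the trailing-newline case matters depends on where the strings come from; for lines read from a file without stripping, it usually does.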

If you need to support Unicode whitespace, you can specify the unicode flag when compiling the pattern:

list1.extend([u'punctuation\u2008space', u'NO-BREAK\u00a0SPACE'])
unigram_pattern = re.compile(r'^\S+$', re.UNICODE)
unigrams = [word for word in list1 if unigram_pattern.match(word)]

>>> print(unigrams)
['evaporation', 'sunlight']

Now it will also filter out strings that contain Unicode whitespace, e.g. the no-break space (U+00A0) and the punctuation space (U+2008).
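
On Python 3 this gets a little simpler, since `str` patterns are Unicode-aware by default. A regex-free sketch under that assumption relies on `str.split()`, which with no arguments splits on any run of whitespace, Unicode whitespace included. The helper name `is_single_token` and the list name `candidates` are made up for illustration:

```python
def is_single_token(text):
    """Return True if text is exactly one token with no whitespace anywhere."""
    # split() drops all whitespace, so the result equals [text] only when
    # text is non-empty and contains no whitespace character at all.
    return text.split() == [text]

candidates = [
    'water vapor', 'evaporation', 'carbon dioxide', 'sunlight', 'green plants',
    'word with\ttab', 'word\nword', 'abcd\refg',
    'punctuation\u2008space', 'NO-BREAK\u00a0SPACE',
]

unigrams = [c for c in candidates if is_single_token(c)]
print(unigrams)
```

Like the regular expression above, this rejects strings with leading or trailing whitespace as well as empty strings.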

How to get unigrams (words) from a list in Python?
