Given a list of strings and a list of lists, how do I return word counts?

Asked: 2015-03-26 23:52:31

Tags: python arrays list tuples counter

Suppose I have a long list of lists of strings containing punctuation, spaces, and so on, like this:

list_1 = [[the guy was plaguy but unable to play football, but he was able to play tennis],[That was absolute cool],...,[This is an implicit living.]]

I also have another long list:

list_2 =['unable', 'unquestioning', 'implicit',...,'living', 'relative', 'comparative']

How can I extract the count or frequency of all the words of list_2 that appear in each sublist of list_1? For example, given the lists above:

list_2 =['unable', 'unquestioning', 'implicit',...,'living', 'relative', 'comparative']

[the guy was unable to play football, but he was able to play tennis]

Since unable (a word from list_2) appears in the sublist above, the count for this sublist is 1.

list_2 =['unable', 'unquestioning', 'implicit',...,'living', 'relative', 'comparative']

[That was absolute cool]

Since no word from list_2 appears in the sublist above, the count is 0.

list_2 =['unable', 'unquestioning', 'implicit',...,'living', 'relative', 'comparative']

[This is an implicit living.]

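Since implicit and living (both words from list_2) appear in the sublist above, the count for this sublist is 2.

The desired output is [1,0,2].

Any idea how to approach this task so that it returns a list of counts? Thanks in advance.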

For example:

>>> [sum(1 for word in list_2 if word in sentence) for sublist in list_1 for sentence in sublist]

This is wrong, because it confuses the two words guy and plaguy. Any idea how to fix this?

3 answers:

Answer 0 (score: 2)

Use the built-in function sum and a list comprehension:

>>> list_1 = [['the guy was unable to play football, but he was able to play tennis'],['That was absolute cool'],['This is implicit living.']]
>>> list_2 =['unable', 'unquestioning', 'implicit','living', 'relative', 'comparative']   
>>> [sum(1 for word in list_2 if word in sentence) for sublist in list_1 for sentence in sublist]

[1, 0, 2]
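Note that `word in sentence` is a substring test when `sentence` is a string, so a lookup word also counts when it only occurs inside a longer word. The sketch below illustrates the difference with made-up data (the lookup words and the sentence are hypothetical, chosen only to trigger the effect), and compares against whole tokens as one possible way around it:

```python
import re

list_2 = ["guy", "able"]              # hypothetical lookup words
sentence = "a plaguy, unable player"  # hypothetical sentence

# `in` on a string is a substring test: "guy" is found inside "plaguy"
# and "able" is found inside "unable".
substring_count = sum(1 for word in list_2 if word in sentence)

# Comparing against whole tokens avoids those partial matches.
tokens = set(re.findall(r"\w+", sentence))
whole_word_count = sum(1 for word in list_2 if word in tokens)

print(substring_count)   # 2
print(whole_word_count)  # 0
```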

Answer 1 (score: 1)

The trick is to use the split() method and a list comprehension. If you only split on whitespace:

list_1 = ["the guy was unable to play football but he was able to play tennis", "That was absolute cool", "This is implicit living"]

list_2 =['unable', 'unquestioning', 'implicit','living', 'relative', 'comparative']

print([sum(1 for j in list_2 if j in i.split()) for i in list_1])

However, if you want to tokenize on all non-alphanumeric characters, you should use re:

import re

list_1 = ["the guy was unable to play football,but he was able to play tennis", "That was absolute cool", "This is implicit living"]
list_2 =['unable', 'unquestioning', 'implicit','living', 'relative', 'comparative']

print([sum(1 for j in list_2 if j in re.split(r"\W+", i)) for i in list_1])

The \W character class matches any character that is not a word character (letter, digit, or underscore).
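The same tokenizing idea can be applied to the nested list from the question. The following is a minimal sketch, not the answerer's code: the helper name `count_words`, the lowercasing step, and the use of a set intersection are assumptions added for illustration, and the data is limited to the three sentences shown in the question.

```python
import re

list_1 = [
    ["the guy was unable to play football, but he was able to play tennis"],
    ["That was absolute cool"],
    ["This is an implicit living."],
]
list_2 = ["unable", "unquestioning", "implicit", "living", "relative", "comparative"]


def count_words(sublist, vocabulary):
    """Count distinct vocabulary words that occur as whole tokens in a sublist."""
    tokens = set()
    for sentence in sublist:
        # \w+ keeps runs of letters, digits and underscores; lower() is an
        # added assumption so that "Implicit" would also match "implicit".
        tokens.update(re.findall(r"\w+", sentence.lower()))
    return len(tokens & vocabulary)


vocabulary = set(list_2)  # set lookup instead of scanning list_2 per token
counts = [count_words(sublist, vocabulary) for sublist in list_1]
print(counts)  # [1, 0, 2]
```

This counts each lookup word at most once per sublist; if repeated occurrences should count separately, a collections.Counter over the tokens would be the natural replacement for the set.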

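Answer 2 (score: 1)

I would rather use a regular expression. First, because you need to match whole words, which is complicated with other string-search methods. Also, even though it looks like a bazooka, it is usually quite efficient.

First build a regular expression from list_2, then use it to search the sentences of list_1. The regular expression is constructed as "(\bword1\b|\bword2\b|...)", which means "whole word1 or whole word2 or ...". \b matches at the beginning or end of a word.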
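To make the construction concrete, here is a small sketch that builds the pattern for a shortened word list and shows what the \b anchors accept and reject. The word list and the second test string are illustrative only.

```python
import re

words = ["unable", "implicit", "living"]  # shortened, illustrative list

# Join the words with \b|\b and wrap the result in (\b ... \b).
pattern = r"(\b{}\b)".format(r"\b|\b".join(words))
print(pattern)  # (\bunable\b|\bimplicit\b|\bliving\b)

regex = re.compile(pattern)

# Whole words match, even when followed by punctuation.
print(regex.findall("This is an implicit living."))  # ['implicit', 'living']

# Words embedded in longer words do not match.
print(regex.findall("livings are unimplicit"))  # []
```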

I assume you want one result for each sublist of list_1, not one for each sentence of each sublist.

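import re

# One alternation matching any word of list_2 as a whole word;
# re.escape guards against words that contain regex metacharacters.
_regex = re.compile(r"(\b{}\b)".format(r"\b|\b".join(map(re.escape, list_2))))
word_counts = [
    # Total number of whole-word matches over all sentences of the sublist.
    sum(len(_regex.findall(sentence)) for sentence in sublist)
    for sublist in list_1
]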

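Here you can find a whole sample code that compares the performance with a plain string search, bearing in mind that matching whole words requires more work and is therefore less efficient.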
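The linked sample is not reproduced here. As a rough, independent sketch of how such a comparison could be set up, the snippet below times a whole-word regex count against a plain substring count with timeit. The function names, the data (the three sentences from the question), and the repetition count are assumptions, and timings will vary by machine.

```python
import re
import timeit

list_1 = [
    ["the guy was unable to play football, but he was able to play tennis"],
    ["That was absolute cool"],
    ["This is an implicit living."],
]
list_2 = ["unable", "unquestioning", "implicit", "living", "relative", "comparative"]

_regex = re.compile(r"(\b{}\b)".format(r"\b|\b".join(map(re.escape, list_2))))


def count_with_regex():
    # Whole-word matches per sublist.
    return [sum(len(_regex.findall(s)) for s in sublist) for sublist in list_1]


def count_with_substring():
    # Plain substring test; also matches words embedded in longer words.
    return [sum(1 for s in sublist for w in list_2 if w in s) for sublist in list_1]


regex_result = count_with_regex()
substring_result = count_with_substring()
print(regex_result, substring_result)

# Time each approach; no particular outcome is claimed here.
for fn in (count_with_regex, count_with_substring):
    print(fn.__name__, timeit.timeit(fn, number=1000))
```

On this sample both functions return the same counts; they diverge as soon as a lookup word occurs only inside a longer word.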