Suppose I have 5 lists in total:
# Sample data
a1 = [1, 2, 3, 4, 5, 6, 7]
a2 = [1, 21, 35, 45, 58]
a3 = [1, 2, 15, 27, 36]
a4 = [2, 3, 1, 45, 85, 51, 105, 147, 201]
a5 = [3, 458, 665]
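I need to find the elements of a1 that occur 3 or more times across a1, a2, a3, a4 and a5, counting the occurrence in a1 itself
or
I need every element whose frequency across all the lists (a1 - a5) is greater than or equal to 3, together with its frequency.
For the sample above, the expected output would be: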
1 with a frequency of 4
2 with a frequency of 3
3 with a frequency of 3
In my real problem the number of lists and their lengths are huge. Can anyone suggest a simple and fast way to do this?
Thanks,
Prithivi
Answer 0 (score: 1)
As Patrick wrote in the comments, chain and Counter are your friends:
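import itertools
import collections

targets = [1, 2, 3, 4, 5, 6, 7]
lists = [
    [1, 21, 35, 45, 58],
    [1, 2, 15, 27, 36],
    [2, 3, 1, 45, 85, 51, 105, 147, 201],
    [3, 458, 665],
]

# Flatten the other lists and count every value in a single pass
chained = itertools.chain(*lists)
counter = collections.Counter(chained)
# targets is not counted, so 2 hits here means 3 occurrences overall
result = [(t, counter[t]) for t in targets if counter[t] >= 2]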
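This gives (the counts cover only lists, so add 1 for the occurrence in targets to match the frequencies in the question):
>>> result
[(1, 3), (2, 2), (3, 2)]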
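You say you have many lists and that each one is long. Try this solution and see how long it takes. If it needs to be faster, that is a separate question; it may turn out that collections.Counter is too slow for your application.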
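The snippet above counts only a2 to a5. If the frequencies should include a1 itself, as in the question's expected output, one possible variation is to count across all five lists and filter at 3. This is a minimal sketch; the names all_lists, min_freq and frequent are illustrative and do not come from the answer.

```python
import collections
import itertools

a1 = [1, 2, 3, 4, 5, 6, 7]
a2 = [1, 21, 35, 45, 58]
a3 = [1, 2, 15, 27, 36]
a4 = [2, 3, 1, 45, 85, 51, 105, 147, 201]
a5 = [3, 458, 665]

all_lists = [a1, a2, a3, a4, a5]  # illustrative name
min_freq = 3                      # threshold taken from the question

# Count every value across all lists in one pass
counter = collections.Counter(itertools.chain.from_iterable(all_lists))

# Keep (value, frequency) pairs that meet the threshold, sorted by value
frequent = sorted((value, count) for value, count in counter.items()
                  if count >= min_freq)
print(frequent)  # [(1, 4), (2, 3), (3, 3)]
```

chain.from_iterable avoids unpacking the outer list into arguments, which may matter when there are very many lists.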
Answer 1 (score: 1)
a1 = [1, 2, 3, 4, 5, 6, 7]
a2 = [1, 21, 35, 45, 58]
a3 = [1, 2, 15, 27, 36]
a4 = [2, 3, 1, 45, 85, 51, 105, 147, 201]
a5 = [3, 458, 665]
b = a1 + a2 + a3 + a4 + a5  # concatenate all lists into b
for x in set(b):  # iterate through the distinct values in b
    print(x, 'with a frequency of', b.count(x))  # print each value's count
This will give you:
1 with a frequency of 4
2 with a frequency of 3
3 with a frequency of 3
4 with a frequency of 1
5 with a frequency of 1
6 with a frequency of 1
7 with a frequency of 1
35 with a frequency of 1
36 with a frequency of 1
...
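Edit:
I made the lists much longer using:
import random
for x in range(9000):
    a1.append(random.randint(1, 10000))
    a2.append(random.randint(1, 10000))
    a3.append(random.randint(1, 10000))
    a4.append(random.randint(1, 10000))
Using time, I checked how long the program took (it saved the results instead of printing them): 4.9395 seconds. I hope that is fast enough.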
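Most of that time is likely spent in b.count(x), which rescans the whole list once per distinct value. The sketch below compares it with a single-pass collections.Counter on random data. The seed, the size n (smaller than the 9000 above so the run stays short) and the use of time.perf_counter are my assumptions, and the timings will differ from machine to machine.

```python
import collections
import random
import time

random.seed(0)  # assumption: fixed seed so runs are repeatable
n = 2000        # assumption: smaller than the 9000 above, to keep the run short
b = [random.randint(1, 10000) for _ in range(4 * n)]

# list.count rescans b once for every distinct value
start = time.perf_counter()
by_count = {x: b.count(x) for x in set(b)}
t_count = time.perf_counter() - start

# Counter makes a single pass over b
start = time.perf_counter()
by_counter = collections.Counter(b)
t_counter = time.perf_counter() - start

print('list.count:', round(t_count, 4), 's; Counter:', round(t_counter, 4), 's')
```

Both approaches produce the same frequencies, so swapping one for the other should not change the output, only the running time.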
Answer 2 (score: 1)
This solution using pandas is very fast:
import pandas as pd

a1 = [1, 2, 3, 4, 5, 6, 7]
a2 = [1, 21, 35, 45, 58]
a3 = [1, 2, 15, 27, 36]
a4 = [2, 3, 1, 45, 85, 51, 105, 147, 201]
a5 = [3, 458, 665]

# convert each list to a DataFrame with an indicator column
A = [a1, a2, a3, a4, a5]
D = [pd.DataFrame({'A': a, 'ind{0}'.format(i): [1] * len(a)}) for i, a in enumerate(A)]

# left join each dataframe onto a1
# if you know the integers are distinct then you don't need drop_duplicates
df = pd.merge(D[0], D[1].drop_duplicates(['A']), how='left', on='A')
for d in D[2:]:
    df = pd.merge(df, d.drop_duplicates(['A']), how='left', on='A')

# sum across the indicators (missing matches are NaN and are skipped)
df['freq'] = df[['ind{0}'.format(i) for i, d in enumerate(D)]].sum(axis=1)

# keep only frequencies of 3 or more
print(df[['A', 'freq']].loc[df['freq'] >= 3])
A test with the larger inputs below runs in under 0.2 seconds on my machine:
import numpy.random as npr

a1 = list(range(10000))
a2 = npr.randint(10000, size=100000)
a3 = npr.randint(10000, size=100000)
a4 = npr.randint(10000, size=100000)
a5 = npr.randint(10000, size=100000)
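Because of drop_duplicates, the pandas solution counts each value at most once per list, so its freq is the number of lists that contain the value rather than the raw number of occurrences. For comparison, the same counting rule can be sketched with the standard library alone. The names others and containing are illustrative, and the sketch assumes, as in the sample, that a1 has no repeated values.

```python
import collections

a1 = [1, 2, 3, 4, 5, 6, 7]
a2 = [1, 21, 35, 45, 58]
a3 = [1, 2, 15, 27, 36]
a4 = [2, 3, 1, 45, 85, 51, 105, 147, 201]
a5 = [3, 458, 665]
others = [a2, a3, a4, a5]  # illustrative name

# Count each value at most once per list (mirrors drop_duplicates)
containing = collections.Counter()
for lst in others:
    containing.update(set(lst))

# Add 1 for a1 itself, then keep a1's values present in at least 3 lists
result = [(x, containing[x] + 1) for x in a1 if containing[x] + 1 >= 3]
print(result)  # [(1, 4), (2, 3), (3, 3)]
```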