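Finding matches in two sets of synchronized lists

Time: 2012-03-21 12:22:47

Tags: python

I have two sets of synchronized lists, as shown below. (By synchronized I mean that 'A' in cal belongs to 12 in cpos, and 'A' in mal belongs to 11 in mpos.)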

SET1

cpos = [12, 13, 14, 15]
cal = ['A', 'T', 'C', 'G']

SET2

mpos = [11, 12, 13, 16]
mal = ['A', 'T', 'T', 'G']

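I want to find the matches between the two sets. In this example there is only one match: 13 T in cpos & cal and 13 T in mpos & mal.

I wrote this script, but it seems to compare only the values at the same index, because the match list comes out empty: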

mat = []
for i in xrange(len(cpos)):
    if mpos[i] == cpos[i] and mal[i] == cal[i]:
        mat.append(cpos[i])

This is what I want:

mat = [13]

Any ideas on how to solve this?

3 answers:

Answer 0: (score: 7)

cpos = [12, 13, 14, 15]
cal = ['A', 'T', 'C', 'G']

mpos = [11, 12, 13, 16]
mal = ['A', 'T', 'T', 'G']

set1 = set(zip(cpos, cal))
set2 = set(zip(mpos, mal))

print set1 & set2

Result:

set([(13, 'T')])

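Following @Janne Karila's comment, the following is more efficient, since it avoids building the intermediate lists and the second set: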

from itertools import izip
print set(izip(cpos, cal)).intersection(izip(mpos, mal))
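
For readers on Python 3, the same idea can be sketched roughly as follows. This is an adaptation, not part of the original answer: it assumes Python 3, where itertools.izip no longer exists because the built-in zip is already lazy, and where print is a function. The name common is illustrative, and the last step (reducing the pairs to a sorted list of positions) is an assumption about the output shape the question asks for.

```python
cpos = [12, 13, 14, 15]
cal = ['A', 'T', 'C', 'G']

mpos = [11, 12, 13, 16]
mal = ['A', 'T', 'T', 'G']

# Pair each position with its letter, then intersect the two collections of pairs.
# intersection() accepts any iterable, so the second zip is never turned into a set.
common = set(zip(cpos, cal)).intersection(zip(mpos, mal))

# Keep only the positions, sorted, to get the list form shown in the question.
mat = sorted(pos for pos, _ in common)

print(common)  # {(13, 'T')}
print(mat)     # [13]
```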

Timings:

import timeit

repeat = 1

setup = '''
num = 1000000
import random
import string
from itertools import izip
cpos = [random.randint(1, 100) for x in range(num)]
cal = [random.choice(string.letters) for x in range(num)]
mpos = [random.randint(1, 100) for x in range(num)]
mal = [random.choice(string.letters) for x in range(num)]
'''

# izip: 0.38 seconds (Python 2.7.2)
t = timeit.Timer(
     setup = setup,
     stmt = '''set(izip(cpos, cal)).intersection(izip(mpos, mal))'''
)

print "%.2f seconds" % (t.timeit(number=repeat))



# zip: 0.53 seconds (Python 2.7.2)
t = timeit.Timer(
     setup = setup,
     stmt = '''set(zip(cpos, cal)) & set(zip(mpos, mal))'''
)

print "%.2f seconds" % (t.timeit(number=repeat))


# Nested loop: 616 seconds (Python 2.7.2)
t = timeit.Timer(
     setup = setup,
     stmt = '''

mat = []
for i in xrange(len(cpos)):
     for j in xrange(len(mpos)):
          if mpos[j] == cpos[i] and mal[j] == cal[i]:
               mat.append(mpos[j]) # or mat.append((mpos[j], mal[j])) ?
               break
'''
)

print "%.2f seconds" % (t.timeit(number=repeat))
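
The benchmark above is written for Python 2.7. A rough Python 3 sketch of the two fast variants might look like the following. The assumptions are mine, not the answer's: string.ascii_letters stands in for string.letters, the built-in zip stands in for izip, num is reduced to keep the run short, and the names statements and timings are illustrative. The nested loop is left out because of its running time, and no timings are quoted because they depend on the machine and interpreter.

```python
import timeit

# Setup code run once before each timed statement (Python 3 assumed).
setup = '''
import random
import string
num = 100000  # smaller than the original 1000000 to keep the run short
cpos = [random.randint(1, 100) for _ in range(num)]
cal = [random.choice(string.ascii_letters) for _ in range(num)]
mpos = [random.randint(1, 100) for _ in range(num)]
mal = [random.choice(string.ascii_letters) for _ in range(num)]
'''

# The two set-based variants from the answer, with zip in place of izip.
statements = {
    "intersection": "set(zip(cpos, cal)).intersection(zip(mpos, mal))",
    "two sets": "set(zip(cpos, cal)) & set(zip(mpos, mal))",
}

timings = {}
for label, stmt in statements.items():
    # number=1 mirrors repeat = 1 in the original benchmark.
    timings[label] = timeit.timeit(stmt=stmt, setup=setup, number=1)
    print("%s: %.4f seconds" % (label, timings[label]))
```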

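Answer 1: (score: 0)

You are currently comparing by index only, i.e. only at the same position i in all four lists. But 13 T is at index 1 in cpos & cal, whereas it is at index 2 in mpos & mal. That means your if condition is never true and mat stays empty.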
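
The index mismatch described above can be checked with a small sketch. It assumes Python 3, and the names set1_pairs, set2_pairs, and same_index are illustrative rather than taken from the question.

```python
cpos = [12, 13, 14, 15]
cal = ['A', 'T', 'C', 'G']
mpos = [11, 12, 13, 16]
mal = ['A', 'T', 'T', 'G']

# Materialize the (position, letter) pairs so they can be indexed.
set1_pairs = list(zip(cpos, cal))
set2_pairs = list(zip(mpos, mal))

# Where does the pair (13, 'T') sit in each set?
idx_in_set1 = set1_pairs.index((13, 'T'))
idx_in_set2 = set2_pairs.index((13, 'T'))
print(idx_in_set1, idx_in_set2)  # 1 2

# Comparing index by index, as the original loop does, finds nothing.
same_index = [a for a, b in zip(set1_pairs, set2_pairs) if a == b]
print(same_index)  # []
```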

Answer 2: (score: 0)

You could add a second loop to your example:

cpos = [12, 13, 14, 15]
cal = ['A', 'T', 'C', 'G']

mpos = [11, 12, 13, 16]
mal = ['A', 'T', 'T', 'G']

mat = []
for i in xrange(len(cpos)):
     for j in xrange(len(mpos)):
          if mpos[j] == cpos[i] and mal[j] == cal[i]:
               mat.append(mpos[j]) # or mat.append((mpos[j], mal[j])) ?

print mat # [13]

...although this is very inefficient, as the timings in thg435's answer show.
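
One possible middle ground, sketched here as an assumption rather than something proposed in the answers, keeps the single loop over cpos and cal but replaces the inner loop with a set lookup. It assumes Python 3, and the name wanted is illustrative. Unlike the nested loop above, it appends each entry of cpos at most once even if the same pair occurs several times in mpos & mal.

```python
cpos = [12, 13, 14, 15]
cal = ['A', 'T', 'C', 'G']

mpos = [11, 12, 13, 16]
mal = ['A', 'T', 'T', 'G']

# Build the lookup set once; membership tests on a set take constant time on average.
wanted = set(zip(mpos, mal))

# Walk the first pair of lists in order and keep positions whose pair is in the set.
mat = [pos for pos, letter in zip(cpos, cal) if (pos, letter) in wanted]

print(mat)  # [13]
```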