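Finding elements in two lists that contain the same substring

Time: 2013-03-15 18:33:06

Tags: python list-comprehension string-matching

I have two lists, possibly of different lengths. Each list contains filenames as strings. I have no control over the names, but I am sure the naming structure will not change. It always looks like name1_name2_number1_+(or -)number2.jpg

number1 is the substring I want to match between the two lists. If a filename in one list contains the same number1 as a filename in the other list, I want to append both filenames to a third list. I have a simple function that gets number1 from each filename in a given list, for example: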

>>> list1 = ['serentity01_20malcolm_200_+3.jpg','inara03_kaley40_8000_-1.jpg']
>>> def GetNum(imgStrings):
...     ss = []
...     for b in imgStrings:
...         ss.append([w for w in b.split('_') if w.isdigit()])
...     #flatten zee list of lists because it is ugly.
...     return [val for subl in ss for val in subl]
>>> GetNum(list1)
['200', '8000']

So, for

>>> list1 = ['serentity01_20malcolm_200_+3.jpg','inara03_kaley40_8000_-1.jpg']
>>> list2 = ['inara03_summer40_8000_-2.jpg', 'book23_42jayne_400_+2.jpg', 'summer53_21simon_300_-1.jpg']
>>> awesomesauceSubstringMatcher(list1, list2)
['inara03_kaley40_8000_-1.jpg', 'inara03_summer40_8000_-2.jpg']

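I feel like I should be able to do this with my GetNum function and some list comprehension, but the trickery in the whole '[etc...)' syntax is new to me and I can't quite get the hang of it. Thoughts? Suggestions? Death threats? Thanks for all helpful replies, and a thousand apologies if my googlefu failed me while trying to find similar questions/answers.

Edit: I just came up with this solution: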

[s for s in list1+list2 if any(subs in s for subs in GetNum(list1)) and any(subs in s for subs in GetNum(list2))]

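I know it is long and ugly, but I really wanted to prove to myself that it could be done with a list comprehension. Thanks for all the helpful replies!

6 answers:

Answer 0: (score: 1)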

list1 = ['serentity01_20malcolm_200_+3.jpg','inara03_kaley40_8000_-1.jpg']
list2 = ['inara03_summer40_8000_-2.jpg', 'book23_42jayne_400_+2.jpg', 'summer53_21simon_300_-1.jpg']

def getNum(image_name_list):
    # Yield number1 (the third underscore-separated field) for each name,
    # or None when that field is not purely digits.
    for s in image_name_list:
        s = s.split('_')[2]
        if s.isdigit():
            yield s
        else:
            yield None

def getMatchingIndex(list1, list2):
    # Materialize list2's numbers once; they are scanned for every item of list1.
    other_list = list(getNum(list2))
    for (i, num) in enumerate(getNum(list1)):
        if num is None:
            continue
        for (j, other_num) in enumerate(other_list):
            if num == other_num:
                yield (i, j)

for i1, i2 in getMatchingIndex(list1, list2):
    print(list1[i1], list2[i2])

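Since we only need to compare one item at a time against every item in the second list, I used a generator in getNum to save memory. Since a number may match more than once, I keep checking every item.

Answer 1: (score: 0)

Untested, but the logic should be correct: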

list1 = ['serentity01_20malcolm_200_+3.jpg','inara03_kaley40_8000_-1.jpg']
list2 = ['inara03_summer40_8000_-2.jpg', 'book23_42jayne_400_+2.jpg', 'summer53_21simon_300_-1.jpg']
list3 = []

seenInList1Dict = {}

for element in list1:
    splitelem = element.split('_')
    seenInList1Dict[splitelem[2]] = 1

for element in list2:
    splitelem = element.split('_')
    if splitelem[2] in seenInList1Dict:
        list3.append(element)

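I did not use your GetNum because IMO it makes things unnecessarily complicated. I find it easier to dump things into a dictionary when you want to quickly look them up or check for their existence later. Also, if you need the number, you only have to split the filename and take the value at the appropriate index.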
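Note that the loop above appends only the matching names from list2. A minimal sketch of a variant that collects the matching names from both lists, assuming number1 is always the third underscore-separated field; the helper names `number1` and `match_both` are made up for illustration:

```python
def number1(filename):
    # Assumes the layout name1_name2_number1_(+|-)number2.jpg,
    # so number1 is the third underscore-separated field.
    return filename.split('_')[2]

def match_both(list1, list2):
    # Numbers that occur in both lists.
    common = {number1(f) for f in list1} & {number1(f) for f in list2}
    # Keep the original order: matches from list1 first, then list2.
    return [f for f in list1 + list2 if number1(f) in common]

list1 = ['serentity01_20malcolm_200_+3.jpg', 'inara03_kaley40_8000_-1.jpg']
list2 = ['inara03_summer40_8000_-2.jpg', 'book23_42jayne_400_+2.jpg',
         'summer53_21simon_300_-1.jpg']

list3 = match_both(list1, list2)
print(list3)
# ['inara03_kaley40_8000_-1.jpg', 'inara03_summer40_8000_-2.jpg']
```

Comparing the whole field, rather than testing `'200' in name`, also avoids a false match between 200 and 2000.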

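Answer 2: (score: 0)

I would build a dictionary for each of the two lists, where the key is the number in the filename and the value is the filename itself. Then "intersect" the two sets of keys; the resulting common keys can be used to build the third list, for example: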

def List2Dic(List):
    return dict(map(lambda x: [ x.split("_")[2], x], List))

list1 = ['serentity01_20malcolm_200_+3.jpg','inara03_kaley40_8000_-1.jpg']
list2 = ['inara03_summer40_8000_-2.jpg', 'book23_42jayne_400_+2.jpg', 'summer53_21simon_300_-1.jpg']

d1 = List2Dic(list1)
d2 = List2Dic(list2)

for x in set(d1) & set(d2):
    print(d1[x], d2[x])
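One thing to watch with a plain dict: if two filenames in the same list share a number, the later one silently overwrites the earlier one. A sketch that keeps all of them, using `collections.defaultdict`; `group_by_number` is a hypothetical helper name, and the `river02_...` entry is made up purely to show a duplicate number:

```python
from collections import defaultdict

def group_by_number(filenames):
    # Map number1 (third underscore-separated field) to every filename that has it.
    groups = defaultdict(list)
    for name in filenames:
        groups[name.split('_')[2]].append(name)
    return groups

list1 = ['serentity01_20malcolm_200_+3.jpg', 'inara03_kaley40_8000_-1.jpg',
         'river02_kaley41_8000_+2.jpg']  # made-up duplicate of 8000
list2 = ['inara03_summer40_8000_-2.jpg', 'book23_42jayne_400_+2.jpg',
         'summer53_21simon_300_-1.jpg']

g1 = group_by_number(list1)
g2 = group_by_number(list2)

# Dict key views support set intersection directly in Python 3.
matches = {num: g1[num] + g2[num] for num in g1.keys() & g2.keys()}
print(matches)
```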

Answer 3: (score: 0)

Parse the strings into data you can actually filter on. Things will go much better.

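import os

def process(filename):
    # splitext drops the extension; rstrip('.jpg') would instead strip any
    # trailing '.', 'j', 'p' or 'g' characters, not the suffix as a whole.
    splitup = os.path.splitext(filename)[0].split('_')
    keys = ["name1", "name2", "number1", "number2"]
    r = dict(zip(keys, splitup))
    r["filename"] = filename
    return r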

list1 = ['serentity01_20malcolm_200_+3.jpg','inara03_kaley40_8000_-1.jpg']
list2 = ['inara03_summer40_8000_-2.jpg', 'book23_42jayne_400_+2.jpg', 'summer53_21simon_300_-1.jpg']

plist1 = [process(f) for f in list1]
plist2 = [process(f) for f in list2]

nlist1 = [i['number1'] for i in plist1]
nlist2 = [i['number1'] for i in plist2]

ilist1 = [i for i in plist1 if i['number1'] in nlist2]
ilist2 = [i for i in plist2 if i['number1'] in nlist1]

intersection = set([i["filename"] for i in ilist1 + ilist2])

for i in intersection:
    print(i)

Edit: shoot, I see now that you wanted the intersection of the two lists.
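If the naming scheme really is fixed, the parsing step could also be done with a regular expression, which has the side benefit of rejecting names that do not fit. This is only a sketch: the pattern assumes name1 and name2 contain no underscores and that number2 always carries a sign, and `FILENAME_RE` and `parse` are illustrative names.

```python
import re

# Assumed layout: name1_name2_number1_(+|-)number2.jpg
FILENAME_RE = re.compile(
    r'^(?P<name1>[^_]+)_(?P<name2>[^_]+)_(?P<number1>\d+)_(?P<number2>[+-]\d+)\.jpg$'
)

def parse(filename):
    # Return a dict of the named fields, or None if the name does not fit.
    m = FILENAME_RE.match(filename)
    if m is None:
        return None
    record = m.groupdict()
    record['filename'] = filename
    return record

print(parse('inara03_kaley40_8000_-1.jpg'))
print(parse('not_a_match.jpg'))  # None
```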

Answer 4: (score: 0)

My take on a solution using map, filter, and list flattening with sum:

l = ['a_b_1_2', 'b_c_2_3']
s = ['c_d_3_4', 'd_e_1_4']
# Pair every split name from l with each split name from s whose third field matches ('' otherwise).
a = list(map(lambda y: list(map(lambda z: [y, z] if y[2] == z[2] else '', map(lambda v: v.split('_'), s))), map(lambda x: x.split('_'), l)))

# Flatten, drop the '' placeholders, flatten the pairs, then re-join the fields.
list(map(lambda x: '_'.join(x), sum(filter(lambda qq: qq != '', sum(a, [])), [])))

Shown on the actual data set:

>>> list1 = ['serentity01_20malcolm_200_+3.jpg','inara03_kaley40_8000_-1.jpg']
>>> list2 = ['inara03_summer40_8000_-2.jpg', 'book23_42jayne_400_+2.jpg', 'summer53_21simon_300_-1.jpg']

>>> a = list(map(lambda y: list(map(lambda z: [y, z] if y[2] == z[2] else '', map(lambda v: v.split('_'), list2))), map(lambda x: x.split('_'), list1)))

>>> a

    [['', '', ''], [[['inara03', 'kaley40', '8000', '-1.jpg'], ['inara03', 'summer40', '8000', '-2.jpg']], '', '']]


>>> sum(filter(lambda qq: qq != '', sum(a, [])), [])

    [['inara03', 'kaley40', '8000', '-1.jpg'], ['inara03', 'summer40', '8000', '-2.jpg']]

>>> list(map(lambda x: '_'.join(x), sum(filter(lambda qq: qq != '', sum(a, [])), [])))

    ['inara03_kaley40_8000_-1.jpg', 'inara03_summer40_8000_-2.jpg'] #This is the output you want.
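For comparison, the same pairing can be written as list comprehensions, which is what the question asked about. A rough equivalent, assuming number1 is the third underscore-separated field; `pairs` and `flat` are illustrative names:

```python
list1 = ['serentity01_20malcolm_200_+3.jpg', 'inara03_kaley40_8000_-1.jpg']
list2 = ['inara03_summer40_8000_-2.jpg', 'book23_42jayne_400_+2.jpg',
         'summer53_21simon_300_-1.jpg']

# Every (list1 name, list2 name) pair whose third field is equal.
pairs = [(a, b) for a in list1 for b in list2
         if a.split('_')[2] == b.split('_')[2]]

# Flatten the pairs into a single list of filenames.
flat = [name for pair in pairs for name in pair]
print(flat)
# ['inara03_kaley40_8000_-1.jpg', 'inara03_summer40_8000_-2.jpg']
```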

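Answer 5: (score: 0)

This returns a list of all matching values from both lists. For example, if there are matches for the numbers 8000 and 300, it returns a list of lists with one sublist per possible number, and fills only the sublists that have matches.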

list1 = ['serentity01_20malcolm_200_+3.jpg','inara03_kaley40_8000_-1.jpg',
         'inara03_34simon_300_+1.jpg']
list2 = ['inara03_summer40_8000_-2.jpg', 'book23_42jayne_400_+2.jpg',
         'summer53_21simon_300_-1.jpg']

def GetNum(imgStrings):
    ss = []
    for b in imgStrings:
        ss.append([w for w in b.split('_') if w.isdigit()])
    #flatten zee list of lists because it is ugly.
    return [val for subl in ss for val in subl]


print(GetNum(list1))



def addToThird(input1, input2):

    numlist1 = GetNum(input1)
    numlist2 = GetNum(input2)

    # Sort the distinct numbers so the order of the groups is deterministic.
    numgroups = sorted(set(numlist1 + numlist2))
    collectionsList = []

    for i in numgroups:
        collectionsList.append([])

    for item1 in numlist1:
        for item2 in numlist2:
            if item1 == item2:
                print(item1, item2)
                goindex = numgroups.index(item1)
                collectionsList[goindex].append(input1[numlist1.index(item1)])
                # The second filename comes from input2, not input1.
                collectionsList[goindex].append(input2[numlist2.index(item2)])
    return collectionsList


print(addToThird(list1, list2))

Output:

['200', '8000', '300']
8000 8000
300 300
[[], ['inara03_34simon_300_+1.jpg', 'summer53_21simon_300_-1.jpg'], [], ['inara03_kaley40_8000_-1.jpg', 'inara03_summer40_8000_-2.jpg']]