Filtering a Python list to find common elements

Time: 2016-01-29 16:39:04

Tags: python list tuples grouping

I am trying to find an efficient way to iterate over the elements of a list and group the common elements into another list, grouplist.

Example:

In[]: grouplist = []

In[]: filelist
Out[]:['C:\\West-California-North-10.xlsx', 
        'C:\\West-California-North-5.xlsx', 
        'C:\\West-California-East-1.xlsx', 
        'C:\\West-California-South-1.xlsx',
        'C:\\South-California-North-5.xlsx',
        'C:\\West-California-South-3.xlsx']

I want to find groups of names that share a common pattern and differ only in the integer. So in this case,

First iteration, grouplist =

 C:\\West-California-North-10.xlsx
 C:\\West-California-North-5.xlsx

Second iteration =

 C:\\West-California-East-1.xlsx

Third iteration =

 C:\\West-California-South-1.xlsx
 C:\\West-California-South-3.xlsx

5 answers:

Answer 0 (score: 2)

itertools.groupby is your friend:

from itertools import groupby


filelist = [
    'C:\\West-California-North-10.xlsx', 
    'C:\\West-California-North-5.xlsx', 
    'C:\\West-California-East-1.xlsx', 
    'C:\\West-California-South-1.xlsx',
    'C:\\South-California-North-5.xlsx',
    'C:\\West-California-South-3.xlsx']

key_fn = lambda s: s.rsplit('-',1)[0]

# before grouping, list has to be sorted
filelist = sorted(filelist, key=key_fn)

# usually use the same key_fn for grouping as was used for sorting
for key, grouped_file_names in groupby(filelist, key=key_fn):
    # groupby returns an iterator of tuples
    # the first element of the tuple is the grouped key value
    # the second element is a generator of the items that matched that key
    # (YOU MUST CONSUME THIS GENERATOR BEFORE MOVING ON TO THE NEXT KEY)
    print('\n'.join(grouped_file_names) + '\n')

Prints:

C:\South-California-North-5.xlsx

C:\West-California-East-1.xlsx

C:\West-California-North-10.xlsx
C:\West-California-North-5.xlsx

C:\West-California-South-1.xlsx
C:\West-California-South-3.xlsx
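
To collect each group into a list instead of printing it, which is closer to the grouplist the question asks for, something like the following might work. This is only a sketch in Python 3 syntax, and group_by_prefix is a made-up helper name, not part of itertools:

```python
from itertools import groupby


def group_by_prefix(paths):
    """Group paths that share everything before the last '-'.

    group_by_prefix is a hypothetical helper name used for illustration.
    """
    def key_fn(path):
        return path.rsplit('-', 1)[0]

    # groupby only merges adjacent items, so sort by the same key first
    ordered = sorted(paths, key=key_fn)
    # turn each group into a list right away; the group iterator is no
    # longer usable once groupby advances to the next key
    return [list(group) for _, group in groupby(ordered, key=key_fn)]


filelist = [
    'C:\\West-California-North-10.xlsx',
    'C:\\West-California-North-5.xlsx',
    'C:\\West-California-East-1.xlsx',
    'C:\\West-California-South-1.xlsx',
    'C:\\South-California-North-5.xlsx',
    'C:\\West-California-South-3.xlsx',
]

grouplist = group_by_prefix(filelist)
for group in grouplist:
    print(group)
```

Because sorted is stable, paths with the same prefix keep their original relative order inside each group.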

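Answer 1 (score: 1)

You can use a dictionary to categorize the paths by location name.

To separate the location name from the trailing ID you can use str.rsplit(), then call dict.setdefault() with a set() object as the default, so that each group keeps only unique names: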

>>> lst=['C:\\West-California-North-10.xlsx', 'C:\\West-California-North-5.xlsx','C:\\West-California-East-1.xlsx','C:\\West-California-South-1.xlsx','C:\\South-California-North-5.xlsx','C:\\West-California-South-3.xlsx']

>>> d = {}
>>> new = [path.rsplit('-',1) for path in lst]

>>> for i,j in new:
...     d.setdefault(i,set()).add(i+'-'+j)
... 

>>> d.values()
[set(['C:\\West-California-East-1.xlsx']),
 set(['C:\\West-California-North-10.xlsx','C:\\West-California-North-5.xlsx']), 
 set(['C:\\South-California-North-5.xlsx']),
 set(['C:\\West-California-South-1.xlsx', 'C:\\West-California-South-3.xlsx'])]
>>> 
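
To see why the setdefault one-liner works, here is a tiny standalone sketch (Python 3 syntax; the key 'prefix' and the variable names are illustrative only). setdefault stores the default only when the key is missing and always returns the value held in the dictionary, so .add() always lands in the stored set:

```python
d = {}

# key missing: the new set is stored under 'prefix' and returned
first = d.setdefault('prefix', set())
first.add('prefix-1')

# key present: the stored set is returned and the fresh set() is discarded
second = d.setdefault('prefix', set())
second.add('prefix-2')
second.add('prefix-1')  # duplicates are ignored by the set

same_object = first is second
print(same_object, sorted(d['prefix']))
```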

Answer 2 (score: 1)

Using defaultdict:

from collections import defaultdict
d = defaultdict(set)

# l is the list of paths (filelist in the question)
for fle in l:
    k, rest = fle.rsplit("-", 1)
    d[k].add("{}-{}".format(k, rest))

for k,v in d.items():
    print("\n".join(v) + "\n")

Output:

C:\West-California-East-1.xlsx

C:\West-California-North-10.xlsx
C:\West-California-North-5.xlsx

C:\South-California-North-5.xlsx

C:\West-California-South-1.xlsx
C:\West-California-South-3.xlsx

If you want to keep the order in which the elements were first seen, use OrderedDict:

from collections import OrderedDict
d = OrderedDict()

# l is the list of paths (filelist in the question)
for fle in l:
    k, rest = fle.rsplit("-", 1)
    d.setdefault(k,set()).add("{}-{}".format(k, rest))

for k,v in d.items():
    print("\n".join(v) + "\n")

Output:

C:\West-California-North-10.xlsx
C:\West-California-North-5.xlsx

C:\West-California-East-1.xlsx

C:\West-California-South-1.xlsx
C:\West-California-South-3.xlsx

C:\South-California-North-5.xlsx
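
On Python 3.7 and later the built-in dict preserves insertion order, and defaultdict is a dict subclass, so first-seen order is also available without OrderedDict. A minimal sketch under that assumption (the names paths and groups are illustrative; a list is used instead of a set so the order inside each group is kept as well):

```python
from collections import defaultdict

paths = [
    'C:\\West-California-North-10.xlsx',
    'C:\\West-California-North-5.xlsx',
    'C:\\West-California-East-1.xlsx',
    'C:\\West-California-South-1.xlsx',
    'C:\\South-California-North-5.xlsx',
    'C:\\West-California-South-3.xlsx',
]

groups = defaultdict(list)
for path in paths:
    prefix, _ = path.rsplit('-', 1)
    groups[prefix].append(path)

# keys come back in the order each prefix was first seen (Python 3.7+)
first_seen_order = list(groups)
for prefix in first_seen_order:
    print('\n'.join(groups[prefix]) + '\n')
```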

If the names contain no digits other than the trailing number, you can also use str.translate instead of splitting:

from collections import defaultdict
d = defaultdict(set)

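# l is the list of paths (filelist in the question)
for fle in l:
    # two-argument str.translate(None, chars) is the Python 2 signature
    d[fle.translate(None,"0123456789")].add(fle)

for k,v in d.items():
    print("\n".join(v) + "\n")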

Output:

C:\West-California-East-1.xlsx

C:\West-California-South-1.xlsx
C:\West-California-South-3.xlsx

C:\South-California-North-5.xlsx

C:\West-California-North-10.xlsx
C:\West-California-North-5.xlsx
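
In Python 3, str.translate takes a single table argument, so the digit-stripping key needs a table built with str.maketrans. A rough equivalent might look like this (strip_digits, paths and groups are illustrative names); as above, it assumes digits appear only in the trailing number, since every digit in the path is removed:

```python
from collections import defaultdict

# the third argument of str.maketrans lists characters to delete
strip_digits = str.maketrans('', '', '0123456789')

paths = [
    'C:\\West-California-North-10.xlsx',
    'C:\\West-California-North-5.xlsx',
    'C:\\West-California-East-1.xlsx',
    'C:\\West-California-South-1.xlsx',
    'C:\\South-California-North-5.xlsx',
    'C:\\West-California-South-3.xlsx',
]

groups = defaultdict(set)
for path in paths:
    # e.g. 'C:\\West-California-North-10.xlsx' -> 'C:\\West-California-North-.xlsx'
    groups[path.translate(strip_digits)].add(path)

for members in groups.values():
    print('\n'.join(sorted(members)) + '\n')
```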

Answer 3 (score: 1)

How about using sorted with a regex? You can modify this and get more control over the ordering; just change the sorter function.

import re

d = ['C:\\West-California-North-10.xlsx', 
        'C:\\West-California-North-5.xlsx', 
        'C:\\West-California-East-1.xlsx', 
        'C:\\West-California-South-3.xlsx',
        'C:\\West-California-South-1.xlsx',
        'C:\\South-California-North-5.xlsx',
        'C:\\West-California-South-3.xlsx']

def sorter(s):
    direction1 = re.findall(r'(\w+)-California-',s)[0]  # first direction: West/South
    direction2 = re.findall(r'California-(\w+)',s)[0]  # second direction: North/East/South
    num = int(re.findall(r'California-\w+-(\w+)',s)[0])  # 10 or 5 or 1 or 3
    return direction1,direction2,num
dd =  sorted(d,key=sorter)

for t in dd:
    print(t)

Output -

C:\South-California-North-5.xlsx
C:\West-California-East-1.xlsx
C:\West-California-North-5.xlsx
C:\West-California-North-10.xlsx
C:\West-California-South-1.xlsx
C:\West-California-South-3.xlsx
C:\West-California-South-3.xlsx
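
The three findall calls can also be folded into one compiled pattern. The sketch below is one possible variant, not the answer's own code: PATTERN and sort_key are hypothetical names, Python 3 syntax is used, and every path is assumed to look like <direction>-California-<direction>-<number>.xlsx:

```python
import re

# assumed file name layout: <direction>-California-<direction>-<number>.xlsx
PATTERN = re.compile(r'(\w+)-California-(\w+)-(\d+)\.xlsx$')


def sort_key(path):
    match = PATTERN.search(path)
    if match is None:
        raise ValueError('unexpected file name: {!r}'.format(path))
    direction1, direction2, num = match.groups()
    # convert to int so that 5 sorts before 10
    return direction1, direction2, int(num)


paths = [
    'C:\\West-California-North-10.xlsx',
    'C:\\West-California-North-5.xlsx',
    'C:\\West-California-East-1.xlsx',
    'C:\\West-California-South-3.xlsx',
    'C:\\West-California-South-1.xlsx',
    'C:\\South-California-North-5.xlsx',
    'C:\\West-California-South-3.xlsx',
]

ordered = sorted(paths, key=sort_key)
for path in ordered:
    print(path)
```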

An example of customizing the sorter function -

If you change the sorter function as follows, i.e. drop the sorting by number -

def sorter(s):
    direction1 = re.findall(r'(\w+)-California-',s)[0]  # first direction: West/South
    direction2 = re.findall(r'California-(\w+)',s)[0]  # second direction: North/East/South
    num = int(re.findall(r'California-\w+-(\w+)',s)[0])  # 10 or 5 or 1 or 3
    return direction1,direction2# omitted num here

Then the output is -

C:\South-California-North-5.xlsx
C:\West-California-East-1.xlsx
C:\West-California-North-10.xlsx
C:\West-California-North-5.xlsx
C:\West-California-South-3.xlsx
C:\West-California-South-1.xlsx
C:\West-California-South-3.xlsx

Working proof

After all that, you can iterate over the groups as follows -

import re
from collections import defaultdict,OrderedDict

d = ['C:\\West-California-North-10.xlsx', 
        'C:\\West-California-North-5.xlsx', 
        'C:\\West-California-East-1.xlsx', 
        'C:\\West-California-South-3.xlsx',
        'C:\\West-California-South-1.xlsx',
        'C:\\South-California-North-5.xlsx',
        'C:\\West-California-South-3.xlsx']

group_data = defaultdict(list)

def sorter(s):
    direction1 = re.findall(r'(\w+)-California-',s)[0]  # first direction: West/South
    direction2 = re.findall(r'California-(\w+)',s)[0]  # second direction: North/East/South
    num = int(re.findall(r'California-\w+-(\w+)',s)[0])  # 10 or 5 or 1 or 3
    return direction1,direction2,num
dd =  sorted(d,key=sorter)

for t in dd:
    # everything before the first digit is the group key
    key = re.findall(r'([^\d]+)\d',t)[0]
    group_data[key].append(t)

dt = OrderedDict(sorted(group_data.items(),key=lambda x: x[0]))
for it in dt:
    print('\n'.join(dt[it])+'\n')

Output -

C:\South-California-North-5.xlsx

C:\West-California-East-1.xlsx

C:\West-California-North-5.xlsx
C:\West-California-North-10.xlsx

C:\West-California-South-1.xlsx
C:\West-California-South-3.xlsx
C:\West-California-South-3.xlsx

Answer 4 (score: 1)

Here is another approach using a regular expression and itertools.groupby:

import re
from itertools import groupby

filelist =  ['C:\\West-California-North-10.xlsx', 
            'C:\\West-California-North-5.xlsx', 
            'C:\\West-California-East-1.xlsx', 
            'C:\\West-California-South-1.xlsx',
            'C:\\South-California-North-5.xlsx',
            'C:\\West-California-South-3.xlsx']

keyfunc = lambda x: re.match(r'(.*)-\d+\.xlsx', x).group(1)
keys = [ keyfunc(f) for f in filelist]
grouplist = [list(v) for k,v in groupby(sorted(filelist), key = keyfunc)][::-1]
for group in grouplist: print('{}\n'.format(group))

Output:

['C:\\West-California-South-1.xlsx', 'C:\\West-California-South-3.xlsx'] 

['C:\\West-California-North-10.xlsx', 'C:\\West-California-North-5.xlsx'] 

['C:\\West-California-East-1.xlsx'] 

['C:\\South-California-North-5.xlsx']
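
One thing to watch with a regex key: re.match returns None for a name that does not fit the pattern, and calling .group(1) on it raises AttributeError. A possible guard, sketched in Python 3 syntax with hypothetical names (PATTERN, prefix_or_self) and a made-up non-matching path, is to fall back to the whole path so such a file ends up in a group of its own:

```python
import re
from itertools import groupby

PATTERN = re.compile(r'(.*)-\d+\.xlsx$')


def prefix_or_self(path):
    match = PATTERN.match(path)
    # no match: use the full path as the key so the file forms its own group
    return match.group(1) if match else path


paths = [
    'C:\\West-California-North-10.xlsx',
    'C:\\West-California-North-5.xlsx',
    'C:\\notes.txt',  # hypothetical file that does not follow the pattern
]

ordered = sorted(paths, key=prefix_or_self)
groups = [list(g) for _, g in groupby(ordered, key=prefix_or_self)]
for group in groups:
    print(group)
```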