How to split a list based on a regular expression and sort it in Python

Time: 2015-10-03 12:23:11

Tags: python list split

I have a list containing file paths like this:

my_paths = ['/home/mark/results/chilo/15381_chilo_140618_099_X/15381_chilo.csv','/home/mark/results/chilo/15382_chilo_140610_099_X/15382_chilo.csv','/home/mark/results/chilo/15383_chilo_140616_099_X/15383_chilo.csv','/home/mark/results/chilo/15384_chilo_140620_099_X/15384_chilo.csv']

I would like to sort the list by the date in the second-level directory name, for example 140616 in 15383_chilo_140616_099_X. So the output should be:

['/home/mark/results/chilo/15382_chilo_140610_099_X/15382_chilo.csv', '/home/mark/results/chilo/15383_chilo_140616_099_X/15383_chilo.csv', '/home/mark/results/chilo/15381_chilo_140618_099_X/15381_chilo.csv', '/home/mark/results/chilo/15384_chilo_140620_099_X/15384_chilo.csv']

What is the best way to do this? I can't work out whether I should first loop over the paths and take the second level, like this:

import os

for my_path in my_paths:
    (SeqDir,seqFileName) = os.path.split(my_path)
    (SeqDir_remaining,second_level) = os.path.split(SeqDir)

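...and then split on the underscores, take the date, sort by it and pick the path for each date, or whether I should use a dictionary with the date as key and the path as value (but then sorting becomes a problem).

Any help is appreciated.

Thanks!

3 Answers:

Answer 0: (score: 3)

Split on the underscore three times and convert the third element to int. The path separator is irrelevant; you only need the digits between the second and third underscores: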

my_paths = ['/home/mark/results/chilo/15381_chilo_140618_099_X/15381_chilo.csv','/home/mark/results/chilo/15382_chilo_140610_099_X/15382_chilo.csv','/home/mark/results/chilo/15383_chilo_140616_099_X/15383_chilo.csv','/home/mark/results/chilo/15384_chilo_140620_099_X/15384_chilo.csv']

my_paths.sort(key=lambda x: int(x.split("_", 3)[2]))

Output:

['/home/mark/results/chilo/15382_chilo_140610_099_X/15382_chilo.csv', 
'/home/mark/results/chilo/15383_chilo_140616_099_X/15383_chilo.csv', 
'/home/mark/results/chilo/15381_chilo_140618_099_X/15381_chilo.csv', 
'/home/mark/results/chilo/15384_chilo_140620_099_X/15384_chilo.csv']

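If they are actually year/month/day dates, you don't need the int, because fixed-width digit strings compare in the same order as the numbers they represent.

Answer 1: (score: 1)

Write a function that extracts the thing to sort by: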
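As a rough illustration of that remark, here is a minimal sketch. It assumes the six digits are a zero-padded YYMMDD date, which the question does not confirm, and `date_key` is a hypothetical helper name introduced only for this example. Under that assumption the plain string key, the int key and a real date parsed with `datetime.strptime` should all give the same order.

```python
from datetime import datetime

my_paths = [
    '/home/mark/results/chilo/15381_chilo_140618_099_X/15381_chilo.csv',
    '/home/mark/results/chilo/15382_chilo_140610_099_X/15382_chilo.csv',
    '/home/mark/results/chilo/15383_chilo_140616_099_X/15383_chilo.csv',
    '/home/mark/results/chilo/15384_chilo_140620_099_X/15384_chilo.csv',
]

def date_key(path):
    # Hypothetical helper: the text between the 2nd and 3rd underscore.
    return path.split("_", 3)[2]

# Sort on the raw six-character string.
by_string = sorted(my_paths, key=date_key)

# Sort on the integer value, as in the answer above.
by_int = sorted(my_paths, key=lambda p: int(date_key(p)))

# Assumption: the field is YYMMDD; strptime raises ValueError if it is not.
by_date = sorted(my_paths, key=lambda p: datetime.strptime(date_key(p), "%y%m%d"))
```

The string key is the cheapest of the three; parsing into a date is only worth it if the dates are needed later or if malformed fields should fail loudly.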

def getdate(item):
    ...

Then

my_paths.sort(key=getdate)

Your getdate function probably needs to be better than this, but you get the idea:

>>> import pprint
>>> pprint.pprint(my_paths)
['/home/mark/results/chilo/15381_chilo_140618_099_X/15381_chilo.csv',
 '/home/mark/results/chilo/15382_chilo_140610_099_X/15382_chilo.csv',
 '/home/mark/results/chilo/15383_chilo_140616_099_X/15383_chilo.csv',
 '/home/mark/results/chilo/15384_chilo_140620_099_X/15384_chilo.csv']
>>> def getdate(item):
...     start = len('/home/mark/results/chilo/15381_chilo_')
...     end = start + 6
...     return item[start:end]
...
>>> getdate(my_paths[0])
'140618'
>>> my_paths.sort(key=getdate)
>>> pprint.pprint(my_paths)
['/home/mark/results/chilo/15382_chilo_140610_099_X/15382_chilo.csv',
 '/home/mark/results/chilo/15383_chilo_140616_099_X/15383_chilo.csv',
 '/home/mark/results/chilo/15381_chilo_140618_099_X/15381_chilo.csv',
 '/home/mark/results/chilo/15384_chilo_140620_099_X/15384_chilo.csv']
>>>
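One way the function might be made less dependent on a fixed prefix length is sketched below. It follows the idea from the question of taking the second-level directory with `os.path` and then splitting on underscores. The name `getdate_from_dir` is hypothetical, and the sketch assumes the date is always the third underscore-separated field of that directory name.

```python
import os

my_paths = [
    '/home/mark/results/chilo/15381_chilo_140618_099_X/15381_chilo.csv',
    '/home/mark/results/chilo/15382_chilo_140610_099_X/15382_chilo.csv',
    '/home/mark/results/chilo/15383_chilo_140616_099_X/15383_chilo.csv',
    '/home/mark/results/chilo/15384_chilo_140620_099_X/15384_chilo.csv',
]

def getdate_from_dir(path):
    # Name of the directory that holds the file,
    # e.g. '15381_chilo_140618_099_X'.
    second_level = os.path.basename(os.path.dirname(path))
    # Assumption: the date is the third underscore-separated field.
    return second_level.split("_")[2]

# sorted() returns a new list and leaves my_paths unchanged.
sorted_paths = sorted(my_paths, key=getdate_from_dir)
```

Unlike the slicing version, this does not break if the parent directories or the leading ID change length, though it still raises IndexError for a directory name with fewer than three fields.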

Answer 2: (score: -1)

import re

def sort_links(my_paths, pattern):
    # To sort by chilo_xxxxxx, pass pattern = r'(chilo_\d+)'
    # group(1) is the text captured by the first parenthesized group
    return sorted(my_paths, key=lambda x: re.search(pattern, x).group(1))

print(sort_links(my_paths, r'(chilo_\d+)'))

['/home/mark/results/chilo/15382_chilo_140610_099_X/15382_chilo.csv', '/home/mark/results/chilo/15383_chilo_140616_099_X/15383_chilo.csv', '/home/mark/results/chilo/15381_chilo_140618_099_X/15381_chilo.csv', '/home/mark/results/chilo/15384_chilo_140620_099_X/15384_chilo.csv']
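A possible variation on the regex approach is sketched below. It captures only the digits and does not fail when a path has no match, since `re.search` returns None in that case. The names `DATE_RE` and `regex_date_key` are hypothetical, and the pattern assumes the date is a run of exactly six digits enclosed by underscores.

```python
import re

my_paths = [
    '/home/mark/results/chilo/15381_chilo_140618_099_X/15381_chilo.csv',
    '/home/mark/results/chilo/15382_chilo_140610_099_X/15382_chilo.csv',
    '/home/mark/results/chilo/15383_chilo_140616_099_X/15383_chilo.csv',
    '/home/mark/results/chilo/15384_chilo_140620_099_X/15384_chilo.csv',
]

# Assumption: the date is exactly six digits between two underscores.
DATE_RE = re.compile(r'_(\d{6})_')

def regex_date_key(path):
    match = DATE_RE.search(path)
    # Paths without a date get an empty key and therefore sort first,
    # instead of raising AttributeError on None.
    return match.group(1) if match else ''

sorted_paths = sorted(my_paths, key=regex_date_key)
```

Whether undated paths should sort first, sort last or raise an error depends on the data; the empty-string fallback is only one choice.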