I have this list:
list1=["House of Mine (1293) Item 21",
"House of Mine (1292) Item 24",
"The yard (1000) Item 1 ",
"The yard (1000) Item 2 ",
"The yard (1000) Item 4 "]
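I want to add each of its items to a group (in this case, a list within the list) if the substring up to (XXXX) is the same.
So, in this case, I would expect:
[["House of Mine (1293) Item 21",
"House of Mine (1292) Item 24"],
["The yard (1000) Item 1 ",
"The yard (1000) Item 2 ",
"The yard (1000) Item 4 "]]
The following code is what I managed to write, but it does not work: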
def group(list1):
    group=[]
    for i, itemg in enumerate(list1):
        try:
            group[i]
        except Exception:
            group.append([])
        for itemj in group[i]:
            if re.findall(re.split("\(\d{4}\)\(", itemg)[0], itemj):
                group[i].append(itemg)
            else:
                group.append([])
                group[-1].append(itemg)
    return group
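Thanks to another thread on Stack Overflow, I found this page on regular expressions: http://www.diveintopython3.net/regular-expressions.html
I know the answer lies there, but I am having a hard time understanding some of its concepts.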
Answer 0 (score: 4)
Set up the list to group:
>>> list1=["House of Mine (1293) Item 21","House of Mine (1292) Item 24", "The yard (1000) Item 1 ", "The yard (1000) Item 2 ", "The yard (1000) Item 4 "]
Define a key function for sorting and grouping the items (this time using the number in parentheses):
>>> keyf = lambda text: text.split("(")[1].split(")")[0]
>>> keyf
<function __main__.<lambda>>
>>> keyf(list1[0])
'1293'
Sort the list (in place):
>>> list1.sort() #As Adam Smith noted, alphabetical sort is good enough
Get groupby from itertools:
>>> from itertools import groupby
Check the concept:
>>> for gr, items in groupby(list1, key = keyf):
...     print "gr", gr
...     print "items", list(items)
...
>>> list1
['The yard (1000) Item 1 ',
'The yard (1000) Item 2 ',
'The yard (1000) Item 4 ',
'House of Mine (1292) Item 24',
'House of Mine (1293) Item 21']
Note that we have to call list on items, because items is an iterator over the group's members.
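Two properties of groupby are easy to trip over here: each group is a lazy iterator, and only adjacent items with equal keys are grouped, which is why the list is sorted first. A minimal Python 3 sketch of the second point, where the sample data and the names `data` and `prefix` are made up for illustration and are not from the answer:

```python
from itertools import groupby

data = ["b (1) x", "a (2) y", "b (3) z"]  # hypothetical sample data


def prefix(text):
    # Key: everything before the first "("
    return text.split("(")[0]


# Without sorting, the two "b " items are not adjacent,
# so groupby puts them into separate groups.
unsorted_groups = [list(items) for _, items in groupby(data, key=prefix)]

# After sorting, equal keys are adjacent and end up in one group.
sorted_groups = [list(items) for _, items in groupby(sorted(data), key=prefix)]

print(unsorted_groups)
print(sorted_groups)
```

Sorting with the default string order is enough here only because the key is a prefix of each string; for other keys, passing the same key function to sorted would be the safer choice.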
Now use a list comprehension:
>>> res = [list(items) for gr, items in groupby(list1, key=keyf)]
>>> res
[['The yard (1000) Item 1 ',
'The yard (1000) Item 2 ',
'The yard (1000) Item 4 '],
['House of Mine (1292) Item 24'],
['House of Mine (1293) Item 21']]
We are done.
If you want to group by all the text before the first "(", the only change is:
>>> keyf = lambda text: text.split("(")[0]
>>> list1=["House of Mine (1293) Item 21","House of Mine (1292) Item 24", "The yard (1000) Item 1 ", "The yard (1000) Item 2 ", "The yard (1000) Item 4 "]
>>> keyf = lambda text: text.split("(")[0]
>>> [list(items) for gr, items in groupby(sorted(list1), key=keyf)]
[['House of Mine (1292) Item 24', 'House of Mine (1293) Item 21'],
['The yard (1000) Item 1 ',
'The yard (1000) Item 2 ',
'The yard (1000) Item 4 ']]
re.findall
The solution above assumes that "(" is the delimiter and ignores the requirement that four digits follow it. A task like this can be solved with re.
>>> import re
>>> keyf = lambda text: re.findall(r".+(?=\(\d{4}\))", text)[0]
>>> text = 'House of Mine (1293) Item 21'
>>> keyf(text)
'House of Mine '
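But if the text does not contain the expected pattern, findall returns an empty list, and accessing index 0 of it raises IndexError: list index out of range.
>>> text = "nothing here"
>>> keyf(text)
IndexError: list index out of range
We can survive this with a simple trick: append the original text to the result, so that there is always something at index 0: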
>>> keyf = lambda text: (re.findall(r".+(?=\(\d{4}\))", text) + [text])[0]
>>> text = "nothing here"
>>> keyf(text)
'nothing here'
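The same fallback can also be written without list concatenation. The sketch below is one possible Python 3 variant, not the answer's code: it uses re.match with a capture group instead of findall with a lookahead, and falls back to the whole string when no four-digit number in parentheses is present. The names `PATTERN` and `group_key` are made up for illustration.

```python
import re

# Greedy prefix followed by a four-digit number in parentheses.
PATTERN = re.compile(r"(.+)\(\d{4}\)")


def group_key(text):
    match = PATTERN.match(text)
    # No "(dddd)" in the text: use the whole string as its own key.
    return match.group(1) if match else text


print(repr(group_key("House of Mine (1293) Item 21")))
print(repr(group_key("nothing here")))
```

A named function like this can be passed as key= to both sorted and groupby in place of the lambda.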
Using re:
>>> import re
>>> from itertools import groupby
>>> keyf = lambda text: (re.findall(r".+(?=\(\d{4}\))", text) + [text])[0]
>>> [list(items) for gr, items in groupby(sorted(list1), key=keyf)]
[['House of Mine (1292) Item 24', 'House of Mine (1293) Item 21'],
['The yard (1000) Item 1 ',
'The yard (1000) Item 2 ',
'The yard (1000) Item 4 ']]
Answer 1 (score: 4)
I would use collections.defaultdict and re.findall with a lookahead to the parentheses.
import collections
import re
def groupitems(lst):
    groups = collections.defaultdict(list)
    for item in lst:
        try:
            head = re.findall(r".+(?=\(\d{4}\))", item)[0]
        except IndexError:  # there is no (\d{4})
            head = item  # so take the whole string
        groups[head].append(item)
    return groups.values()
    # if you ABSOLUTELY MUST return a list, cast it here like this:
    # return list( groups.values() )
    # however a dict_values object is list-like and should quack nicely.
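For reference, this is how the function might be called under Python 3, where dict.values() returns a view rather than a list. The sketch restates the function so that it runs on its own and wraps the result in list(); the names `sample` and `result`, and the extra "plain text" item, are illustrative additions.

```python
import collections
import re


def groupitems(lst):
    groups = collections.defaultdict(list)
    for item in lst:
        try:
            # Everything before a four-digit number in parentheses.
            head = re.findall(r".+(?=\(\d{4}\))", item)[0]
        except IndexError:
            # No "(dddd)" part: the whole string is its own key.
            head = item
        groups[head].append(item)
    return list(groups.values())


sample = [
    "House of Mine (1293) Item 21",
    "House of Mine (1292) Item 24",
    "The yard (1000) Item 1 ",
    "plain text",
]
result = groupitems(sample)
print(result)
```

Because dictionaries keep insertion order in Python 3.7 and later, the groups come out in the order their keys first appear, and no sorting is needed with this approach.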
Answer 2 (score: 2)
I would go for something a bit simpler. Demo here: http://dbgr.cc/8
import re
list1=[
"House of Mine (1293) Item 21",
"House of Mine (1292) Item 24",
"The yard (1000) Item 1 ",
"The yard (1000) Item 2 ",
"The yard (1000) Item 4 "
]
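def group_items(lst):
    res = {}
    # Capture everything before the last "(digits)" group.
    reg = re.compile(r"^(.*)\(\d+\).*$")
    for item in lst:  # iterate over the argument, not the global list1
        match = reg.match(item)
        # Fall back to the whole string if there is no "(digits)" part.
        key = match.group(1) if match else item
        res.setdefault(key, []).append(item)
    return res.values()
print(list(group_items(list1)))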
The output is:
[['House of Mine (1293) Item 21', 'House of Mine (1292) Item 24'], ['The yard (1000) Item 1 ', 'The yard (1000) Item 2 ', 'The yard (1000) Item 4 ']]
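Answer 3 (score: 0)
Building on my other answer and on the use of defaultdict suggested by Adam Smith, here is another approach.
It uses text.split to detect the grouping key.
It uses map to loop over the values and assign them to the right key in the defaultdict.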
>>> list1=["House of Mine (1293) Item 21","House of Mine (1292) Item 24", "The yard (1000) Item 1 ", "The yard (1000) Item 2 ", "The yard (1000) Item 4 "]
Here are the 4 lines of code:
>>> from collections import defaultdict
>>> groups = defaultdict(list)
>>> map(lambda itm: groups[itm.split("(")[0]].append(itm), list1)
[None, None, None, None, None]
>>> groups.values()
[['House of Mine (1293) Item 21', 'House of Mine (1292) Item 24'],
['The yard (1000) Item 1 ',
'The yard (1000) Item 2 ',
'The yard (1000) Item 4 ']]
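One caveat when porting this: in Python 3, map returns a lazy iterator, so the map(...) line would not append anything until that iterator is consumed. A plain loop sidesteps the issue. A minimal Python 3 sketch of the same idea, keeping the same split("(") key (the name `result` is illustrative):

```python
from collections import defaultdict

list1 = [
    "House of Mine (1293) Item 21",
    "House of Mine (1292) Item 24",
    "The yard (1000) Item 1 ",
    "The yard (1000) Item 2 ",
    "The yard (1000) Item 4 ",
]

groups = defaultdict(list)
for itm in list1:
    # Key: everything before the first "("
    groups[itm.split("(")[0]].append(itm)

# dict.values() is a view in Python 3; wrap it to get a real list.
result = list(groups.values())
print(result)
```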
Note that this assumes the first "(" is the delimiter. With values such as "The (unexpected) yard (1000) Item 44" it would probably not meet expectations, and using re would be the way to go.
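The difference between the two kinds of key can be checked directly on that example string. A small Python 3 sketch, with variable names chosen for illustration:

```python
import re

text = "The (unexpected) yard (1000) Item 44"

# Splitting on the first "(" cuts the key short.
split_key = text.split("(")[0]

# A lookahead for four digits in parentheses keeps the full prefix.
found = re.findall(r".+(?=\(\d{4}\))", text)
regex_key = found[0] if found else text

print(repr(split_key))
print(repr(regex_key))
```

With the split-based key, this item would be grouped under "The " rather than under the text that precedes "(1000)".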