Question

Given this list:

list1=["House of Mine (1293) Item 21",
       "House of Mine (1292) Item 24",
       "The yard (1000) Item 1 ",
       "The yard (1000) Item 2 ",
       "The yard (1000) Item 4 "]

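I want to add each of its items to a group (here, a list within a list) when the substring up to (XXXX) is the same.

So, in this case, I expect to get:

[["House of Mine (1293) Item 21",
  "House of Mine (1292) Item 24"],

 ["The yard (1000) Item 1 ",
  "The yard (1000) Item 2 ",
  "The yard (1000) Item 4 "]]

This is the code I managed to write, but it does not work: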

def group(list1):
    group=[]
    for i, itemg in enumerate(list1):
        try:
            group[i]
        except Exception:
            group.append([])
        for itemj in group[i]:
            if re.findall(re.split("\(\d{4}\)\(", itemg)[0], itemj):
                group[i].append(itemg)
            else:
                group.append([])
                group[-1].append(itemg)

    return group

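Thanks to another thread on Stack Overflow, I found this page on regular expressions: http://www.diveintopython3.net/regular-expressions.html

I know the answer lies in it, but I am having trouble understanding some of its concepts.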

Answer 1

Set up the list to be grouped:

>>> list1=["House of Mine (1293) Item 21","House of Mine (1292) Item 24", "The yard (1000) Item 1 ", "The yard (1000) Item 2 ", "The yard (1000) Item 4 "]

Define a function to use for sorting and grouping the items (this time using the number in the parentheses):

>>> keyf = lambda text: text.split("(")[1].split(")")[0]
>>> keyf
<function __main__.<lambda>>
>>> keyf(list1[0])
'1293'

Sort the list (in place):

>>> list1.sort() #As Adam Smith noted, alphabetical sort is good enough

Get groupby from itertools:

>>> from itertools import groupby

Check the concept:

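>>> for gr, items in groupby(list1, key = keyf):
...     print "gr", gr
...     print "items", list(items)
...
gr 1292
items ['House of Mine (1292) Item 24']
gr 1293
items ['House of Mine (1293) Item 21']
gr 1000
items ['The yard (1000) Item 1 ', 'The yard (1000) Item 2 ', 'The yard (1000) Item 4 ']
>>> list1
['House of Mine (1292) Item 24',
 'House of Mine (1293) Item 21',
 'The yard (1000) Item 1 ',
 'The yard (1000) Item 2 ',
 'The yard (1000) Item 4 ']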

Note that we have to call list on items, because items is an iterator over the group's members.
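To illustrate why the list call is needed, here is a minimal Python 3 sketch. The word list and the names `materialised`, `deferred`, and `late_read` are made up for illustration. Each group shares the underlying iterator with groupby, so a group that is read only after groupby has moved on typically comes back empty.

```python
from itertools import groupby

# Hypothetical sample data, already ordered by first letter.
words = ["apple", "avocado", "banana", "blueberry", "cherry"]


def first_letter(word):
    return word[0]


# Materialise each group right away, while groupby still points at it.
materialised = [(key, list(items)) for key, items in groupby(words, key=first_letter)]

# Store the group iterators and read them only after groupby has finished.
deferred = [(key, items) for key, items in groupby(words, key=first_letter)]
late_read = [(key, list(items)) for key, items in deferred]

print(materialised)
print(late_read[0])  # the "a" group was skipped over before it was read
```

This is why the examples here wrap items in list() inside the loop or comprehension rather than afterwards.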

Now use a list comprehension:

>>> res = [list(items) for gr, items in groupby(list1, key=keyf)]
>>> res
[['House of Mine (1292) Item 24'],
 ['House of Mine (1293) Item 21'],
 ['The yard (1000) Item 1 ',
  'The yard (1000) Item 2 ',
  'The yard (1000) Item 4 ']]

And we are done.
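One detail worth spelling out: groupby only merges items that are adjacent, so the input generally has to be ordered by the same key that is used for grouping. The following Python 3 sketch shows the difference on a shortened, deliberately interleaved copy of the data; the names `number_key`, `unsorted_groups`, and `sorted_groups` are hypothetical.

```python
from itertools import groupby


def number_key(text):
    # Text between the first "(" and the following ")".
    return text.split("(")[1].split(")")[0]


# Interleaved on purpose: the two "The yard" items are not adjacent.
data = [
    "The yard (1000) Item 1 ",
    "House of Mine (1293) Item 21",
    "The yard (1000) Item 2 ",
]

# Without sorting, the same key shows up in two separate groups.
unsorted_groups = [list(items) for _, items in groupby(data, key=number_key)]

# Sorting by the grouping key first brings equal keys together.
ordered = sorted(data, key=number_key)
sorted_groups = [list(items) for _, items in groupby(ordered, key=number_key)]

print(unsorted_groups)
print(sorted_groups)
```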

If you want to group by all the text before the first "(", the only change is:

>>> keyf = lambda text: text.split("(")[0]

Short version answering the OP

>>> from itertools import groupby
>>> list1=["House of Mine (1293) Item 21","House of Mine (1292) Item 24", "The yard (1000) Item 1 ", "The yard (1000) Item 2 ", "The yard (1000) Item 4 "]
>>> keyf = lambda text: text.split("(")[0]
>>> [list(items) for gr, items in groupby(sorted(list1), key=keyf)]
[['House of Mine (1292) Item 24', 'House of Mine (1293) Item 21'],
 ['The yard (1000) Item 1 ',
  'The yard (1000) Item 2 ',
  'The yard (1000) Item 4 ']]
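For readers on Python 3, the same idea can be wrapped in a small function. This is only a sketch: the names `group_by_prefix` and `prefix` are hypothetical, and it sorts by the grouping key instead of alphabetically, which keeps the items of each group in their original relative order because Python's sort is stable.

```python
from itertools import groupby


def group_by_prefix(items):
    """Group strings that share the text before the first "("."""

    def prefix(text):
        return text.split("(")[0]

    # groupby only merges adjacent items, so order by the same key first.
    ordered = sorted(items, key=prefix)
    return [list(group) for _, group in groupby(ordered, key=prefix)]


list1 = [
    "House of Mine (1293) Item 21",
    "House of Mine (1292) Item 24",
    "The yard (1000) Item 1 ",
    "The yard (1000) Item 2 ",
    "The yard (1000) Item 4 ",
]

result = group_by_prefix(list1)
print(result)
```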

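Variant using `re.findall`

The solutions above assume that "(" is the delimiter and ignore the requirement that four digits follow it. The re module can handle a task like this.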

>>> import re
>>> keyf = lambda text: re.findall(r".+(?=\(\d{4}\))", text)[0]
>>> text = 'House of Mine (1293) Item 21'
>>> keyf(text)
'House of Mine '

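But if the text does not contain the expected pattern, this raises IndexError: list index out of range, because we try to access the item at index 0 of an empty list.

>>> text = "nothing here"
>>> keyf(text)
IndexError: list index out of range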

We can use a simple trick to survive this: append the original text to the result so that there is always something at index 0:

>>> keyf = lambda text: (re.findall(r".+(?=\(\d{4}\))", text) + [text])[0]
>>> text = "nothing here"
>>> keyf(text)
'nothing here'
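If the list-concatenation trick feels too terse, the fallback can also be written out explicitly. The sketch below is an alternative spelling, not the answer's own code: it swaps re.findall for re.search with the same pattern, and the names `PREFIX_RE` and `prefix_key` are hypothetical.

```python
import re

# Same lookahead pattern as above: everything up to a "(dddd)" group.
PREFIX_RE = re.compile(r".+(?=\(\d{4}\))")


def prefix_key(text):
    """Return the text before "(dddd)", or the whole text if there is none."""
    match = PREFIX_RE.search(text)
    return match.group(0) if match else text


print(repr(prefix_key("House of Mine (1293) Item 21")))
print(repr(prefix_key("nothing here")))
```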

Final solution using re

>>> import re
>>> from itertools import groupby
>>> keyf = lambda text: (re.findall(r".+(?=\(\d{4}\))", text) + [text])[0]
>>> [list(items) for gr, items in groupby(sorted(list1), key=keyf)]
[['House of Mine (1292) Item 24', 'House of Mine (1293) Item 21'],
 ['The yard (1000) Item 1 ',
  'The yard (1000) Item 2 ',
  'The yard (1000) Item 4 ']]

Answer 2

I would use collections.defaultdict and re.findall with a lookahead for the parentheses.

import collections
import re

def groupitems(lst):
    groups = collections.defaultdict(list)

    for item in lst:
        try:
            head = re.findall(r".+(?=\(\d{4}\))", item)[0]
        except IndexError: # there is no (\d{4})
            head = item # so take the whole string
        groups[head].append(item)

    return groups.values()
    # if you ABSOLUTELY MUST return a list, cast it here like this:
    #   return list( groups.values() )
    # however a dict_values object is list-like and should quack nicely.
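One practical difference from the groupby approach is that a dictionary does not need sorted input. The sketch below assumes Python 3.7 or later, where dictionaries keep insertion order, so groups come out in the order their key was first seen. The function name `group_in_first_seen_order` and the interleaved sample list are hypothetical.

```python
import re
from collections import defaultdict

HEAD_RE = re.compile(r".+(?=\(\d{4}\))")


def group_in_first_seen_order(items):
    groups = defaultdict(list)
    for item in items:
        match = HEAD_RE.search(item)
        # Fall back to the whole string when there is no "(dddd)" part.
        head = match.group(0) if match else item
        groups[head].append(item)
    return list(groups.values())


# Deliberately unsorted and interleaved input.
mixed = [
    "The yard (1000) Item 1 ",
    "House of Mine (1293) Item 21",
    "The yard (1000) Item 2 ",
    "no number here",
]

result = group_in_first_seen_order(mixed)
print(result)
```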

Answer 3

I would go with something a bit simpler. Demo here: http://dbgr.cc/8

import re

list1=[
    "House of Mine (1293) Item 21",
    "House of Mine (1292) Item 24",
    "The yard (1000) Item 1 ",
    "The yard (1000) Item 2 ",
    "The yard (1000) Item 4 "
]

def group_items(lst):
    res = {}
    reg = re.compile(r"^(.*)\(\d+\).*$")
    for item in lst:
        match = reg.match(item)
        res.setdefault(match.group(1), []).append(item)

    return res.values()

print group_items(list1)

The output is:

[['House of Mine (1293) Item 21', 'House of Mine (1292) Item 24'], ['The yard (1000) Item 1 ', 'The yard (1000) Item 2 ', 'The yard (1000) Item 4 ']]
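Two properties of this pattern are easy to miss: `\d+` accepts any number of digits rather than exactly four, and `match` returns None for a string with no parenthesised number, in which case `match.group(1)` would raise AttributeError. A short Python 3 sketch of both points, using made-up sample strings and hypothetical names:

```python
import re

any_digits = re.compile(r"^(.*)\(\d+\).*$")      # pattern from this answer
four_digits = re.compile(r"^(.*)\(\d{4}\).*$")   # stricter variant, for comparison

sample = "Shed (12) Item 3"  # hypothetical item with a two-digit number

loose = any_digits.match(sample)
strict = four_digits.match(sample)
missing = any_digits.match("nothing here")

print(repr(loose.group(1)))  # the looser pattern still finds a prefix
print(strict)                # the four-digit pattern does not match
print(missing)               # no parentheses at all: None, so check before .group()
```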

Answer 4

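Based on my other answer and on the use of defaultdict proposed by Adam Smith, here is another approach.

It uses text.split to detect the grouping key.

It uses map to loop over the values and assign each one to the right key in the defaultdict.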

>>> list1=["House of Mine (1293) Item 21","House of Mine (1292) Item 24", "The yard (1000) Item 1 ", "The yard (1000) Item 2 ", "The yard (1000) Item 4 "]

Here it is in 4 lines of code:

>>> from collections import defaultdict
>>> groups = defaultdict(list)
>>> map(lambda itm: groups[itm.split("(")[0]].append(itm), list1)
[None, None, None, None, None]
>>> groups.values()
[['House of Mine (1293) Item 21', 'House of Mine (1292) Item 24'],
 ['The yard (1000) Item 1 ',
  'The yard (1000) Item 2 ',
  'The yard (1000) Item 4 ']]

Note that this assumes the first "(" is the delimiter. With values such as "The (unexpected) yard (1000) Item 44" it may not give the expected result, and using {{1} would be the way to go.
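A minimal Python 3 sketch of that difference, assuming the lookahead pattern from Answer 1 as the regular-expression alternative (the variable names are hypothetical):

```python
import re

tricky = "The (unexpected) yard (1000) Item 44"

# Splitting on the first "(" cuts the key short.
split_key = tricky.split("(")[0]

# The lookahead pattern only stops in front of a four-digit group.
regex_key = re.findall(r".+(?=\(\d{4}\))", tricky)[0]

print(repr(split_key))
print(repr(regex_key))
```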

Grouping items by string pattern in Python

4 answers:
