将子列表排序为新的子列表?

时间:2011-08-18 20:55:46

标签: python list sorting

我有大量的两元子列表,它们是名为mylist的列表的成员:

mylist = [['AB001', 22100],
          ['AB001', 32935],
          ['XC013', 99834],
          ['VD126', 18884],
          ['AB001', 34439],
          ['XC013', 86701]]

我想根据子列表是否包含与第一个项目相同的字符串,将mylist排序到新的子列表中。例如,这就是我要输出的代码:

newlist = [['AB001', 22100], ['AB001', 32935], ['AB001', 34439]],
          [['XC013', 99834], ['XC013', 86701]],
          [['VD126', 18884]]

以下是我尝试对此进行编码的方式:

mylist = sorted(mylist)
newlist = []
for sublist in mylist:
    id = sublist[0]
if id == next.id:
    newlist.append(id)
print newlist

我还试图了解itertools.groupby()是否是解决此问题的正确工具。有人可以帮我解决这个问题吗?提前谢谢。

4 个答案:

答案 0 :(得分:4)

请注意,您不能使用以前导零开头的十进制文字。我假设这实际上存储在一个变量中,因此没有前导零。

你认为这是groupby的工作是正确的:

from itertools import groupby
from operator import itemgetter

mylist = [['AB001', 22100],
          ['AB001', 32935],
          ['XC013', 99834],
          ['VD126', 18884],
          ['AB001', 4439],
          ['XC013', 86701]]

print [list(value) for key, value in groupby(sorted(mylist), key=itemgetter(0))]

这将为您提供一个列表,按子列表中的第一项分组。

[[['AB001', 4439], ['AB001', 22100], ['AB001', 32935]], 
 [['VD126', 18884]], 
 [['XC013', 86701], ['XC013', 99834]]]

答案 1 :(得分:1)

有许多替代方法可以解决此问题:

def regroup_by_di(items, key=None):
    result = {}
    callable_key = callable(key)
    for item in items:
        key_value = key(item) if callable_key else item
        if key_value not in result:
            result[key_value] = []
        result[key_value].append(item)
    return result
import collections


def regroup_by_dd(items, key=None):
    result = collections.defaultdict(list)
    callable_key = callable(key)
    for item in items:
        result[key(item) if callable_key else item].append(item)
    return dict(result)  # to be in line with other solutions
def regroup_by_sd(items, key=None):
    result = {}
    callable_key = callable(key)
    for item in items:
        key_value = key(item) if callable_key else item
        result.setdefault(key_value, []).append(item)
    return result
import itertools


def regroup_by_it(items, key=None):
    seq = sorted(items, key=key)
    result = {
        key_value: list(group)
        for key_value, group in itertools.groupby(seq, key)}
    return result
def group_by(
        seq,
        key=None):
    items = iter(seq)
    try:
        item = next(items)
    except StopIteration:
        return
    else:
        callable_key = callable(key)
        last = key(item) if callable_key else item
        i = j = 0
        for i, item in enumerate(items, 1):
            current = key(item) if callable_key else item
            if last != current:
                yield last, seq[j:i]
                last = current
                j = i
        if i >= j:
            yield last, seq[j:i + 1]


def regroup_by_gb(items, key=None):
    return dict(group_by(sorted(items, key=key), key))

这些可以分为两类:

  1. 在输入中循环创建类似dict的结构(regroup_by_di()regroup_by_dd()regroup_by_sd()
  2. 对输入进行排序,然后使用类似uniq的函数(例如itertools.groupby())(regroup_by_it()regroup_by_gb()

第一类方法的计算复杂度为O(n),而第二类方法的计算复杂度为O(n log n)

所有提议的方法都需要指定key。 对于OP的问题,可以使用operators.itemgetter(0)lambda x: x[0]。此外,要获得OP的理想结果,应该只获得list(dict.values()),例如:

from operator import itemgetter


mylist = [['AB001', 22100],
          ['AB001', 32935],
          ['XC013', 99834],
          ['VD126', 18884],
          ['AB001', 4439],
          ['XC013', 86701]]


print(list(regroup_by_di(mylist, key=itemgetter(0)).values()))
# [[['AB001', 22100], ['AB001', 32935], ['AB001', 4439]], [['XC013', 99834], ['XC013', 86701]], [['VD126', 18884]]]

对于所有基于dict的(第一类)解决方案,计时速度更快,而对于所有基于groupby的第二类解决方案的计时速度都慢。 在基于dict的解决方案中,它们的性能将在一定程度上取决于“碰撞率” ,后者与新项目创建新对象的次数成正比。 对于较高的碰撞率,regroup_by_di()可能是最快的,而对于较低的碰撞率,regroup_by_dd()可能是最快的。

基准测试如下:

  • 0.1%的碰撞率(每组约1000个元素)

plot_0p1perc

  • 10%的碰撞率(每组约10个元素)

plot_10perc

  • 50%的碰撞率(每组大约2个元素)

plot_50perc

  • 100%的碰撞率(每组大约1个元素)

plot_100perc

(更多详细信息,here。)

答案 2 :(得分:0)

collections.defaultdict

import { ZonedDateTime, LocalDate, ZoneId} from "js-joda"; import 'js-joda-timezone'; const aTimeWhenLondonIsAlreadyInTomorrow = "2019-04-15T00:00:05.000Z"; const inBetweenTimeInLondon = ZonedDateTime.parse(aTimeWhenLondonIsAlreadyInTomorrow); const inBetweenTimeInNYC = inBetweenTimeInLondon.withZoneSameInstant(ZoneId.of("America/New_York")) const dayInLondon = inBetweenTimeInLondon.toLocalDate(); const dayInNYC = inBetweenTimeInNYC.toLocalDate(); console.log(inBetweenTimeInLondon.toString()); // "2019-04-15T00:00:05Z" console.log(dayInLondon.toString()); // "2019-04-15" console.log(inBetweenTimeInNYC.toString()) // "2019-04-14T20:00:05-04:00[America/New_York]" console.log(dayInNYC.toString()); // "2019-04-14" 解决方案将产生O( n log n )成本,因为必须先对输入进行 排序。您可以使用itertools.groupby的列表来保证O( n )的解决方案:

defaultdict

答案 3 :(得分:0)

不导入任何软件包:

  • 构建字典,然后将值获取到列表中
    • 使用.get来确定密钥是否存在,如果密钥不存在,请返回某个指定的值library(ComplexHeatmap) library(RColorBrewer) Import data: x = readRDS("data.rds") #Select number of clusters to plot: clusters = 7 #Generate dendrogram to make block annotation: dend = as.dendrogram(hclust(as.dist(1- cor(t(x))),"complete"), hang=-1) #Create annotation block for heatmap cl_num = cutree(dend, k = clusters) cl_col = brewer.pal(clusters,"Set2") names(cl_col) = unique(cl_num) cl_col = list(cl_num=cl_col) row_ha = rowAnnotation(cl_num = cl_num, col = cl_col, annotation_legend_param = list(title = "Cluster"), show_annotation_name = F) ###Create heatmap: Heatmap(x, clustering_distance_columns = "euclidean", clustering_method_columns = "complete", clustering_distance_rows = "pearson", clustering_method_rows = "complete", row_split = clusters, left_annotation = row_ha, show_row_names = F, heatmap_legend_param = list(title = "Z-score"), column_title = "Treatment", row_title = "Genes", use_raster = F, column_dend_height = unit(0.1, "cm"), column_names_gp = gpar(fontsize = 16), column_names_rot = 0, column_names_centered = TRUE)
    • dict.get的默认值为None,因此此方法从不引发KeyError
      • 如果None是字典中的值,则更改None返回的默认值。
        • .get
  • 因为:基于共同的第一项的子列表
    • 将索引0添加为test.get(t[0], 'something here'),然后将keylist添加为t值。
dict