Question

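I am trying to use pandas to analyze a fairly large data set (~5 GB). I want to split the data set into groups, perform a Cartesian product on each group, and then aggregate the results.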

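The pandas apply operation is quite expressive: I can group first, compute the Cartesian product of each group with apply, and then aggregate the results with sum. The problem with this approach is that apply is not lazy: it computes all the intermediate results before the aggregation, and the intermediate results (the Cartesian product of each group) are very large.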

I was looking at Apache Spark and found a very interesting operator called cogroup. It is defined as follows:

When called on datasets of type (K, V) and (K, W), returns a dataset of (K, (Iterable<V>, Iterable<W>)) tuples. This operation is also called groupWith.

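This seems to be exactly what I want. If I could cogroup first and then sum, the intermediate results would not be expanded (assuming cogroup works in the same lazy fashion as group).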
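To make the semantics concrete, here is a minimal pure-Python sketch of what cogroup computes on two small lists of (key, value) pairs. The helper name `cogroup` and the sample data are made up for illustration, and the sketch is eager and in-memory, so it only shows the shape of the result, not how Spark evaluates it.

```python
from collections import defaultdict


def cogroup(left, right):
    """Group two iterables of (key, value) pairs by key.

    Returns a dict mapping each key to a pair of lists:
    (values seen in `left`, values seen in `right`).
    Hypothetical helper for illustration; not part of pandas or Spark.
    """
    grouped = defaultdict(lambda: ([], []))
    for key, value in left:
        grouped[key][0].append(value)
    for key, value in right:
        grouped[key][1].append(value)
    return dict(grouped)


# Made-up sample data: (id, cluster) pairs and (id, count) pairs.
left = [("i1", "A"), ("i1", "B"), ("i2", "A")]
right = [("i1", 2), ("i2", 1), ("i3", 5)]
result = cogroup(left, right)
```

A key that appears on only one side still shows up in the result, with an empty list for the other side.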

Is there an operation similar to cogroup in pandas, or how else can I achieve my goal efficiently?

Here is my example:

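I want to group the data by id, compute the Cartesian product within each group, then group by cluster_x and cluster_y and aggregate count_x and count_y using sum. The following code works, but it is extremely slow and uses too much memory.

import pandas

# add dummy_key to do Cartesian product by merge
df['dummy_key'] = 1

def join_group(g):
    # merging a group with itself on the constant key pairs every row with every row
    return pandas.merge(g, g, on='dummy_key')\
    [['cluster_x', 'count_x', 'cluster_y', 'count_y']]

df_count_stats = df.groupby(['id'], as_index=True).apply(join_group).\
    groupby(['cluster_x', 'cluster_y'], as_index=False)\
    [['count_x', 'count_y']].sum()

The toy data set

   id cluster  count
0  i1       A      2
1  i1       B      3
2  i2       A      1
3  i2       B      4

The intermediate result after the apply (can be large)

     cluster_x  count_x cluster_y  count_y
id
i1 0         A        2         A        2
   1         A        2         B        3
   2         B        3         A        2
   3         B        3         B        3
i2 0         A        1         A        1
   1         A        1         B        4
   2         B        4         A        1
   3         B        4         B        4

The desired final result

  cluster_x cluster_y  count_x  count_y
0         A         A        3        3
1         A         B        3        7
2         B         A        7        3
3         B         B        7        7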

Answer 1

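My first attempt failed, sort of: although I was able to limit the memory usage (by summing over the Cartesian product within each group), it was much slower than the original. But for your particular desired output, I think we can simplify the problem considerably: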

import numpy as np, pandas as pd

def fake_data(nids, nclusters, ntile):
    # One row per (id, cluster) pair, repeated ntile times, with random counts
    ids = ["i{}".format(i) for i in range(1,nids+1)]
    clusters = ["A{}".format(i) for i in range(nclusters)]
    df = pd.DataFrame(index=pd.MultiIndex.from_product([ids, clusters], names=["id", "cluster"]))
    df = df.reset_index()
    df = pd.concat([df]*ntile)
    df["count"] = np.random.randint(0, 10, size=len(df))
    return df


def join_group(g):
    # Cartesian product of a group with itself via a merge on the constant key
    m = pd.merge(g, g, on='dummy_key')
    return m[['cluster_x', 'count_x', 'cluster_y', 'count_y']]

def old_method(df):
    # The approach from the question; note that it adds dummy_key to df in place
    df["dummy_key"] = 1
    h1 = df.groupby(['id'], as_index=True).apply(join_group)
    h2 = h1.groupby(['cluster_x', 'cluster_y'], as_index=False)
    h3 = h2[['count_x', 'count_y']].sum()
    return h3

def new_method1(df):
    # Sum the counts per cluster first, then take the Cartesian product of
    # the (much smaller) table of per-cluster totals
    m1 = df.groupby("cluster", as_index=False)["count"].sum()
    m1["dummy_key"] = 1
    m2 = m1.merge(m1, on="dummy_key")
    m2 = m2.sort_index(axis=1).drop(["dummy_key"], axis=1)
    return m2

which gives (with df as your toy frame):

>>> new_method1(df)
  cluster_x cluster_y  count_x  count_y
0         A         A        3        3
1         A         B        3        7
2         B         A        7        3
3         B         B        7        7
>>> df2 = fake_data(100, 100, 1)
>>> %timeit old_method(df2)
1 loops, best of 3: 954 ms per loop
>>> %timeit new_method1(df2)
100 loops, best of 3: 8.58 ms per loop
>>> (old_method(df2) == new_method1(df2)).all().all()
True

and even

>>> df2 = fake_data(100, 100, 100)
>>> %timeit new_method1(df2)
10 loops, best of 3: 88.8 ms per loop
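One caveat worth checking against your real data: summing per cluster first reproduces the per-id Cartesian product only when every id contains every cluster exactly once, as in the toy frame and in fake_data(..., ntile=1). The sketch below rebuilds the toy frame and compares the two computations, then repeats the comparison after duplicating one row. The function names are mine, and the self-merge on id is a stand-in for the groupby/apply with dummy_key, not the code from the question.

```python
import pandas as pd

df = pd.DataFrame({
    "id": ["i1", "i1", "i2", "i2"],
    "cluster": ["A", "B", "A", "B"],
    "count": [2, 3, 1, 4],
})


def per_id_product(frame):
    # A self-merge on id pairs up all rows sharing an id, which is the
    # Cartesian product within each id.
    pairs = frame.merge(frame, on="id")
    return (pairs.groupby(["cluster_x", "cluster_y"], as_index=False)
                 [["count_x", "count_y"]].sum())


def cluster_totals_product(frame):
    # The shortcut: per-cluster totals first, then one small Cartesian product.
    totals = frame.groupby("cluster", as_index=False)["count"].sum()
    totals["dummy_key"] = 1
    pairs = totals.merge(totals, on="dummy_key").drop(columns="dummy_key")
    pairs = pairs[["cluster_x", "cluster_y", "count_x", "count_y"]]
    return pairs.sort_values(["cluster_x", "cluster_y"]).reset_index(drop=True)


# Every id has every cluster exactly once: both computations agree.
agree_on_toy = per_id_product(df).equals(cluster_totals_product(df))

# Duplicate one row (i1, A, 2): the pair sums are now weighted by how many
# rows each id has per cluster, and the shortcut no longer matches.
dup = pd.concat([df, df.iloc[[0]]], ignore_index=True)
agree_on_dup = per_id_product(dup).equals(cluster_totals_product(dup))
```

If your real frame has repeated or missing (id, cluster) pairs, the shortcut answers a different question than the original code does.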

Whether this improves your real case enough, I'm not sure.
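The memory-limited first attempt mentioned at the top is not shown. A minimal sketch of that idea, reducing each group's Cartesian product to its pair sums before moving on to the next group, could look like the following. The function name is hypothetical and this is a reconstruction of the idea under my own assumptions, not the original code.

```python
import pandas as pd


def pair_sums_one_id_at_a_time(df):
    """Sum count_x and count_y per (cluster_x, cluster_y), one id at a time.

    Only one id's Cartesian product is held in memory at once.
    Assumes df is non-empty and has columns id, cluster and count.
    """
    total = None
    for _, g in df.groupby("id"):
        # Cartesian product within this id only.
        pairs = g.merge(g, on="id")
        part = (pairs.groupby(["cluster_x", "cluster_y"])
                     [["count_x", "count_y"]].sum())
        # Fold this id's sums into the running total. fill_value=0 handles
        # cluster pairs that have not appeared so far; the result may become
        # float when such pairs are introduced.
        total = part if total is None else total.add(part, fill_value=0)
    return total.reset_index()


toy = pd.DataFrame({
    "id": ["i1", "i1", "i2", "i2"],
    "cluster": ["A", "B", "A", "B"],
    "count": [2, 3, 1, 4],
})
sums = pair_sums_one_id_at_a_time(toy)
```

The Python-level loop over ids is the likely reason such a version runs slower than the single groupby/apply, even though its peak memory is bounded by the largest single group.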

cogroup-like operation in pandas
