Question

I have a very large dataset with the following index and column headers.

+------+------------------------+------------------------------+--------------------------+------------------------------------+-------------------------------------+------------------------+--------------------------+--------------------------------+----------------------------+--------------------------------------+---------------------------------------+--------------------------+
|      | count: interaction_eis | count: interaction_eis_reply | count: interaction_match | count: interaction_single_message_ | count: interaction_single_message_1 | count: interaction_yes | dc(uid): interaction_eis | dc(uid): interaction_eis_reply | dc(uid): interaction_match | dc(uid): interaction_single_message_ | dc(uid): interaction_single_message_1 | dc(uid): interaction_yes |
+------+------------------------+------------------------------+--------------------------+------------------------------------+-------------------------------------+------------------------+--------------------------+--------------------------------+----------------------------+--------------------------------------+---------------------------------------+--------------------------+
| uid  |                        |                              |                          |                                    |                                     |                        |                          |                                |                            |                                      |                                       |                          |
| 38   |                     36 |                            0 |                        0 |                                 14 |                                   0 |                    163 |                        1 |                              0 |                          0 |                                    1 |                                     0 |                        1 |
| 66   |                     63 |                            0 |                        0 |                                  0 |                                   0 |                      0 |                        1 |                              0 |                          0 |                                    0 |                                     0 |                        0 |
| 1466 |                      0 |                            0 |                        0 |                                  0 |                                   0 |                      1 |                        0 |                              0 |                          0 |                                    0 |                                     0 |                        1 |
| 1709 |                     51 |                            0 |                        0 |                                  1 |                                   0 |                      9 |                        1 |                              0 |                          0 |                                    1 |                                     0 |                        1 |
| 1844 |                     66 |                            0 |                        1 |                                  3 |                                   1 |                     17 |                        1 |                              0 |                          1 |                                    1 |                                     1 |                        1 |
+------+------------------------+------------------------------+--------------------------+------------------------------------+-------------------------------------+------------------------+--------------------------+--------------------------------+----------------------------+--------------------------------------+---------------------------------------+--------------------------+

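I am trying to group the UIDs by the types of interaction they received: if a user has only one specific type of interaction, they should be grouped only with other users who have only that specific interaction type.

To do this, I first took all the dc(uid) columns, which hold just 1 "hit" per interaction type (or 0 if that interaction type never occurred), and aggregated them into groups row by row, like this: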

cols = [i for i in list(all_f_rm.columns) if i[0]=="d"]

def aggregate(row):
    key = ""
    for i in cols:
        key+=str(row[i])

    if key not in results:
        results[key] = []
    results[key].append(row.name)

results = {}
all_f_rm.apply(aggregate, axis=1)

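results.keys() holds all the combinations of interaction types that occur (35 of them), and the value for each key is the list of every index (UID) that belongs to that combination. It looks like this: {'001101': [141168, 153845, 172598, 254401, 448276,...

Next, I wrote a function to filter out all the non-matching rows for each combination/key: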

def tableFor(key):
    return all_f_rm[all_f_rm.apply(lambda row: row.name in results[key], axis=1)]

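tableFor('001101') displays exactly the dataframe I want.

My problem is that I wrote a list comprehension to loop through all 35 combinations, like this: [tableFor(x) for x in results.keys()], but it takes forever (more than an hour and still not finished), and I need to do this on 5 more datasets. Is there a more efficient way to accomplish what I am trying to do?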

Answer 1

IIUC, you can do what you want with a groupby. Building a toy dataframe like yours:

import numpy as np
import pandas as pd

df = pd.DataFrame({"uid": np.arange(10**6)})
for col in range(6):
    df["dc{}".format(col)] = np.random.randint(0,2,len(df))

We can group by the columns of interest and quickly get the associated ID numbers:

>>> dcs = [col for col in df.columns if col.startswith("dc")]
>>> df.groupby(dcs)["uid"].unique()
dc0  dc1  dc2  dc3  dc4  dc5
0    0    0    0    0    0      [302, 357, 383, 474, 526, 614, 802, 812, 865, ...
                         1      [7, 96, 190, 220, 405, 453, 534, 598, 606, 866...
                    1    0      [16, 209, 289, 355, 430, 620, 634, 736, 780, 7...
                         1      [9, 79, 166, 268, 408, 434, 435, 447, 572, 749...
               1    0    0      [60, 120, 196, 222, 238, 346, 426, 486, 536, 5...
                         1      [2, 53, 228, 264, 315, 517, 557, 621, 626, 630...
                    1    0      [42, 124, 287, 292, 300, 338, 341, 350, 500, 5...
                         1      [33, 95, 140, 192, 225, 282, 328, 339, 365, 44...
          1    0    0    0      [1, 59, 108, 134, 506, 551, 781, 823, 836, 861...
                         1      [149, 215, 380, 394, 436, 482, 570, 600, 631, ...
                    1    0      [77, 133, 247, 333, 374, 782, 809, 892, 1096, ...
                         1      [14, 275, 312, 326, 343, 444, 569, 692, 770, 7...
               1    0    0      [69, 104, 143, 404, 431, 468, 636, 639, 657, 7...
                         1      [178, 224, 367, 402, 664, 666, 739, 807, 871, ...
[...]
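
The same idea can be applied to a frame shaped like the one in the question, where uid is the index rather than a column. The following is only a sketch: the miniature frame, its values, and the names frame, dc_cols and uids_by_combo are made up for illustration. It relies on the .groups attribute of a groupby, which maps each group key to the index labels of its rows.

```python
import pandas as pd

# Hypothetical miniature of the question's frame: uid is the index and the
# indicator columns start with "dc(uid)". The values are invented.
frame = pd.DataFrame(
    {
        "count: interaction_eis": [36, 63, 0, 51],
        "dc(uid): interaction_eis": [1, 1, 0, 1],
        "dc(uid): interaction_yes": [1, 0, 1, 1],
    },
    index=pd.Index([38, 66, 1466, 1709], name="uid"),
)

# Select the indicator columns by prefix instead of by first letter.
dc_cols = [c for c in frame.columns if c.startswith("dc(uid)")]

# .groups maps each combination of indicator values (a tuple) to the index
# labels of the rows in that group. Joining the tuple into a string gives
# keys in the same style as the question's results dict.
uids_by_combo = {
    "".join(str(v) for v in combo): list(labels)
    for combo, labels in frame.groupby(dc_cols).groups.items()
}
print(uids_by_combo)
```

This builds the whole key-to-uids mapping in a single grouping pass instead of a Python-level function call per row.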

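If you would prefer the associated groups themselves, you can also get a list or dictionary out of this, instead of simply pulling out the indices: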

>>> groups = list(df.groupby(dcs, as_index=False))
>>> print(groups[0][0])
(0, 0, 0, 0, 0, 0)
>>> print(groups[0][1])
           uid  dc0  dc1  dc2  dc3  dc4  dc5
302        302    0    0    0    0    0    0
357        357    0    0    0    0    0    0
383        383    0    0    0    0    0    0
[...]
999730  999730    0    0    0    0    0    0
999945  999945    0    0    0    0    0    0
999971  999971    0    0    0    0    0    0

[15357 rows x 7 columns]

And so on.
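
Part of the cost in the question comes from tableFor itself: it runs a Python lambda for every row, and each call does a linear search through a list with `row.name in results[key]`, repeated for all 35 keys. As a rough sketch (the names tables, wanted_uids and subset are illustrative, and the toy frame is a smaller, seeded version of the one above), all the per-combination tables can instead be collected in one pass over the groupby. If the uid lists already exist, isin gives a vectorized membership test.

```python
import numpy as np
import pandas as pd

# Toy frame in the same shape as the one above (seeded so it is repeatable).
rng = np.random.default_rng(0)
df = pd.DataFrame({"uid": np.arange(1000)})
for col in range(6):
    df["dc{}".format(col)] = rng.integers(0, 2, len(df))

dcs = [col for col in df.columns if col.startswith("dc")]

# One pass over the data: each group's sub-frame is stored under a string key
# such as '001101', mirroring the keys built by aggregate() in the question.
tables = {
    "".join(str(v) for v in combo): sub
    for combo, sub in df.groupby(dcs)
}

# If the uid lists already exist, a vectorized membership test is an
# alternative to the row-wise apply (hypothetical uid list shown here).
wanted_uids = [3, 5, 8]
subset = df[df["uid"].isin(wanted_uids)]
```

After this, tables['001101'] plays the role of tableFor('001101'), provided that combination occurs in the data. In the question's frame the uids are the index, so the isin test would be applied to the index instead of a uid column.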

How to speed up complex/difficult data filtering in pandas
