Creating lists from row data

Time: 2017-10-20 15:24:45

Tags: python-3.x pandas

My input data has the following format:

id  offset  code
 1      3    21
 1      3    24
 1      5    21
 2      1    84
 3      5    57
 3      5    21
 3      5    92
 3     10    83
 3     10    21

I would like the output to be in the following format:

id   offset                   code
 1    [3,5]         [[21,24],[21]]
 2      [1]                 [[84]]
 3   [5,10]   [[21,57,92],[21,83]]

The code I have come up with so far is shown below:

import random
import pandas

random.seed(10000)

# Size of the synthetic data set; uncomment a larger setting to reproduce the slowdown.
param = dict(nrow=100, nid=10, noffset=8, ncode=100)
#param = dict(nrow=1000, nid=10, noffset=8, ncode=100)
#param = dict(nrow=100000, nid=1000, noffset=50, ncode=5000)
#param = dict(nrow=10000000, nid=10000, noffset=100, ncode=5000)

# Named df rather than pd so it does not clash with the usual pandas alias.
df = pandas.DataFrame({
    "id": random.choices(range(1, param["nid"] + 1), k=param["nrow"]),
    "offset": random.choices(range(param["noffset"]), k=param["nrow"]),
})
df["code"] = random.choices(range(param["ncode"]), k=param["nrow"])
df = df.sort_values(["id", "offset", "code"]).reset_index(drop=True)

# Distinct offsets per id; sorted() is used because set iteration order is not guaranteed.
tmp1 = df.groupby("id")["offset"].apply(lambda x: sorted(set(x))).reset_index()
# Codes per (id, offset), then those per-offset lists collected per id.
tmp2 = (
    df.groupby(["id", "offset"])["code"].apply(list).reset_index()
      .groupby("id", sort=True)["code"].apply(list).reset_index()
)

out = pandas.merge(tmp1, tmp2, on="id", sort=False)

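It does give me the output I want, but it is very slow when the DataFrame is large. My real DataFrame has more than 40 million rows. In the example, uncomment the fourth param statement (10,000,000 rows) and you will see how slow it is.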

Can you please help me make this run faster?
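For reference, one direction that might be worth trying is to sort once and then build the nested lists in a single pass over plain Python lists, instead of two groupby/apply chains plus a merge. The following is only a minimal sketch under that assumption: `nest_codes` is a hypothetical helper name, it has not been benchmarked against the 40-million-row case, and it uses the small sample from the top of the question as input.

```python
import pandas as pd


def nest_codes(frame):
    """Return one row per id with a list of offsets and a matching list of code lists."""
    # Sorting guarantees that equal (id, offset) pairs are adjacent and codes are ascending.
    frame = frame.sort_values(["id", "offset", "code"])
    offsets = {}  # id -> [offset, ...]
    codes = {}    # id -> [[code, ...], ...], parallel to offsets[id]
    for i, off, code in zip(frame["id"].tolist(),
                            frame["offset"].tolist(),
                            frame["code"].tolist()):
        offs = offsets.setdefault(i, [])
        cds = codes.setdefault(i, [])
        # A new offset for this id starts a new inner code list.
        if not offs or offs[-1] != off:
            offs.append(off)
            cds.append([])
        cds[-1].append(code)
    return pd.DataFrame({
        "id": list(offsets),
        "offset": list(offsets.values()),
        "code": list(codes.values()),
    })


# The sample input from the question.
sample = pd.DataFrame({
    "id":     [1, 1, 1, 2, 3, 3, 3, 3, 3],
    "offset": [3, 3, 5, 1, 5, 5, 5, 10, 10],
    "code":   [21, 24, 21, 84, 57, 21, 92, 83, 21],
})
result = nest_codes(sample)
print(result)
```

Whether this is actually faster at full scale would need to be measured; the loop still runs in Python, so the main saving, if any, comes from avoiding the per-group lambda calls and the merge.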

1 answer:

Answer 0: (score: 5)

public CompletableFuture<Result> getFuture(long timeOut, TimeUnit u) {
    CompletableFuture<A> resultA = serviceA.call();
    CompletableFuture<B> resultB = resultA.thenCompose(a -> serviceB.call(a));
    CompletableFuture<C> resultC = resultA.thenCompose(a -> serviceC.call(a));
    ScheduledExecutorService e = Executors.newSingleThreadScheduledExecutor();
    e.schedule(() -> resultB.complete(fallbackB), timeOut, u);
    e.schedule(() -> resultC.complete(fallbackC), timeOut, u);
    CompletableFuture<Void> bAndC = CompletableFuture.allOf(resultB, resultC);
    bAndC.thenRun(e::shutdown);
    return bAndC.thenApply(ignoredVoid ->
                           combine(resultA.join(), resultB.join(), resultC.join()));
}