What is the most efficient way to bin and reduce in Python?

Asked: 2019-08-23 21:18:11

Tags: python pandas numpy binning

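I need an efficient way to first bin an array into groups and then reduce the binned values to the mean of each category.

I suspect numpy and pandas are the best modules for this, so I implemented a naive approach, but I can't find a way to use numpy's fast operations for every step.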

Example:

import numpy as np
import pandas as pd

# Create a fake set of data.
df = pd.DataFrame({"to_query": np.random.random(100),
                   "to_sum": np.random.random(100)})

# Create some bins and assign each value to the right bin.
bins = np.linspace(0, 1, 11)
df["bin"] = pd.cut(df["to_query"], bins)

# Sorting by values in bin probably speeds up assignment.
df = df.sort_values(by="bin")

# Different bin categories.
unique_cats = np.unique(df["bin"])

# Here is where np and pd start to be limited.
cats = {cat: [] for cat in unique_cats}
for index, row in df.iterrows():
    cats[row["bin"]].append(row["to_query"])
cats = {cat: np.mean(vals) for (cat, vals) in cats.items()}

2 answers:

Answer 0 (score: 1)

This is what I would do:

import numpy as np
import pandas as pd

# Create a fake set of data.
df = pd.DataFrame({"to_query": np.random.random(100),
                   "to_sum": np.random.random(100)})

Then:

def make_bins(series, bins):
    min_v, max_v = np.min(series), np.max(series)
    # Pad the range slightly so the maximum value stays inside the last bin.
    epsilon = max(np.finfo(float).eps, np.finfo(float).eps * (max_v - min_v))
    return np.floor((series - min_v) / (max_v - min_v + epsilon) * bins)

df["bins"] = make_bins(df["to_query"], 11)

df.groupby("bins").agg('mean')

Result:

      to_query    to_sum
bins                    
0.0   0.047117  0.554289
1.0   0.161922  0.521029
2.0   0.226992  0.465175
3.0   0.327877  0.592192
4.0   0.420162  0.359697
5.0   0.504586  0.547049
6.0   0.585511  0.350083
7.0   0.685560  0.677394
8.0   0.772207  0.606797
9.0   0.866236  0.516578
10.0  0.946512  0.547876
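
To make the arithmetic in make_bins concrete, here is a small worked sketch on a hand-picked series. It assumes the epsilon is computed from max_v - min_v (a non-negative range), and the sample values and the choice of 4 bins are illustrative only. Note that this scheme splits the observed range [min, max] into equal-width bins, which is not the same as cutting on fixed edges such as np.linspace(0, 1, 11).

```python
import numpy as np
import pandas as pd


def make_bins(series, bins):
    # Rescale the values to [0, bins) and floor them to get a bin label.
    min_v, max_v = np.min(series), np.max(series)
    # The tiny padding keeps the maximum from landing in an extra bin `bins`.
    epsilon = max(np.finfo(float).eps, np.finfo(float).eps * (max_v - min_v))
    return np.floor((series - min_v) / (max_v - min_v + epsilon) * bins)


# Illustrative values chosen away from bin edges to avoid rounding surprises.
values = pd.Series([0.0, 0.1, 0.3, 0.6, 0.9, 1.0])
labels = make_bins(values, 4)

# With 4 bins over [0, 1]: 0.1 -> floor(0.4) = 0, 0.3 -> floor(1.2) = 1,
# 0.6 -> floor(2.4) = 2, 0.9 -> floor(3.6) = 3, and the maximum stays in bin 3.
print(labels.tolist())
```

Values that sit exactly on a bin edge can fall into the lower bin because of the padding, so this labelling is best treated as approximate at the edges.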

Answer 1 (score: 1)

Best approach: timing comparison

The approaches are listed with the fastest first.

import numpy as np
import pandas as pd

# Create a fake set of data.
df = pd.DataFrame({"to_query": np.random.random(1 * 10 ** 4), "to_sum": np.random.random(1 * 10 ** 4)})
df["bin"] = pd.cut(df["to_query"], np.linspace(0, 1, 11))
df = df.sort_values(by="bin")

The setup above is a prerequisite for every timing below.

@不同的代码: groupby on the pd.cut bins

%%time
# Bind the result to a new name so df stays intact for the timings below.
means = df.groupby('bin').mean()

Wall time: 7.22 ms

@CJR

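%%time
def make_bins(series, bins):
    min_v, max_v = np.min(series), np.max(series)
    # Pad the range slightly so the maximum value stays inside the last bin.
    epsilon = max(np.finfo(float).eps, np.finfo(float).eps * (max_v - min_v))
    return np.floor((series - min_v) / (max_v - min_v + epsilon) * bins)

# Assign the bin labels before grouping; without this the "bins" column does not exist.
df["bins"] = make_bins(df["to_query"], 11)
df.loc[:, ["bins", "to_sum"]].groupby("bins").agg('mean')

Wall time: 11.9 ms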
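
For comparison, a NumPy-only variant that was not part of these timings might look like the sketch below. It uses np.digitize to label each value and np.bincount to accumulate per-bin sums and counts. The function name binned_mean and the sample data are hypothetical. One difference to keep in mind: np.digitize uses left-closed bins [a, b) by default, whereas pd.cut uses right-closed bins (a, b].

```python
import numpy as np


def binned_mean(x, y, edges):
    """Mean of y within each bin of x; bins without samples give NaN."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n_bins = len(edges) - 1
    # np.digitize returns i with edges[i-1] <= x < edges[i]; shift to 0-based.
    idx = np.digitize(x, edges) - 1
    # Keep only samples that fall inside [edges[0], edges[-1]).
    inside = (idx >= 0) & (idx < n_bins)
    sums = np.bincount(idx[inside], weights=y[inside], minlength=n_bins)
    counts = np.bincount(idx[inside], minlength=n_bins)
    # Empty bins divide 0 by 0; silence the warning and let them become NaN.
    with np.errstate(invalid="ignore", divide="ignore"):
        return sums / counts


# Hypothetical sample data: two values share the second bin [0.1, 0.2).
x = [0.05, 0.15, 0.17, 0.95]
y = [1.0, 2.0, 4.0, 10.0]
means = binned_mean(x, y, np.linspace(0, 1, 11))
print(means)
```

Because everything stays in flat arrays, this avoids building a categorical column, at the cost of losing the labelled interval index that the groupby result carries.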

Original approach

%%time
# Different bin categories.
unique_cats = np.unique(df["bin"])

# Here is where np and pd start to be limited.
cats = {cat: [] for cat in unique_cats}
for index, row in df.iterrows():
    cats[row["bin"]].append(row["to_query"])
cats = {cat: np.mean(vals) for (cat, vals) in cats.items()}

Wall time: 2.2 s

That is a huge time saving, thanks, and the code is more concise too!
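
The %%time magic only works inside IPython or Jupyter. A rough way to repeat the comparison in a plain script is sketched below with the standard timeit module. The helper names groupby_mean and loop_mean, the fixed seed, and the smaller sample size of 2,000 rows are assumptions made for the sketch, so the absolute timings will not match the numbers quoted here.

```python
import timeit

import numpy as np
import pandas as pd

# Seeded data so repeated runs bin the same values (sample size is an assumption).
rng = np.random.default_rng(0)
df = pd.DataFrame({"to_query": rng.random(2_000), "to_sum": rng.random(2_000)})
df["bin"] = pd.cut(df["to_query"], np.linspace(0, 1, 11))


def groupby_mean():
    # observed=True restricts the result to bins that actually contain data.
    return df.groupby("bin", observed=True)["to_query"].mean()


def loop_mean():
    # Row-by-row accumulation, as in the original approach.
    cats = {}
    for _, row in df.iterrows():
        cats.setdefault(row["bin"], []).append(row["to_query"])
    return {cat: np.mean(vals) for cat, vals in cats.items()}


# Take the best of three single runs for each approach.
t_fast = min(timeit.repeat(groupby_mean, number=1, repeat=3))
t_slow = min(timeit.repeat(loop_mean, number=1, repeat=3))
print(f"groupby: {t_fast * 1e3:.2f} ms, loop: {t_slow * 1e3:.2f} ms")

# Sanity check that both approaches agree on the per-bin means.
fast = groupby_mean()
slow = loop_mean()
slow_sorted = [slow[k] for k in sorted(slow, key=lambda k: k.left)]
```

Taking the minimum over a few repeats reduces noise from other processes, which matters when the fast path runs in only a few milliseconds.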