Best way to do an element-wise comparison between two pandas Series

Time: 2020-05-12 10:07:28

Tags: python pandas dataframe

I have two pandas Series:

s1, which can have a very large number of rows, and s2, which is a column of a DataFrame df and has only 20 rows. The indices of the two Series are different.

s1 can contain NaN values.

Here is the data:

s1:

id
1      4.5
2     15.0
3     13.0
4     14.0
5     18.0
6     15.0
7     13.0
8     14.0
9      NaN
10     NaN
11     NaN
12    18.0
13     NaN
14     NaN
15     NaN

df:

    col1     s2
0   20.0    0.0
1   19.0    4.5
2   18.0    5.0
3   17.0    6.0
4   16.0    7.0
5   15.0    8.0
6   14.0    9.0
7   13.0   10.0
8   12.0   11.0
9   11.0   12.0
10  10.0   13.0
11   9.0   15.0
12   8.0   16.0
13   7.0   18.0
14   6.0   20.0
15   5.0   22.0
16   4.0   24.0
17   3.0   26.0
18   2.0   28.0
19   1.0  100.0

For each element of s1, I want to retrieve the value of col1 from df for the last row of df whose s2 is less than or equal to that element.

For example, for id 1 we have s1 = 4.5; the largest s2 that is less than or equal to 4.5 is df.s2 = 4.5, so I want to retrieve the corresponding col1 value, 19.
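For reference, here is a minimal setup that reproduces the sample data above (the imports and constructor calls are my own reconstruction, assuming the names s1 and df used throughout):

import numpy as np
import pandas as pd

# s1: the long Series, indexed by id, may contain NaN
s1 = pd.Series(
    [4.5, 15.0, 13.0, 14.0, 18.0, 15.0, 13.0, 14.0,
     np.nan, np.nan, np.nan, 18.0, np.nan, np.nan, np.nan],
    index=pd.RangeIndex(1, 16, name="id"),
)

# df: the 20-row lookup frame with columns col1 and s2
df = pd.DataFrame({
    "col1": np.arange(20.0, 0.0, -1.0),
    "s2": [0.0, 4.5, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0,
           13.0, 15.0, 16.0, 18.0, 20.0, 22.0, 24.0, 26.0, 28.0, 100.0],
})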

Here is my current solution. I would like to know whether there is a better way (faster, perhaps a pandas built-in?) to get the same result:

[min(df[df['s2'].le(element)].col1, default=np.NaN) for element in s1]

2 Answers:

Answer 0: (score: 3)

The idea is to use numpy: broadcast the comparison of every value in the column s2 against every value of the Series s1 into a 2D boolean array, pass that mask to numpy.where to set NaN where the condition does not hold, and finally take the row-wise minimum with numpy.nanmin:

# broadcast to a (len(s1), len(df)) mask: True where df['s2'] <= the s1 value of that row
m = df['s2'].to_numpy() <= s1.to_numpy()[:, None]

# keep col1 where the mask is True, NaN elsewhere, then take the row-wise minimum
a = np.nanmin(np.where(m, df['col1'], np.nan), axis=1)
print (a)
[19.  9. 10. 10.  7.  9. 10. 10. nan nan nan  7. nan nan nan]
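As a small follow-up (not part of the original answer), the array can be wrapped back into a Series so the result keeps s1's index:

a_series = pd.Series(a, index=s1.index)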

Performance, original sample data:

In [63]: %%timeit
    ...: [min(df[df['s2'].le(element)].col1, default = np.NaN) for element in s1]
    ...: 
    ...: 
9.21 ms ± 305 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [64]: %%timeit
    ...: m = df['s2'].to_numpy() <= s1.to_numpy()[:, None]
    ...: a = np.nanmin(np.where(m, df['col1'], np.nan), axis=1)
72.4 µs ± 870 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Larger data, the sample repeated 100 times:

#2k rows
df = pd.concat([df] * 100, ignore_index=True)
#1.5k rows
s1 = pd.concat([s1] * 100, ignore_index=True)


In [68]: %%timeit
    ...: [min(df[df['s2'].le(element)].col1, default = np.NaN) for element in s1]
    ...: 
    ...: 
1.12 s ± 17.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [69]: %%timeit
    ...: m = df['s2'].to_numpy() <= s1.to_numpy()[:, None]
    ...: a = np.nanmin(np.where(m, df['col1'], np.nan), axis=1)
34.2 ms ± 305 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Answer 1: (score: 1)

You can use an IntervalIndex.

First, the data:

df1 = pd.DataFrame(
    np.array(
        [
            4.5,
            15.0,
            13.0,
            14.0,
            18.0,
            15.0,
            13.0,
            14.0,
            np.nan,
            np.nan,
            np.nan,
            18.0,
            np.nan,
            np.nan,
            np.nan,
        ]
    ),
    columns=["s1"],
)
print(df1)
       s1
0   4.500
1  15.000
2  13.000
3  14.000
4  18.000
5  15.000
6  13.000
7  14.000
8     nan
9     nan
10    nan
11 18.000
12    nan
13    nan
14    nan

Then the lookup DataFrame (the s2 column from the question is called end here):

df = pd.DataFrame.from_dict(
    {
        "col1": {
            0: 20.0,
            1: 19.0,
            2: 18.0,
            3: 17.0,
            4: 16.0,
            5: 15.0,
            6: 14.0,
            7: 13.0,
            8: 12.0,
            9: 11.0,
            10: 10.0,
            11: 9.0,
            12: 8.0,
            13: 7.0,
            14: 6.0,
            15: 5.0,
            16: 4.0,
            17: 3.0,
            18: 2.0,
            19: 1.0,
        },
        "end": {
            0: 0.0,
            1: 4.5,
            2: 5.0,
            3: 6.0,
            4: 7.0,
            5: 8.0,
            6: 9.0,
            7: 10.0,
            8: 11.0,
            9: 12.0,
            10: 13.0,
            11: 15.0,
            12: 16.0,
            13: 18.0,
            14: 20.0,
            15: 22.0,
            16: 24.0,
            17: 26.0,
            18: 28.0,
            19: 100.0,
        },
    }
)
print(df)
    col1     end
0  20.000   0.000
1  19.000   4.500
2  18.000   5.000
3  17.000   6.000
4  16.000   7.000
5  15.000   8.000
6  14.000   9.000
7  13.000  10.000
8  12.000  11.000
9  11.000  12.000
10 10.000  13.000
11  9.000  15.000
12  8.000  16.000
13  7.000  18.000
14  6.000  20.000
15  5.000  22.000
16  4.000  24.000
17  3.000  26.000
18  2.000  28.000
19  1.000 100.000

Create a start column for the intervals by shifting end down one row and filling the first row with zero:

df["start"] = df["end"].shift().fillna(0)
print(df.head())
    col1   end  start
0 20.000 0.000  0.000
1 19.000 4.500  0.000
2 18.000 5.000  4.500
3 17.000 6.000  5.000
4 16.000 7.000  6.000

Create an IntervalIndex from start and end and set it as the DataFrame's index:

idx = pd.IntervalIndex.from_arrays(df["start"], df["end"], closed="right")
df.index = idx
print(df.head())
             col1   end  start
(0.0, 0.0] 20.000 0.000  0.000
(0.0, 4.5] 19.000 4.500  0.000
(4.5, 5.0] 18.000 5.000  4.500
(5.0, 6.0] 17.000 6.000  5.000
(6.0, 7.0] 16.000 7.000  6.000
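A quick illustration of why this index helps (my own check, not part of the original answer): a scalar .loc lookup on an IntervalIndex returns the row whose interval contains the value.

print(df.loc[4.5, "col1"])    # 19.0, since 4.5 falls in the interval (0.0, 4.5]
print(df.loc[14.0, "col1"])   # 9.0, since 14.0 falls in (13.0, 15.0]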

Final result:

df1.loc[df1.dropna().index, "col1"] = df.loc[df1.loc[:, "s1"].dropna(), "col1"].values

print(df1)
      s1   col1
0   4.500 19.000
1  15.000  9.000
2  13.000 10.000
3  14.000  9.000
4  18.000  7.000
5  15.000  9.000
6  13.000 10.000
7  14.000  9.000
8     nan    nan
9     nan    nan
10    nan    nan
11 18.000  7.000
12    nan    nan
13    nan    nan
14    nan    nan

The full code without the print statements:

df["start"] = df["end"].shift().fillna(0)

idx = pd.IntervalIndex.from_arrays(df["start"], df["end"], closed="right")
df.index = idx

df1.loc[df1.dropna().index, "col1"] = df.loc[df1.loc[:, "s1"].dropna(), "col1"].values