I have two pandas Series:
s1 (which is a column in a dataframe) may have very many rows, and contains some NaN. s2 (which is a column in a dataframe df) has only 20 rows. The indexes of the two Series are different.
For each element of s1, I want to retrieve the value of col1 in df for which s2 is the first element less than or equal to that element of s1 (that is, the largest s2 that does not exceed it).

s1:
id
1 4.5
2 15.0
3 13.0
4 14.0
5 18.0
6 15.0
7 13.0
8 14.0
9 NaN
10 NaN
11 NaN
12 18.0
13 NaN
14 NaN
15 NaN
df:
col1 s2
0 20.0 0.0
1 19.0 4.5
2 18.0 5.0
3 17.0 6.0
4 16.0 7.0
5 15.0 8.0
6 14.0 9.0
7 13.0 10.0
8 12.0 11.0
9 11.0 12.0
10 10.0 13.0
11 9.0 15.0
12 8.0 16.0
13 7.0 18.0
14 6.0 20.0
15 5.0 22.0
16 4.0 24.0
17 3.0 26.0
18 2.0 28.0
19 1.0 100.0
That is, for id 1 of s1 we have the value 4.5; the matching row of df is the one with df.s2 = 4.5 (the largest s2 still less than or equal to 4.5), so I want to retrieve its col1 value, 19.
Here is my current solution. I would like to know whether there is a better way (faster, maybe a built-in pandas function?) to get the same result:

[min(df[df['s2'].le(element)].col1, default = np.NaN) for element in s1]
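(A minimal setup sketch, not part of the original question, that rebuilds s1 and df from the tables above so the snippets below can be run; it assumes the usual pandas/numpy imports and the names used in the question.)

import numpy as np
import pandas as pd

# s1: the 15 values listed above, indexed by id = 1..15, with NaN gaps
s1 = pd.Series(
    [4.5, 15.0, 13.0, 14.0, 18.0, 15.0, 13.0, 14.0,
     np.nan, np.nan, np.nan, 18.0, np.nan, np.nan, np.nan],
    index=pd.RangeIndex(1, 16, name="id"),
)

# df: col1 descending from 20 to 1, s2 ascending as shown above
df = pd.DataFrame({
    "col1": np.arange(20.0, 0.0, -1.0),
    "s2": [0.0, 4.5, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0,
           13.0, 15.0, 16.0, 18.0, 20.0, 22.0, 24.0, 26.0, 28.0, 100.0],
})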
Answer 0 (score: 3)
The idea is to use numpy: broadcast df['s2'] against every value of the Series into a 2D comparison matrix, pass the boolean mask to numpy.where to set NaN where the condition fails, and finally take the row-wise minimum with numpy.nanmin:
m = df['s2'].to_numpy() <= s1.to_numpy()[:, None]
a = np.nanmin(np.where(m, df['col1'], np.nan), axis=1)
print (a)
[19. 9. 10. 10. 7. 9. 10. 10. nan nan nan 7. nan nan nan]
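(Not part of the original answer: if the result should be aligned with s1's id index again, a small follow-up, assuming a and s1 as above, could be:)

res = pd.Series(a, index=s1.index)  # wrap the numpy result back into a Series keyed by id
print(res)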
Performance with the original sample:
In [63]: %%timeit
...: [min(df[df['s2'].le(element)].col1, default = np.NaN) for element in s1]
...:
...:
9.21 ms ± 305 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [64]: %%timeit
...: m = df['s2'].to_numpy() <= s1.to_numpy()[:, None]
...: a = np.nanmin(np.where(m, df['col1'], np.nan), axis=1)
72.4 µs ± 870 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
With data 100 times larger:
#2k rows
df = pd.concat([df] * 100, ignore_index=True)
#1.5k rows
s1 = pd.concat([s1] * 100, ignore_index=True)
In [68]: %%timeit
...: [min(df[df['s2'].le(element)].col1, default = np.NaN) for element in s1]
...:
...:
1.12 s ± 17.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [69]: %%timeit
...: m = df['s2'].to_numpy() <= s1.to_numpy()[:, None]
...: a = np.nanmin(np.where(m, df['col1'], np.nan), axis=1)
34.2 ms ± 305 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
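(An extra sketch, from neither answer: since s2 in the sample df is already sorted ascending and col1 falls as s2 rises, the same lookup can also be done with a binary search via numpy.searchsorted; this assumes s1 and df as defined in the question.)

import numpy as np  # assumed imported as np, matching the snippets above

# Position of the largest s2 <= each element of s1 (insert to the right, then step back one).
pos = np.searchsorted(df['s2'].to_numpy(), s1.to_numpy(), side='right') - 1

# Take col1 at those positions; elements with no match (pos < 0) or NaN inputs become NaN.
out = df['col1'].to_numpy()[np.clip(pos, 0, None)]
out = np.where((pos >= 0) & ~np.isnan(s1.to_numpy()), out, np.nan)
print(out)  # on this data it matches the nanmin/where result above

On sorted data this avoids building the full len(s1) x len(df) comparison matrix.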
Answer 1 (score: 1)
You can use an IntervalIndex.

First, get the data:
df1 = pd.DataFrame(
np.array(
[
4.5,
15.0,
13.0,
14.0,
18.0,
15.0,
13.0,
14.0,
np.nan,
np.nan,
np.nan,
18.0,
np.nan,
np.nan,
np.nan,
]
),
columns=["s1"],
)
print(df1)
s1
0 4.500
1 15.000
2 13.000
3 14.000
4 18.000
5 15.000
6 13.000
7 14.000
8 nan
9 nan
10 nan
11 18.000
12 nan
13 nan
14 nan
Then the lookup dataframe:
df = pd.DataFrame.from_dict(
{
"col1": {
0: 20.0,
1: 19.0,
2: 18.0,
3: 17.0,
4: 16.0,
5: 15.0,
6: 14.0,
7: 13.0,
8: 12.0,
9: 11.0,
10: 10.0,
11: 9.0,
12: 8.0,
13: 7.0,
14: 6.0,
15: 5.0,
16: 4.0,
17: 3.0,
18: 2.0,
19: 1.0,
},
"end": {
0: 0.0,
1: 4.5,
2: 5.0,
3: 6.0,
4: 7.0,
5: 8.0,
6: 9.0,
7: 10.0,
8: 11.0,
9: 12.0,
10: 13.0,
11: 15.0,
12: 16.0,
13: 18.0,
14: 20.0,
15: 22.0,
16: 24.0,
17: 26.0,
18: 28.0,
19: 100.0,
},
}
)
print(df)
col1 end
0 20.000 0.000
1 19.000 4.500
2 18.000 5.000
3 17.000 6.000
4 16.000 7.000
5 15.000 8.000
6 14.000 9.000
7 13.000 10.000
8 12.000 11.000
9 11.000 12.000
10 10.000 13.000
11 9.000 15.000
12 8.000 16.000
13 7.000 18.000
14 6.000 20.000
15 5.000 22.000
16 4.000 24.000
17 3.000 26.000
18 2.000 28.000
19 1.000 100.000
Create a start column for the intervals, filling the first row with zero.
df["start"] = df["end"].shift().fillna(0)
print(df.head())
col1 end start
0 20.000 0.000 0.000
1 19.000 4.500 0.000
2 18.000 5.000 4.500
3 17.000 6.000 5.000
4 16.000 7.000 6.000
Create an IntervalIndex and set it as the index.
idx = pd.IntervalIndex.from_arrays(df["start"], df["end"], closed="right")
df.index = idx
print(df.head())
col1 end start
(0.0, 0.0] 20.000 0.000 0.000
(0.0, 4.5] 19.000 4.500 0.000
(4.5, 5.0] 18.000 5.000 4.500
(5.0, 6.0] 17.000 6.000 5.000
(6.0, 7.0] 16.000 7.000 6.000
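(A quick illustration, not in the original answer, of what the IntervalIndex buys here: .loc with plain values selects the row whose interval contains each value, assuming df indexed as above.)

# 4.5 falls in (0.0, 4.5] -> col1 = 19.0; 14.0 falls in (13.0, 15.0] -> col1 = 9.0
print(df.loc[[4.5, 14.0], "col1"])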
The final result:
df1.loc[df1.dropna().index, "col1"] = df.loc[df1.loc[:, "s1"].dropna(), "col1"].values
print(df1)
s1 col1
0 4.500 19.000
1 15.000 9.000
2 13.000 10.000
3 14.000 9.000
4 18.000 7.000
5 15.000 9.000
6 13.000 10.000
7 14.000 9.000
8 nan nan
9 nan nan
10 nan nan
11 18.000 7.000
12 nan nan
13 nan nan
14 nan nan
The complete code, without the print output:
df["start"] = df["end"].shift().fillna(0)
idx = pd.IntervalIndex.from_arrays(df["start"], df["end"], closed="right")
df.index = idx
df1.loc[df1.dropna().index, "col1"] = df.loc[df1.loc[:, "s1"].dropna(), "col1"].values