I have a 40,000-row dataset 'dd' that looks like this:
dd.head(21)
Out[64]:
MT MTBR Prd QPA RT Type WH
0 3 539 24Months 1 'NA' NR 188
1 3 51 24Months 4 'NA' NR 188
2 3 112 24Months 10 6 RP 188
3 3 385 24Months 2 7 RP 188
4 3 206 24Months 1 8 RP 188
5 3 349 24Months 19 'NA' NR 188
6 3 569 24Months 18 'NA' NR 188
7 3 66 24Months 20 8 RP 188
8 3 181 24Months 9 'NA' NR 188
9 3 149 24Months 2 'NA' NR 188
10 3 131 24Months 8 7 RP 188
11 3 289 24Months 11 3 RP 188
12 3 392 24Months 13 2 RP 188
13 3 303 24Months 9 'NA' NR 188
14 3 318 24Months 5 5 RP 188
15 3 103 24Months 9 6 RP 188
16 3 447 24Months 8 6 RP 188
17 3 600 24Months 19 'NA' NR 188
18 3 258 24Months 12 'NA' NR 188
19 3 164 24Months 13 'NA' NR 188
20 3 589 24Months 11 'NA' NR 188
I want to create another column, mean_v, in this dataset, using the following conditions:
for q,m,w,rt,mt in zip(dd.QPA,dd.MT,dd.WH,dd.RT,dd.MTBR):
    if dd.Type=='NR':
        dd.mean_v = q*m*w*24 / (mt*1000)
    elif dd.Type=='RP':
        dd.mean_v = q*m*w*rt / (mt*1000)
But I get the following error:
ValueError: The truth value of a Series is ambiguous.
Use a.empty, a.bool(), a.item(), a.any() or a.all().
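I would be very grateful if someone could help me correct the mistake in my code. Many thanks.
Answer 0 (score: 3)
In pandas it is best to avoid loops because they are slow, so it is better to use numpy.select: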
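import numpy as np
import pandas as pd

# first convert non-numeric values to NaN, then replace NaN with 0
dd.RT = pd.to_numeric(dd.RT, errors='coerce').fillna(0)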
m1 = dd.Type=='NR'
m2 = dd.Type=='RP'
s = dd.QPA *dd.MT * dd.WH
s1 = dd.MTBR * 1000
s2 = s * 24 / s1
s3 = s * dd.RT / s1
dd['mean_v'] = np.select([m1, m2], [s2, s3], default=np.nan)
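For readers wondering why the loop in the question fails: `dd.Type == 'NR'` compares the whole column at once and returns a boolean Series, which `if` cannot reduce to a single True or False. The same Series works fine as a row mask for `np.select`. A minimal sketch on a three-row frame (the data is made up for illustration, and the third row has a hypothetical Type that is neither NR nor RP):

```python
import numpy as np
import pandas as pd

# Hypothetical three-row sample with the same columns as dd
df = pd.DataFrame({
    "MT": [3, 3, 3],
    "MTBR": [100, 200, 300],
    "QPA": [1, 2, 3],
    "RT": [0.0, 6.0, 0.0],
    "Type": ["NR", "RP", "XX"],
    "WH": [188, 188, 188],
})

mask_nr = df["Type"] == "NR"   # boolean Series, one value per row
mask_rp = df["Type"] == "RP"

# Using the Series directly in `if` raises the error from the question
try:
    if mask_nr:
        pass
    raised = False
except ValueError:
    raised = True

base = df["QPA"] * df["MT"] * df["WH"]
denom = df["MTBR"] * 1000

# Rows matching neither mask receive the default (NaN)
df["mean_v"] = np.select(
    [mask_nr, mask_rp],
    [base * 24 / denom, base * df["RT"] / denom],
    default=np.nan,
)

print(raised)
print(df)
```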
But if the Type column contains only NR and RP values, use numpy.where:
dd['mean_v'] = np.where(m1, s2, s3)
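The practical difference between the two calls shows up only when Type holds a value other than NR or RP: `np.where` sends every non-NR row to the second branch, while `np.select` falls back to its default. A small sketch with placeholder values (the `"XX"` type and the constant Series are assumptions for illustration, standing in for s2 and s3):

```python
import numpy as np
import pandas as pd

# Hypothetical Type values, including one that is neither NR nor RP
types = pd.Series(["NR", "RP", "XX"])
nr_value = pd.Series([1.0, 1.0, 1.0])   # stands in for s2
rp_value = pd.Series([2.0, 2.0, 2.0])   # stands in for s3

m1 = types == "NR"
m2 = types == "RP"

with_where = np.where(m1, nr_value, rp_value)
with_select = np.select([m1, m2], [nr_value, rp_value], default=np.nan)

print(with_where)    # the "XX" row takes the RP value
print(with_select)   # the "XX" row is NaN
```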
Loop version (very slow):
dd.RT = pd.to_numeric(dd.RT, errors='coerce').fillna(0)
for i, x in dd.iterrows():
    if x['Type'] == 'NR':
        dd.loc[i, 'mean_v'] = x.QPA*x.MT*x.WH*24 / (x.MTBR*1000)
    elif x.Type == 'RP':
        dd.loc[i, 'mean_v'] = x.QPA*x.MT*x.WH*x.RT / (x.MTBR*1000)
    else:
        dd.loc[i, 'mean_v'] = np.nan
If RT should always be 24 when Type == 'NR':
s = pd.to_numeric(dd.RT, errors='coerce').fillna(24)
dd['mean_v'] = (dd.QPA * dd.MT * dd.WH * s) / (dd.MTBR * 1000)
print (dd)
MT MTBR Prd QPA RT Type WH mean_v
0 3 539 24Months 1 0.0 NR 188 0.025113
1 3 51 24Months 4 0.0 NR 188 1.061647
2 3 112 24Months 10 6.0 RP 188 0.302143
3 3 385 24Months 2 7.0 RP 188 0.020509
4 3 206 24Months 1 8.0 RP 188 0.021903
5 3 349 24Months 19 0.0 NR 188 0.736917
6 3 569 24Months 18 0.0 NR 188 0.428204
7 3 66 24Months 20 8.0 RP 188 1.367273
8 3 181 24Months 9 0.0 NR 188 0.673061
9 3 149 24Months 2 0.0 NR 188 0.181691
10 3 131 24Months 8 7.0 RP 188 0.241099
11 3 289 24Months 11 3.0 RP 188 0.064401
12 3 392 24Months 13 2.0 RP 188 0.037408
13 3 303 24Months 9 0.0 NR 188 0.402059
14 3 318 24Months 5 5.0 RP 188 0.044340
15 3 103 24Months 9 6.0 RP 188 0.295689
16 3 447 24Months 8 6.0 RP 188 0.060564
17 3 600 24Months 19 0.0 NR 188 0.428640
18 3 258 24Months 12 0.0 NR 188 0.629581
19 3 164 24Months 13 0.0 NR 188 1.072976
20 3 589 24Months 11 0.0 NR 188 0.252795
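The coerce-and-fill step these solutions rely on can be checked in isolation. A minimal sketch, assuming RT holds numbers mixed with the literal string `'NA'` (quotes included) as in the question; the sample values are illustrative:

```python
import pandas as pd

# RT as it appears in the question: numbers mixed with the literal string 'NA'
rt = pd.Series(["'NA'", 6, 7, "'NA'"])

# Entries that cannot be parsed as numbers become NaN
coerced = pd.to_numeric(rt, errors="coerce")
# NaN is then replaced by the constant used for NR rows
filled = coerced.fillna(24)

print(coerced.isna().tolist())
print(filled.tolist())
```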
Timings:
In [1]: %timeit jez1(dd)
14.1 ms ± 82 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [2]: %timeit jez2(dd)
8.97 ms ± 32 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [3]: %timeit jez3(dd)
25.1 s ± 769 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [4]: %timeit (jez4(dd))
2.63 ms ± 38.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [5]: %timeit (rsno(dd))
24.6 ms ± 267 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [6]: %timeit (rsno1(dd))
1.62 s ± 20.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
dd = pd.concat([dd] * 2000, ignore_index=True)
#print (dd)
def jez1(dd):
    dd.RT = pd.to_numeric(dd.RT, errors='coerce').fillna(0)
    m1 = dd.Type=='NR'
    m2 = dd.Type=='RP'
    s = dd.QPA *dd.MT * dd.WH
    s1 = dd.MTBR * 1000
    s2 = s * 24 / s1
    s3 = s * dd.RT / s1
    dd['mean_v'] = np.select([m1, m2], [s2, s3], default=np.nan)
    return dd

def jez2(dd):
    dd.RT = pd.to_numeric(dd.RT, errors='coerce').fillna(0)
    m1 = dd.Type=='NR'
    s = dd.QPA *dd.MT * dd.WH
    s1 = dd.MTBR * 1000
    s2 = s * 24 / s1
    s3 = s * dd.RT / s1
    dd['mean_v'] = np.where(m1, s2, s3)
    return dd

def jez3(dd):
    dd.RT = pd.to_numeric(dd.RT, errors='coerce').fillna(0)
    for i, x in dd.iterrows():
        if x['Type'] == 'NR':
            dd.loc[i, 'mean_v'] = x.QPA*x.MT*x.WH*24 / (x.MTBR*1000)
        elif x.Type == 'RP':
            dd.loc[i, 'mean_v'] = x.QPA*x.MT*x.WH*x.RT / (x.MTBR*1000)
        else:
            dd.loc[i, 'mean_v'] = np.nan
    return dd

def jez4(dd):
    dd.RT = pd.to_numeric(dd.RT, errors='coerce').fillna(24)
    dd['mean_v'] = (dd.QPA * dd.MT * dd.WH * dd.RT) / (dd.MTBR * 1000)
    return dd

def rsno(dd):
    dd['RTT'] = list(map(lambda x: int(x) if x != "'NA'" else 24, dd.RT.tolist()))
    dd['mean_v'] = (dd.QPA * dd.MT * dd.WH * dd.RTT) / (dd.MTBR * 1000)
    return dd

def rsno1(dd):
    dd['RTT'] = dd.apply(lambda row: int(row.RT) if row.RT != "'NA'" else 24, axis=1)
    dd['mean_v'] = (dd.QPA * dd.MT * dd.WH * dd.RTT) / (dd.MTBR * 1000)
    return dd
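One caveat when reproducing these timings: each function above assigns to dd in place, so after the first call dd.RT is already numeric and later calls run on the converted data. A minimal sketch, assuming you want every run to start from the raw column, is to pass a copy. `time_on_copy` and `fill_and_compute` are hypothetical names, not part of the original answer, and the measured time then includes the cost of the copy.

```python
import timeit

import pandas as pd

def fill_and_compute(dd):
    # Same logic as jez4 above
    dd.RT = pd.to_numeric(dd.RT, errors='coerce').fillna(24)
    dd['mean_v'] = (dd.QPA * dd.MT * dd.WH * dd.RT) / (dd.MTBR * 1000)
    return dd

def time_on_copy(func, frame, number=10):
    # Hypothetical helper: average seconds per call, each call on a fresh copy
    return timeit.timeit(lambda: func(frame.copy()), number=number) / number

# Small illustrative frame; RT mixes numbers with the literal string 'NA'
raw = pd.DataFrame({
    "MT": [3, 3], "MTBR": [539, 112], "QPA": [1, 10],
    "RT": ["'NA'", 6], "Type": ["NR", "RP"], "WH": [188, 188],
})

seconds = time_on_copy(fill_and_compute, raw)
print(seconds)
print(raw.RT.tolist())   # the raw column is left untouched
```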
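Answer 1 (score: 1)
You use dd.RT in the calculation, or 24 when dd.RT is 'NA'. So you can create a new column with the following two lines of code and use it in the calculation: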
dd['RTT'] = list(map(lambda x: int(x) if x != "'NA'" else 24, dd.RT.tolist()))
dd['mean_v'] = (dd.QPA * dd.MT * dd.WH * dd.RTT) / (dd.MTBR * 1000)
print(dd)
Output:
MT MTBR Prd QPA RT Type WH RTT mean_v
0 3 539 24Months 1 'NA' NR 188 24 0.025113
1 3 51 24Months 4 'NA' NR 188 24 1.061647
2 3 112 24Months 10 6 RP 188 6 0.302143
3 3 385 24Months 2 7 RP 188 7 0.020509
4 3 206 24Months 1 8 RP 188 8 0.021903
5 3 349 24Months 19 'NA' NR 188 24 0.736917
6 3 569 24Months 18 'NA' NR 188 24 0.428204
7 3 66 24Months 20 8 RP 188 8 1.367273
8 3 181 24Months 9 'NA' NR 188 24 0.673061
9 3 149 24Months 2 'NA' NR 188 24 0.181691
10 3 131 24Months 8 7 RP 188 7 0.241099
11 3 289 24Months 11 3 RP 188 3 0.064401
12 3 392 24Months 13 2 RP 188 2 0.037408
13 3 303 24Months 9 'NA' NR 188 24 0.402059
14 3 318 24Months 5 5 RP 188 5 0.044340
15 3 103 24Months 9 6 RP 188 6 0.295689
16 3 447 24Months 8 6 RP 188 6 0.060564
17 3 600 24Months 19 'NA' NR 188 24 0.428640
18 3 258 24Months 12 'NA' NR 188 24 0.629581
19 3 164 24Months 13 'NA' NR 188 24 1.072976
20 3 589 24Months 11 'NA' NR 188 24 0.252795
Another option is to use apply:
dd['RTT'] = dd.apply(lambda row: int(row.RT) if row.RT != "'NA'" else 24 , axis=1)
dd['mean_v'] = (dd.QPA * dd.MT * dd.WH * dd.RTT) / (dd.MTBR * 1000)