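I'm not sure how to evaluate IF conditions on a DataFrame the way I would in standard Python code.
I have the following df:
The values in 'Label' correspond to the maximum value of each row. For example, the maximum of row (0) comes from NO_2.
I want to replace the values in 'Label' according to the following chart:
For example, for row (0) the 'Label' value corresponds to NO_2 as described above. Checking the chart, the NO_2 value 67.120003 falls in the 40-100 range, so I want to replace the 'Label' value of row (0) with 2.
Here is a snippet of the data (*Note: for the sake of the example I modified it a little so that the maximum varies across pollutants):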
date O_3 PM25 PM10 CO SO_2 NO_2 Label
0 2001-01-01 01:00:00 7.86 12.505127 32.349998 0.45 26.459999 67.120003 67.120003
1 2001-01-01 02:00:00 7.21 12.505127 40.709999 0.48 20.879999 70.620003 70.620003
2 2001-01-01 03:00:00 7.11 12.505127 50.209999 0.41 21.580000 72.629997 72.629997
3 2001-01-01 04:00:00 7.14 12.505127 54.880001 0.51 19.270000 75.029999 75.029999
4 2001-01-01 05:00:00 8.46 12.505127 42.340000 0.19 13.640000 66.589996 66.589996
5 2018-04-30 20:00:00 63.00 200.000000 2.000000 0.30 4.000000 58.000000 200.000000
6 2018-04-30 21:00:00 49.00 400.000000 5.000000 0.30 4.000000 65.000000 400.000000
7 2018-04-30 22:00:00 49.00 3.000000 125.000000 0.30 4.000000 58.000000 125.000000
8 2018-04-30 23:00:00 48.00 7.000000 7.000000 0.30 4.000000 52.000000 52.000000
9 2018-05-01 00:00:00 52.00 4.000000 6.000000 0.30 4.000000 43.000000 52.000000
So, to get the maximum value from each row, this is what I'm doing:
# Getting max values from each contaminant on each row
max_value = final_df.max(axis=1)
To get the column name of the maximum:
# Obtaining maximum value column name for each row
label_max_colName = final_df.eq(final_df.max(1),
                                axis=0).dot(final_df.columns)
I followed a solution proposed by @TH14:
for index, val in final_df[[x for x in final_df.columns if x != 'date']].iterrows():
    max_column = np.argmax(val)
    max_column_val = np.max(val)
    if max_column == "O_3":
        if max_column_val <= 80:
            final_df.at[index, 'Label'] = 1
        if 80 < max_column_val < 120:
            final_df.at[index, 'Label'] = 2
        if 120 < max_column_val < 180:
            final_df.at[index, 'Label'] = 3
        if 180 < max_column_val < 240:
            final_df.at[index, 'Label'] = 4
        if 240 < max_column_val < 600:
            final_df.at[index, 'Label'] = 5
    if max_column == "NO_2":
        if max_column_val <= 40:
            final_df.at[index, 'Label'] = 1
        if 40 < max_column_val < 100:
            final_df.at[index, 'Label'] = 2
        if 100 < max_column_val < 200:
            final_df.at[index, 'Label'] = 3
        if 200 < max_column_val < 400:
            final_df.at[index, 'Label'] = 4
        if 400 < max_column_val < 1000:
            final_df.at[index, 'Label'] = 5
    if max_column == "SO_2":
        if max_column_val <= 100:
            final_df.at[index, 'Label'] = 1
        if 40 < max_column_val < 200:
            final_df.at[index, 'Label'] = 2
        if 100 < max_column_val < 350:
            final_df.at[index, 'Label'] = 3
        if 200 < max_column_val < 500:
            final_df.at[index, 'Label'] = 4
        if 400 < max_column_val < 1250:
            final_df.at[index, 'Label'] = 5
    if max_column == "PM10":
        if max_column_val <= 20:
            final_df.at[index, 'Label'] = 1
        if 40 < max_column_val < 35:
            final_df.at[index, 'Label'] = 2
        if 100 < max_column_val < 50:
            final_df.at[index, 'Label'] = 3
        if 200 < max_column_val < 100:
            final_df.at[index, 'Label'] = 4
        if 400 < max_column_val < 1200:
            final_df.at[index, 'Label'] = 5
    if max_column == "PM25":
        if max_column_val <= 10:
            final_df.at[index, 'Label'] = 1
        if 40 < max_column_val < 20:
            final_df.at[index, 'Label'] = 2
        if 100 < max_column_val < 25:
            final_df.at[index, 'Label'] = 3
        if 200 < max_column_val < 50:
            final_df.at[index, 'Label'] = 4
        if 400 < max_column_val < 800:
            final_df.at[index, 'Label'] = 5
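But nothing seems to change in the 'Label' column.
Answer 0 (score: 1)
One approach is to define a function that takes the pollutant and the concentration level and returns the label number, like this: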
def get_pollution_label(pollutant, concentration):
    # Column names are case-sensitive, so match them exactly as in the DataFrame
    if pollutant == 'O_3':
        if 0 < concentration < 80:
            return 1
        .
        .
        .
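Once this function is written (it should just be a series of 'if-else' statements matching the table), you can iterate over the rows and do the following: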
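import pandas as pd

# Compare only the pollutant columns; 'date' and 'Label' are left out
pollutants = df.drop(columns=['date', 'Label'])
for idx, row in pollutants.iterrows():
    # idxmax() gives the column name of the row maximum, max() its value
    df.at[idx, 'Label'] = get_pollution_label(row.idxmax(), row.max())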
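The function above is left elided, so here is one possible way to fill it in without a long if-else chain. This is only a sketch: the `UPPER_BOUNDS` table is an assumption read off the upper limits in the question's code (not the original chart), ranges are assumed to be contiguous, and a value equal to a bound is assumed to belong to the lower category.

```python
from bisect import bisect_left

# ASSUMPTION: upper bound of categories 1..5 per pollutant, taken from the
# upper limits in the question's if-chain; verify against the actual chart.
UPPER_BOUNDS = {
    "O_3": [80, 120, 180, 240, 600],
    "NO_2": [40, 100, 200, 400, 1000],
    "SO_2": [100, 200, 350, 500, 1250],
    "PM10": [20, 35, 50, 100, 1200],
    "PM25": [10, 20, 25, 50, 800],
}


def get_pollution_label(pollutant, concentration):
    """Return the category 1..5, or None if the pollutant or value is not covered."""
    bounds = UPPER_BOUNDS.get(pollutant)
    if bounds is None:
        return None  # e.g. CO, which has no ranges in the question's code
    # Position of the first upper bound that is >= concentration
    position = bisect_left(bounds, concentration)
    return position + 1 if position < len(bounds) else None


print(get_pollution_label("NO_2", 67.120003))
```

Keeping the limits in one table means a wrong bound is fixed in a single place, and adding a pollutant does not require another block of conditions.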
Answer 1 (score: 1)
I only added the if-else conditions for two columns, but you get the idea.
# Start from the row-wise maximum of the numeric columns
final_df['Label'] = final_df.drop(columns='date').max(axis=1)
# Iterate over every column except 'date'
for index, val in final_df[[x for x in final_df.columns if x != 'date']].iterrows():
    max_column = val.idxmax()  # column name of the row maximum
    max_column_val = val.max()
    # The ranges are kept exactly as posted. Some of them are empty
    # (e.g. 40 < x < 35), so a value such as PM25 = 200 matches no branch
    # and keeps its raw maximum, as rows 5-7 of the output show.
    if max_column == "O_3":
        if max_column_val <= 80:
            final_df.at[index, 'Label'] = 1
        if 80 < max_column_val < 120:
            final_df.at[index, 'Label'] = 2
        if 120 < max_column_val < 180:
            final_df.at[index, 'Label'] = 3
        if 180 < max_column_val < 240:
            final_df.at[index, 'Label'] = 4
        if 240 < max_column_val < 600:
            final_df.at[index, 'Label'] = 5
    if max_column == "NO_2":
        if max_column_val <= 40:
            final_df.at[index, 'Label'] = 1
        if 40 < max_column_val < 100:
            final_df.at[index, 'Label'] = 2
        if 100 < max_column_val < 200:
            final_df.at[index, 'Label'] = 3
        if 200 < max_column_val < 400:
            final_df.at[index, 'Label'] = 4
        if 400 < max_column_val < 1000:
            final_df.at[index, 'Label'] = 5
    if max_column == "SO_2":
        if max_column_val <= 100:
            final_df.at[index, 'Label'] = 1
        if 40 < max_column_val < 200:
            final_df.at[index, 'Label'] = 2
        if 100 < max_column_val < 350:
            final_df.at[index, 'Label'] = 3
        if 200 < max_column_val < 500:
            final_df.at[index, 'Label'] = 4
        if 400 < max_column_val < 1250:
            final_df.at[index, 'Label'] = 5
    if max_column == "PM10":
        if max_column_val <= 20:
            final_df.at[index, 'Label'] = 1
        if 40 < max_column_val < 35:
            final_df.at[index, 'Label'] = 2
        if 100 < max_column_val < 50:
            final_df.at[index, 'Label'] = 3
        if 200 < max_column_val < 100:
            final_df.at[index, 'Label'] = 4
        if 400 < max_column_val < 1200:
            final_df.at[index, 'Label'] = 5
    if max_column == "PM25":
        if max_column_val <= 10:
            final_df.at[index, 'Label'] = 1
        if 40 < max_column_val < 20:
            final_df.at[index, 'Label'] = 2
        if 100 < max_column_val < 25:
            final_df.at[index, 'Label'] = 3
        if 200 < max_column_val < 50:
            final_df.at[index, 'Label'] = 4
        if 400 < max_column_val < 800:
            final_df.at[index, 'Label'] = 5
The error you get with orKach's solution comes from iterating over the date column as well.
Output:
date O_3 PM25 PM10 CO SO_2 NO_2 Label
0 2001-01-01 01:00:00 7.86 12.505127 32.349998 0.45 26.459999 67.120003 2.0
1 2001-01-01 02:00:00 7.21 12.505127 40.709999 0.48 20.879999 70.620003 2.0
2 2001-01-01 03:00:00 7.11 12.505127 50.209999 0.41 21.580000 72.629997 2.0
3 2001-01-01 04:00:00 7.14 12.505127 54.880001 0.51 19.270000 75.029999 2.0
4 2001-01-01 05:00:00 8.46 12.505127 42.340000 0.19 13.640000 66.589996 2.0
5 2018-04-30 20:00:00 63.00 200.000000 2.000000 0.30 4.000000 58.000000 200.0
6 2018-04-30 21:00:00 49.00 400.000000 5.000000 0.30 4.000000 65.000000 400.0
7 2018-04-30 22:00:00 49.00 3.000000 125.000000 0.30 4.000000 58.000000 125.0
8 2018-04-30 23:00:00 48.00 7.000000 7.000000 0.30 4.000000 52.000000 2.0
9 2018-05-01 00:00:00 52.00 4.000000 6.000000 0.30 4.000000 43.000000 1.0
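One detail worth checking if the 'Label' column stays unchanged when a loop like this runs: NumPy's `argmax` returns an integer position, so comparing its result with a column name such as "NO_2" is always False. As far as I know, recent pandas versions (1.0 and later) behave the same way when `np.argmax` is called on a Series, while older versions returned the label. `idxmax()` returns the label in either case. The small sketch below uses the values of row 0 from the sample data; the variable names are only illustrative.

```python
import numpy as np
import pandas as pd

# One row of pollutant readings (values from row 0 of the sample data)
val = pd.Series({"O_3": 7.86, "PM25": 12.505127, "PM10": 32.349998,
                 "CO": 0.45, "SO_2": 26.459999, "NO_2": 67.120003})

position = int(np.argmax(val.to_numpy()))  # integer position of the maximum
name = val.idxmax()                        # index label of the maximum

print(position, name)
```

Here `position` is 5 and `name` is "NO_2", so only the second can be compared against the pollutant names in the if-conditions.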
Answer 2 (score: 0)
Suppose you have both tables as DataFrames
data_df =
O_3 PM25 ... ...
0 7.86 ...
1 ... ...
2 ... ...
and
category_df =
1 2 3
O_3 80 120 ...
NO_2 40 ...
... ... ...
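You can identify the maximum value and the corresponding column with df.max(axis=1) and df.idxmax(axis=1) respectively. After import numpy as np, the np.where(condition) function can do the comparison against the category bounds, and the smallest matching position identifies the label, because it is the first category whose upper bound lies above the value.
# Row-wise maximum, indexed by the name of the column it came from
max_df = pd.DataFrame(data_df.max(axis=1).values, index=data_df.idxmax(axis=1))
labels = []
for idx, row in max_df.iterrows():
    # Positions of the upper bounds lying above the value; the first one is its category
    above = np.where(row.values[0] < category_df.loc[idx].values)[0]
    labels.append(category_df.columns[above.min()])
data_df["Label"] = pd.Series(labels, index=data_df.index)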
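The answer takes `category_df` as given, so here is a sketch of how such a table might be built and used without a Python loop. The bounds are an assumption taken from the upper limits in the question's code rather than from the chart, the categories are assumed to be numbered 1 to 5, and `label_from_table` is a hypothetical helper name.

```python
import pandas as pd

# ASSUMPTION: upper bound of categories 1..5 per pollutant, read off the
# upper limits in the question's if-chain; verify against the actual chart.
category_df = pd.DataFrame.from_dict(
    {
        "O_3": [80, 120, 180, 240, 600],
        "NO_2": [40, 100, 200, 400, 1000],
        "SO_2": [100, 200, 350, 500, 1250],
        "PM10": [20, 35, 50, 100, 1200],
        "PM25": [10, 20, 25, 50, 800],
    },
    orient="index",
    columns=[1, 2, 3, 4, 5],
)


def label_from_table(data_df, category_df):
    """Label each row by the category of its largest pollutant reading."""
    pollutants = data_df[category_df.index]          # only columns listed in the table
    top_name = pollutants.idxmax(axis=1)             # column holding the row maximum
    top_value = pollutants.max(axis=1).to_numpy()    # the maximum itself
    # Bounds of the winning pollutant for every row: shape (rows, categories)
    bounds = category_df.loc[top_name.to_numpy()].to_numpy()
    # Number of upper bounds the value exceeds; adding 1 gives the category
    labels = pd.Series((top_value[:, None] > bounds).sum(axis=1) + 1,
                       index=data_df.index)
    # Values beyond the last bound have no category and become NaN
    return labels.where(labels <= bounds.shape[1])


# Three rows modelled on rows 0, 5 and 9 of the sample data
data_df = pd.DataFrame({
    "O_3": [7.86, 63.0, 52.0],
    "PM25": [12.505127, 200.0, 4.0],
    "PM10": [32.349998, 2.0, 6.0],
    "SO_2": [26.459999, 4.0, 4.0],
    "NO_2": [67.120003, 58.0, 43.0],
})
data_df["Label"] = label_from_table(data_df, category_df)
print(data_df)
```

Columns without a row in the table (CO in the sample data) are ignored when the maximum is taken, which is a design choice of this sketch rather than something the chart prescribes.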