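I have data in a pandas DataFrame that looks like this:
| id | entity | name | value | location |
Here id is an integer, entity is an integer, name is a string, value is an integer, and location is a string (for example US, CA, UK, etc.).
Now I want to add a new column, "flag", to this DataFrame, with its values assigned as follows: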
for i, d in df.iterrows():  # iterrows yields (index, row) pairs
    if d.entity == 10 and d.value != 1000 and d.location == "CA":
        df.loc[i, "flag"] = "A"
    elif d.entity != 10 and d.entity != 0 and d.value == 1000 and d.location == "US":
        df.loc[i, "flag"] = "C"
    elif d.entity == 0 and d.value == 1000 and d.location == "US":
        df.loc[i, "flag"] = "B"
    else:
        print("Different case")
Is it possible to speed this up and use some built-in functions instead of a for loop?
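Answer 0 (score: 3)
Use np.select. It takes a list of conditions and a matching list of choices, picks the choice for the first condition that holds in each row, and lets you specify a default value for rows where no condition is met.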
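As a rough illustration of how np.select pairs conditions with choices, here is a minimal sketch on a plain NumPy array. The array `x`, the thresholds, and the labels are made up for the example and do not come from the question.

```python
import numpy as np

# Hypothetical input values, chosen only for illustration
x = np.array([1, 5, 12, 20])

# Conditions are checked in order; the first one that is True wins
conditions = [x < 3, x < 10, x < 15]
choices = ["low", "mid", "high"]

# Elements matching no condition receive the default
labels = np.select(conditions, choices, default="other")
print(labels.tolist())
```

Note that 5 also satisfies `x < 15`, but it is labelled "mid" because `x < 10` comes earlier in the list, so the order of the conditions matters when they overlap.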
Answer 1 (score: 3)
Wrap each comparison in parentheses () and replace and with the bitwise & so the conditions work with numpy.select:
import numpy as np

# One boolean mask per case, evaluated on whole columns of df
m = [
    (df.entity == 10) & (df.value != 1000) & (df.location == 'CA'),
    (df.entity != 10) & (df.entity != 0) & (df.value == 1000) & (df.location == 'US'),
    (df.entity == 0) & (df.value == 1000) & (df.location == 'US')
]
df['flag'] = np.select(m, ["A", "C", "B"], default="Different case")
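To try this end to end, the following sketch applies the same three conditions to a small DataFrame. The sample rows are invented for illustration (the question does not show real data); the column names follow the question.

```python
import numpy as np
import pandas as pd

# Hypothetical sample rows; only the column names come from the question
df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "entity": [10, 5, 0, 10],
    "name": ["a", "b", "c", "d"],
    "value": [500, 1000, 1000, 1000],
    "location": ["CA", "US", "US", "CA"],
})

# One boolean mask per flag, in the same order as the choices below
conditions = [
    (df["entity"] == 10) & (df["value"] != 1000) & (df["location"] == "CA"),
    (df["entity"] != 10) & (df["entity"] != 0) & (df["value"] == 1000) & (df["location"] == "US"),
    (df["entity"] == 0) & (df["value"] == 1000) & (df["location"] == "US"),
]
df["flag"] = np.select(conditions, ["A", "C", "B"], default="Different case")
print(df[["id", "flag"]])
```

The last sample row (entity 10, value 1000, location CA) matches none of the masks, so it falls through to the default.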
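Answer 2 (score: 0)
You wrote "find all columns that satisfy a set of conditions", but your code shows that you are actually trying to add a new column whose value in each row is computed from the values of the other columns in the same row.
If that is the case, you can use df.apply and give it a function that computes the value for a single row: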
def flag_value(row):
    # row is a single DataFrame row, so plain and/or work here
    if row.entity == 10 and row.value != 1000 and row.location == "CA":
        return "A"
    elif row.entity != 10 and row.entity != 0 and row.value == 1000 and row.location == "US":
        return "C"
    elif row.entity == 0 and row.value == 1000 and row.location == "US":
        return "B"
    else:
        return "Different case"

df['flag'] = df.apply(flag_value, axis=1)
See this related question for more information.
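If you really do want to find all rows that match the specified conditions, the usual way to do this with a pandas DataFrame is to use df.loc with boolean indexing: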
# Each comparison needs its own parentheses because & binds tighter than == and !=
only_a_cases = df.loc[(df.entity == 10) & (df.value != 1000) & (df.location == "CA")]
# or:
only_a_cases = df.loc[lambda df: (df.entity == 10) & (df.value != 1000) & (df.location == "CA")]
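The parentheses around each comparison matter because Python's `&` binds more tightly than `==` and `!=`. The sketch below, on made-up sample rows, builds the mask with explicit parentheses and then shows that the unparenthesized form raises instead of filtering. The variable `unparenthesized_ok` is introduced here only for the demonstration.

```python
import pandas as pd

# Hypothetical sample rows; only the column names come from the question
df = pd.DataFrame({
    "entity": [10, 10, 0],
    "value": [500, 1000, 1000],
    "location": ["CA", "CA", "US"],
})

# Parenthesized comparisons combine element-wise into one boolean mask
mask = (df["entity"] == 10) & (df["value"] != 1000) & (df["location"] == "CA")
only_a_cases = df.loc[mask]
print(only_a_cases)

# Without parentheses, `10 & df["value"]` is evaluated first and the rest
# becomes a chained comparison, which needs the truth value of a Series
try:
    df.loc[df["entity"] == 10 & df["value"] != 1000]
    unparenthesized_ok = True
except (ValueError, TypeError):
    unparenthesized_ok = False
print(unparenthesized_ok)
```

Naming the mask before passing it to df.loc is optional, but it keeps long conditions readable and makes the mask reusable.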