Question

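I have data in a pandas DataFrame that looks like this:

| id | entity | name | value | location

where id is an integer, entity is an integer, name is a string, value is an integer, and location is a string (for example US, CA, UK).

Now I want to add a new column, "flag", to this DataFrame, with values assigned as follows: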

# iterrows() yields (index, row) pairs, so unpack both
for idx, d in df.iterrows():
    if d.entity == 10 and d.value != 1000 and d.location == "CA":
        df.loc[idx, "flag"] = "A"
    elif d.entity != 10 and d.entity != 0 and d.value == 1000 and d.location == "US":
        df.loc[idx, "flag"] = "C"
    elif d.entity == 0 and d.value == 1000 and d.location == "US":
        df.loc[idx, "flag"] = "B"
    else:
        print("Different case")

Is it possible to speed this up by using built-in functions instead of the for loop?

Answer 1

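With np.select you pass a list of conditions and a matching list of choices; each row takes the choice of the first condition it satisfies, and you can specify a default value for rows that satisfy none of them.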
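As a minimal sketch of this approach: the sample rows below are made up for illustration, and only the column names and the three conditions come from the question.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; column names follow the question.
df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "entity": [10, 5, 0, 10],
    "name": ["a", "b", "c", "d"],
    "value": [500, 1000, 1000, 1000],
    "location": ["CA", "US", "US", "CA"],
})

# One boolean Series per branch, in the same order as the if/elif chain.
conditions = [
    (df["entity"] == 10) & (df["value"] != 1000) & (df["location"] == "CA"),
    (df["entity"] != 10) & (df["entity"] != 0) & (df["value"] == 1000) & (df["location"] == "US"),
    (df["entity"] == 0) & (df["value"] == 1000) & (df["location"] == "US"),
]
choices = ["A", "C", "B"]

# The first matching condition wins; rows matching none get the default.
df["flag"] = np.select(conditions, choices, default="Different case")
print(df[["entity", "value", "location", "flag"]])
```

The order of `conditions` matters only when a row can satisfy more than one of them; here the three conditions are mutually exclusive.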

Answer 2

Wrap each comparison in parentheses and replace `and` with the bitwise `&`, then use numpy.select:

import numpy as np

# Each entry is a boolean Series computed over the whole DataFrame
m = [
    (df.entity == 10) & (df.value != 1000) & (df.location == 'CA'),
    (df.entity != 10) & (df.entity != 0) & (df.value == 1000) & (df.location == 'US'),
    (df.entity == 0) & (df.value == 1000) & (df.location == 'US')
]

df['flag'] = np.select(m, ["A", "C", "B"], default="Different case")

Answer 3

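You wrote "find all columns that fulfill a set of conditions", but your code shows that you are actually trying to add a new column whose value in each row is computed from the values of the other columns in that same row.

If that is indeed the case, you can use df.apply, giving it a function that computes the value for a single row: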

def flag_value(row):
    if row.entity == 10 and row.value != 1000 and row.location == "CA":
        return "A"
    elif row.entity != 10 and row.entity != 0 and row.value == 1000 and row.location == "US":
        return "C"
    elif row.entity == 0 and row.value == 1000 and row.location == "US":
        return "B"
    else:
        return "Different case"

df['flag'] = df.apply(flag_value, axis=1)

See this related question for more information.
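Since the question is about speed, one rough way to compare the row-wise df.apply approach with a vectorized one is to time both on synthetic data. The sketch below is only an illustration: the random data, the frame size, and the helper names `flag_apply` and `flag_select` are assumptions, and actual timings depend on the machine and the pandas version.

```python
import timeit

import numpy as np
import pandas as pd

# Synthetic data, made up for illustration; column names follow the question.
rng = np.random.default_rng(0)
n = 10_000
df = pd.DataFrame({
    "entity": rng.choice([0, 5, 10], size=n),
    "value": rng.choice([500, 1000], size=n),
    "location": rng.choice(["CA", "US", "UK"], size=n),
})

def flag_value(row):
    """Row-wise rule, as in the df.apply answer."""
    if row.entity == 10 and row.value != 1000 and row.location == "CA":
        return "A"
    elif row.entity != 10 and row.entity != 0 and row.value == 1000 and row.location == "US":
        return "C"
    elif row.entity == 0 and row.value == 1000 and row.location == "US":
        return "B"
    return "Different case"

def flag_apply(frame):
    # Calls flag_value once per row (a Python-level loop under the hood).
    return frame.apply(flag_value, axis=1)

def flag_select(frame):
    # Same rule expressed as whole-column boolean masks.
    conditions = [
        (frame.entity == 10) & (frame.value != 1000) & (frame.location == "CA"),
        (frame.entity != 10) & (frame.entity != 0) & (frame.value == 1000) & (frame.location == "US"),
        (frame.entity == 0) & (frame.value == 1000) & (frame.location == "US"),
    ]
    return pd.Series(
        np.select(conditions, ["A", "C", "B"], default="Different case"),
        index=frame.index,
    )

# Both approaches should assign the same flags.
same = flag_apply(df).tolist() == flag_select(df).tolist()

t_apply = timeit.timeit(lambda: flag_apply(df), number=1)
t_select = timeit.timeit(lambda: flag_select(df), number=1)
print(f"same result: {same}, apply: {t_apply:.4f}s, select: {t_select:.4f}s")
```

df.apply with axis=1 is usually the more readable option for complicated per-row logic, while the mask-based version typically scales better to large frames.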

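If you really did mean to find everything that matches a specified condition, the usual way to do this with a Pandas DataFrame is boolean indexing with df.loc: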

# Each comparison needs its own parentheses because & binds tighter than == and !=
only_a_cases = df.loc[(df.entity == 10) & (df.value != 1000) & (df.location == "CA")]
# or:
only_a_cases = df.loc[lambda df: (df.entity == 10) & (df.value != 1000) & (df.location == "CA")]
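The same filter can be sketched on a small, made-up frame. Each comparison is wrapped in parentheses because `&` has higher precedence than `==` and `!=` in Python; the sample rows are hypothetical.

```python
import pandas as pd

# Hypothetical sample data; column names follow the question.
df = pd.DataFrame({
    "entity": [10, 10, 0],
    "value": [500, 1000, 1000],
    "location": ["CA", "CA", "US"],
})

# Build the boolean mask first, then select the matching rows.
mask = (df["entity"] == 10) & (df["value"] != 1000) & (df["location"] == "CA")
only_a_cases = df.loc[mask]
print(only_a_cases)
```

Naming the mask separately also makes it easy to reuse, for example `df.loc[~mask]` for all the rows that do not match.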

Speed up Pandas: find all columns that fulfill a set of conditions