Question

I am using the following script to create a new DataFrame column whose values depend on whether a regex matches the values in an existing column:

import pandas as pd 

#Creation of the dataframe
data = [['Value One', 10], ['Value Six', 15],['Value Six', 25], ['Value * Three', 14],['Other', 14]] 

df = pd.DataFrame(data, columns = ['ColumnA', 'columnB'])

#Create new column with the values depending on the values of an existing column 
df.loc[df['ColumnA'].str.match(r"Value One|Value Two|Value \* Three"), 'Category'] = 'One'
df.loc[df['ColumnA'].str.match(r"Value Four|Value Six|Value \* Five"), 'Category'] = 'Two'

#Replace the nulls - the ones that didn't have a match above - with a value
df['Category'] = df['Category'].fillna('Not Specified')

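The code works fine, but my goal is to optimize it so that it can be used in more complex scenarios. I would like to avoid having many df.loc lines, and I am wondering whether there is a way to automate this with, for example, a dictionary.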

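First, a separate list for the values that need to be matched (str.match could be replaced with str.contains, in which case I would like to use the regex in the brackets).
Second, a separate list for the values that will be added to the new column.
Third (this is what I have in mind, but feel free to suggest any overall solution), a loop that uses df.loc and applies the lists above. I guess this might require building a dictionary from the two kinds of lists.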

Answer 1

I'm not sure whether this is useful or whether you already know it, but you could use np.vectorize:

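import re
import numpy as np

def regexr(x):
    # re.match anchors at the start of the string, like Series.str.match
    if re.match(r"Value One|Value Two|Value \* Three", x):
        return "One"
    elif re.match(r"Value Four|Value Six|Value \* Five", x):
        return "Two"
    else:
        return "Not Specified"

# otypes=[object] keeps the returned labels as plain Python strings
regexr = np.vectorize(regexr, otypes=[object])

# Write the labels to a new column instead of overwriting ColumnA
df['Category'] = regexr(df['ColumnA'].values)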
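
For the dictionary-driven version the question asks about, a minimal sketch could keep the labels and patterns in one dict and loop over it with df.loc. The name category_patterns is made up for illustration, and the default label 'Not Specified' is taken from the question's code. Keep in mind that np.vectorize is essentially a Python-level loop, so it is mainly a convenience and not a speed-up over str.match.

```python
import pandas as pd

# Same sample data as in the question.
data = [['Value One', 10], ['Value Six', 15], ['Value Six', 25],
        ['Value * Three', 14], ['Other', 14]]
df = pd.DataFrame(data, columns=['ColumnA', 'columnB'])

# Hypothetical mapping (name chosen for illustration): category label -> regex.
category_patterns = {
    'One': r'Value One|Value Two|Value \* Three',
    'Two': r'Value Four|Value Six|Value \* Five',
}

# Start from the default label, then overwrite rows whose ColumnA matches.
df['Category'] = 'Not Specified'
for label, pattern in category_patterns.items():
    mask = df['ColumnA'].str.match(pattern)
    df.loc[mask, 'Category'] = label
```

Adding a category then only means adding an entry to the dict. If two patterns can match the same row, the entry that comes later in the dict wins, because each pass overwrites the earlier label.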
