我尝试过与其他问题不同的方法,但仍然无法找到解决问题的正确答案。关键的一点是,如果这个人被算作西班牙裔,他们就不能被视为其他任何东西。即使他们有一个" 1"在另一个种族专栏中,他们仍被视为西班牙裔,而非两个或更多种族。类似地,如果所有ERI列的总和大于1,则它们被计为两个或更多种族,并且不能被计为独特的种族(接受西班牙裔)。希望这是有道理的。任何帮助将不胜感激。
它几乎就像在每行中执行for循环一样,如果每条记录符合条件,它们将被添加到一个列表中并从原始列表中删除。
从下面的数据框中,我需要根据以下内容计算新列:
========================= CRITERIA ===================== ==========
IF [ERI_Hispanic] = 1 THEN RETURN “Hispanic”
ELSE IF SUM([ERI_AmerInd_AKNatv] + [ERI_Asian] + [ERI_Black_Afr.Amer] + [ERI_HI_PacIsl] + [ERI_White]) > 1 THEN RETURN “Two or More”
ELSE IF [ERI_AmerInd_AKNatv] = 1 THEN RETURN “A/I AK Native”
ELSE IF [ERI_Asian] = 1 THEN RETURN “Asian”
ELSE IF [ERI_Black_Afr.Amer] = 1 THEN RETURN “Black/AA”
ELSE IF [ERI_HI_PacIsl] = 1 THEN RETURN “Haw/Pac Isl.”
ELSE IF [ERI_White] = 1 THEN RETURN “White”
评论:如果西班牙裔美国人的ERI标志为真(1),那么员工被归类为“西班牙裔”
评论:如果超过1个非西班牙语ERI标志为真,则返回“两个或更多”
====================== DATAFRAME ======================== ===
lname fname rno_cd eri_afr_amer eri_asian eri_hawaiian eri_hispanic eri_nat_amer eri_white rno_defined
0 MOST JEFF E 0 0 0 0 0 1 White
1 CRUISE TOM E 0 0 0 1 0 0 White
2 DEPP JOHNNY 0 0 0 0 0 1 Unknown
3 DICAP LEO 0 0 0 0 0 1 Unknown
4 BRANDO MARLON E 0 0 0 0 0 0 White
5 HANKS TOM 0 0 0 0 0 1 Unknown
6 DENIRO ROBERT E 0 1 0 0 0 1 White
7 PACINO AL E 0 0 0 0 0 1 White
8 WILLIAMS ROBIN E 0 0 1 0 0 0 White
9 EASTWOOD CLINT E 0 0 0 0 0 1 White
答案 0 :(得分:259)
好的,这两个步骤 - 首先是编写一个执行你想要的翻译的函数 - 我已经根据你的伪代码一起举了一个例子:
def label_race (row):
if row['eri_hispanic'] == 1 :
return 'Hispanic'
if row['eri_afr_amer'] + row['eri_asian'] + row['eri_hawaiian'] + row['eri_nat_amer'] + row['eri_white'] > 1 :
return 'Two Or More'
if row['eri_nat_amer'] == 1 :
return 'A/I AK Native'
if row['eri_asian'] == 1:
return 'Asian'
if row['eri_afr_amer'] == 1:
return 'Black/AA'
if row['eri_hawaiian'] == 1:
return 'Haw/Pac Isl.'
if row['eri_white'] == 1:
return 'White'
return 'Other'
你可能想要讨论这个问题,但它似乎可以解决这个问题 - 注意进入函数的参数被认为是一个标记为"行"的系列对象。
接下来,使用pandas中的apply函数来应用函数 - 例如
df.apply (lambda row: label_race(row), axis=1)
注意axis = 1说明符,这意味着应用程序是在一行而不是列级别完成的。结果如下:
0 White
1 Hispanic
2 White
3 White
4 Other
5 White
6 Two Or More
7 White
8 Haw/Pac Isl.
9 White
如果您对这些结果感到满意,请再次运行,将结果保存到原始数据框中的新列中。
df['race_label'] = df.apply (lambda row: label_race(row), axis=1)
结果数据框如下所示(向右滚动以查看新列):
lname fname rno_cd eri_afr_amer eri_asian eri_hawaiian eri_hispanic eri_nat_amer eri_white rno_defined race_label
0 MOST JEFF E 0 0 0 0 0 1 White White
1 CRUISE TOM E 0 0 0 1 0 0 White Hispanic
2 DEPP JOHNNY NaN 0 0 0 0 0 1 Unknown White
3 DICAP LEO NaN 0 0 0 0 0 1 Unknown White
4 BRANDO MARLON E 0 0 0 0 0 0 White Other
5 HANKS TOM NaN 0 0 0 0 0 1 Unknown White
6 DENIRO ROBERT E 0 1 0 0 0 1 White Two Or More
7 PACINO AL E 0 0 0 0 0 1 White White
8 WILLIAMS ROBIN E 0 0 1 0 0 0 White Haw/Pac Isl.
9 EASTWOOD CLINT E 0 0 0 0 0 1 White White
答案 1 :(得分:133)
由于这是“其他人的pandas新专栏”的第一个Google结果,这里有一个简单的例子:
df = df.assign(**{'some column name': col.values})
如果你得到<!-- Base application theme. -->
<style name="AppTheme" parent="Theme.AppCompat.Light.DarkActionBar">
<item name="colorPrimary">@color/primary_color</item>
<item name="colorPrimaryDark">@color/primary_dark_color</item>
</style>
<!-- Style for a category of vocabulary words -->
<style name="CategoryStyle">
<item name="android:layout_width">match_parent</item>
<item name="android:layout_height">@dimens/list_item_height</item>
<item name="android:gravity">center_vertical</item>
<item name="android:padding">16dp</item>
<item name="android:textColor">@android:color/white</item>
<item name="android:textStyle">bold</item>
<item name="android:textAppearance">?android:textAppearanceMedium</item>
</style>
,你也可以这样做:
styles.xml
来源:https://stackoverflow.com/a/12555510/243392
如果你的列名包含空格,你可以使用如下语法:
Cannot resolve symbol'android:layout_width'
答案 2 :(得分:21)
.apply()
接受函数作为第一个参数;传递label_race
函数,如下:
df['race_label'] = df.apply(label_race, axis=1)
您不需要使用lambda函数来传递函数。
答案 3 :(得分:15)
上面的答案是完全正确的,但是存在矢量化解决方案,形式为numpy.select
。这样,您可以定义条件,然后为这些条件定义输出,这比使用apply
更为有效:
首先,定义条件:
conditions = [
df['eri_hispanic'] == 1,
df[['eri_afr_amer', 'eri_asian', 'eri_hawaiian', 'eri_nat_amer', 'eri_white']].sum(1).gt(1),
df['eri_nat_amer'] == 1,
df['eri_asian'] == 1,
df['eri_afr_amer'] == 1,
df['eri_hawaiian'] == 1,
df['eri_white'] == 1,
]
现在,定义相应的输出:
outputs = [
'Hispanic', 'Two Or More', 'A/I AK Native', 'Asian', 'Black/AA', 'Haw/Pac Isl.', 'White'
]
最后,使用numpy.select
:
res = np.select(conditions, outputs, 'Other')
pd.Series(res)
0 White
1 Hispanic
2 White
3 White
4 Other
5 White
6 Two Or More
7 White
8 Haw/Pac Isl.
9 White
dtype: object
为什么在numpy.select
上使用apply
?以下是一些性能检查:
df = pd.concat([df]*1000)
In [42]: %timeit df.apply(lambda row: label_race(row), axis=1)
1.07 s ± 4.16 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [44]: %%timeit
...: conditions = [
...: df['eri_hispanic'] == 1,
...: df[['eri_afr_amer', 'eri_asian', 'eri_hawaiian', 'eri_nat_amer', 'eri_white']].sum(1).gt(1),
...: df['eri_nat_amer'] == 1,
...: df['eri_asian'] == 1,
...: df['eri_afr_amer'] == 1,
...: df['eri_hawaiian'] == 1,
...: df['eri_white'] == 1,
...: ]
...:
...: outputs = [
...: 'Hispanic', 'Two Or More', 'A/I AK Native', 'Asian', 'Black/AA', 'Haw/Pac Isl.', 'White'
...: ]
...:
...: np.select(conditions, outputs, 'Other')
...:
...:
3.09 ms ± 17 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
使用numpy.select
可以大大改善我们的性能,并且差异只会随着数据的增长而增加。
答案 4 :(得分:5)
尝试一下
df.loc[df['eri_white']==1,'race_label'] = 'White'
df.loc[df['eri_hawaiian']==1,'race_label'] = 'Haw/Pac Isl.'
df.loc[df['eri_afr_amer']==1,'race_label'] = 'Black/AA'
df.loc[df['eri_asian']==1,'race_label'] = 'Asian'
df.loc[df['eri_nat_amer']==1,'race_label'] = 'A/I AK Native'
df.loc[(df['eri_afr_amer'] + df['eri_asian'] + df['eri_hawaiian'] + df['eri_nat_amer'] + df['eri_white']) > 1,'race_label'] = 'Two Or More'
df.loc[df['eri_hispanic']==1,'race_label'] = 'Hispanic'
df['race_label'].fillna('Other', inplace=True)
O / P:
lname fname rno_cd eri_afr_amer eri_asian eri_hawaiian \
0 MOST JEFF E 0 0 0
1 CRUISE TOM E 0 0 0
2 DEPP JOHNNY NaN 0 0 0
3 DICAP LEO NaN 0 0 0
4 BRANDO MARLON E 0 0 0
5 HANKS TOM NaN 0 0 0
6 DENIRO ROBERT E 0 1 0
7 PACINO AL E 0 0 0
8 WILLIAMS ROBIN E 0 0 1
9 EASTWOOD CLINT E 0 0 0
eri_hispanic eri_nat_amer eri_white rno_defined race_label
0 0 0 1 White White
1 1 0 0 White Hispanic
2 0 0 1 Unknown White
3 0 0 1 Unknown White
4 0 0 0 White Other
5 0 0 1 Unknown White
6 0 0 1 White Two Or More
7 0 0 1 White White
8 0 0 0 White Haw/Pac Isl.
9 0 0 1 White White
使用.loc
代替apply
。
它改善了向量化。
.loc
以简单的方式工作,根据条件屏蔽行,将值应用于冻结行。
有关更多详细信息,请访问 .loc docs
效果指标:
接受的答案:
def label_race (row):
if row['eri_hispanic'] == 1 :
return 'Hispanic'
if row['eri_afr_amer'] + row['eri_asian'] + row['eri_hawaiian'] + row['eri_nat_amer'] + row['eri_white'] > 1 :
return 'Two Or More'
if row['eri_nat_amer'] == 1 :
return 'A/I AK Native'
if row['eri_asian'] == 1:
return 'Asian'
if row['eri_afr_amer'] == 1:
return 'Black/AA'
if row['eri_hawaiian'] == 1:
return 'Haw/Pac Isl.'
if row['eri_white'] == 1:
return 'White'
return 'Other'
df=pd.read_csv('dataser.csv')
df = pd.concat([df]*1000)
%timeit df.apply(lambda row: label_race(row), axis=1)
每个循环1.15 s±46.5 ms(平均±标准偏差,共运行7次,每个循环1次)
我的建议答案:
def label_race(df):
df.loc[df['eri_white']==1,'race_label'] = 'White'
df.loc[df['eri_hawaiian']==1,'race_label'] = 'Haw/Pac Isl.'
df.loc[df['eri_afr_amer']==1,'race_label'] = 'Black/AA'
df.loc[df['eri_asian']==1,'race_label'] = 'Asian'
df.loc[df['eri_nat_amer']==1,'race_label'] = 'A/I AK Native'
df.loc[(df['eri_afr_amer'] + df['eri_asian'] + df['eri_hawaiian'] + df['eri_nat_amer'] + df['eri_white']) > 1,'race_label'] = 'Two Or More'
df.loc[df['eri_hispanic']==1,'race_label'] = 'Hispanic'
df['race_label'].fillna('Other', inplace=True)
df=pd.read_csv('s22.csv')
df = pd.concat([df]*1000)
%timeit label_race(df)
每个循环24.7 ms±1.7 ms(平均±标准偏差,共运行7次,每个循环10个循环)
答案 5 :(得分:0)
正如@user3483203 指出的,numpy.select 是最好的方法
将您的条件语句和相应的操作存储在两个列表中
conds = [(df['eri_hispanic'] == 1),(df[['eri_afr_amer', 'eri_asian', 'eri_hawaiian', 'eri_nat_amer', 'eri_white']].sum(1).gt(1)),(df['eri_nat_amer'] == 1),(df['eri_asian'] == 1),(df['eri_afr_amer'] == 1),(df['eri_hawaiian'] == 1),(df['eri_white'] == 1,])
actions = ['Hispanic', 'Two Or More', 'A/I AK Native', 'Asian', 'Black/AA', 'Haw/Pac Isl.', 'White']
您现在可以使用 np.select 使用这些列表作为其参数
df['label_race'] = np.select(conds,actions,default='Other')
参考:https://numpy.org/doc/stable/reference/generated/numpy.select.html
答案 6 :(得分:0)
另一种(易于推广的)方法,其基石是 pandas.DataFrame.idxmax
。首先,易于推广的序言。
# Indeed, all your conditions boils down to the following
_gt_1_key = 'two_or_more'
_lt_1_key = 'other'
# The "dictionary-based" if-else statements
labels = {
_gt_1_key : 'Two Or More',
'eri_hispanic': 'Hispanic',
'eri_nat_amer': 'A/I AK Native',
'eri_asian' : 'Asian',
'eri_afr_amer': 'Black/AA',
'eri_hawaiian': 'Haw/Pac Isl.',
'eri_white' : 'White',
_lt_1_key : 'Other',
}
# The output-driving 1-0 matrix
mat = df.filter(regex='^eri_').copy() # `~.copy` to avoid `SettingWithCopyWarning`
... 最后,在 vectorized fashion 中:
mat[_gt_1_key] = gt1 = mat.sum(axis=1)
mat[_lt_1_key] = gt1.eq(0).astype(int)
race_label = mat.idxmax(axis=1).map(labels)
哪里
>>> race_label
0 White
1 Hispanic
2 White
3 White
4 Other
5 White
6 Two Or More
7 White
8 Haw/Pac Isl.
9 White
dtype: object
这是一个 pandas.Series
实例,您可以轻松地在 df
中托管,即执行 df['race_label'] = race_label
。