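I want to classify free-text names and then turn the result into a categorical variable with these categories:
Only first: only the first letter is capitalized
Standard usage: the first letter of every word is capitalized
All capital: every letter is uppercase
All small: every letter is lowercase
Unidentified: none of the four categories above
This is my data: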
Id Name
1 Donald trump
2 Barack Obama
3 Hillary ClintoN
4 BILL GATES
5 jeff bezoz
6 Mark Zuckerberg
What I want:
Id Name Category
1 Donald trump Only first
2 Barack Obama Standard usage
3 Hillary ClintoN Unidentified
4 BILL GATES All capital
5 jeff bezoz All small
6 Mark Zuckerberg Standard usage
What I did:
df['Uppercase'] = df['Name'].str.findall(r'[A-Z]').str.len()
df['Lowercase'] = df['Name'].str.findall(r'[a-z]').str.len()
df['WordCount'] = df['Name'].str.count(' ') + 1
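Then I used the map function to apply some logic, for example:
`df['Lowercase'] == 0` for `All capital`
`df['Uppercase'] == 0` for `All small`
`df['Uppercase'] - df['WordCount'] == 0` for `Standard usage`
`df['Uppercase'] == 1` and `df['WordCount']` for `Only first`
Anything that does not match one of these rules is labeled Unidentified.
However, under these rules naBih baWazir is recorded as Standard usage rather than Unidentified, because the rules only count uppercase letters and do not check where they appear. I think there must be a better way.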
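The misclassification described above can be reproduced without pandas. The following is only a sketch of the counting rules: the function name `count_based_category`, the order in which the rules are checked, and the simplified `Only first` condition (exactly one uppercase letter) are assumptions, not the exact code from the question.

```python
import re


def count_based_category(name):
    """Classify a name using only letter and word counts (hypothetical helper)."""
    upper = len(re.findall(r'[A-Z]', name))   # number of uppercase ASCII letters
    lower = len(re.findall(r'[a-z]', name))   # number of lowercase ASCII letters
    words = name.count(' ') + 1               # words separated by single spaces

    if lower == 0:
        return 'All capital'
    if upper == 0:
        return 'All small'
    if upper == words:
        # Assumed rule: one uppercase letter per word, wherever it appears.
        return 'Standard usage'
    if upper == 1:
        # Assumed simplification of the "Only first" rule.
        return 'Only first'
    return 'Unidentified'


# Two uppercase letters and two words, so the count-based rule is satisfied
# even though neither word starts with an uppercase letter.
print(count_based_category('naBih baWazir'))
```

Counts alone cannot tell where in a word the uppercase letters sit, which is why a position-aware check is needed.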
Answer 0: (score: 2)
Use the functions Series.str.islower, Series.str.isupper and Series.str.istitle to build boolean masks, and create the new column with numpy.select:
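import numpy as np

# Only first: first character is uppercase and everything after it is lowercase
m1 = df['Name'].str[1:].str.islower() & df['Name'].str[0].str.isupper()
# Standard usage: every word is title-cased
m2 = df['Name'].str.istitle()
# All small: every cased character is lowercase
m3 = df['Name'].str.islower()
# All capital: every cased character is uppercase
m4 = df['Name'].str.isupper()
# The first matching condition wins; rows matching none get the default label
df['Category'] = np.select([m1, m2, m3, m4],
                           ['Only first', 'Standard usage', 'All small', 'All capital'],
                           default='Unidentified')
print(df)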
Id Name Category
0 1 Donald trump Only first
1 2 Barack Obama Standard usage
2 3 Hillary ClintoN Unidentified
3 4 BILL GATES All capital
4 5 jeff bezoz All small
5 6 Mark Zuckerberg Standard usage
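To check this approach against the problematic name from the question, here is a self-contained sketch that rebuilds the sample data and appends naBih baWazir. The DataFrame construction and the variable names `name`, `conditions` and `labels` are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Sample names from the question plus the 'naBih baWazir' edge case.
df = pd.DataFrame({'Name': ['Donald trump', 'Barack Obama', 'Hillary ClintoN',
                            'BILL GATES', 'jeff bezoz', 'Mark Zuckerberg',
                            'naBih baWazir']})

name = df['Name']
conditions = [
    # First character uppercase, all remaining cased characters lowercase.
    name.str[1:].str.islower() & name.str[0].str.isupper(),
    name.str.istitle(),   # every word title-cased
    name.str.islower(),   # no uppercase letters
    name.str.isupper(),   # no lowercase letters
]
labels = ['Only first', 'Standard usage', 'All small', 'All capital']

# np.select takes the label of the first condition that is True for each row.
df['Category'] = np.select(conditions, labels, default='Unidentified')
print(df)
```

Because str.istitle checks the position of uppercase letters within each word, both Hillary ClintoN and naBih baWazir fall through to the default label.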
An idea from @Jon Clements, thanks:
m1 = df['Name'].str[1:].str.islower() & df['Name'].str[0].str.isupper()
df1 = df.Name.agg([str.istitle, str.islower, str.isupper])
df['Category'] = np.select(
[m1, *df1.values.T],
['Only first','Standard usage','All small','All capital'],
default='Unidentified'
)
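Answer 1: (score: 2)
You may need to adapt this to your needs, but it should give you a rough idea of how to do it with Python's built-in string methods. You could use something like this: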
name_list = ['Donald trump', 'Barack Obama', 'Hillary Clinton', 'BILL GATES', 'jeff bezoz', 'Mark Zuckerberg']
for name in name_list:
    if name.isupper():
        print(name, 'All capital')
    elif name.islower():
        print(name, 'All small')
    elif name.istitle():
        print(name, 'Standard usage')
    elif (name[0].isupper() and name[1:].islower()):
        print(name, 'Only first')
    else:
        print(name, 'Unidentified')
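The same checks can be wrapped in a function that returns the label instead of printing it, so the result can be stored in a column. This is only a sketch: the name `classify_name` is hypothetical, and slicing with `name[:1]` is an added guard so an empty string does not raise an IndexError.

```python
def classify_name(name):
    """Return the capitalization category of a name (hypothetical helper)."""
    if name.isupper():
        return 'All capital'
    if name.islower():
        return 'All small'
    if name.istitle():
        return 'Standard usage'
    # name[:1] is '' for an empty string, so this check is safe on empty input.
    if name[:1].isupper() and name[1:].islower():
        return 'Only first'
    return 'Unidentified'


names = ['Donald trump', 'Barack Obama', 'Hillary ClintoN',
         'BILL GATES', 'jeff bezoz', 'naBih baWazir']
for name in names:
    print(name, classify_name(name))
```

With a DataFrame like the one in the question, something like `df['Category'] = df['Name'].map(classify_name)` should produce the new column. Note that a single-word name such as Madonna matches `istitle` first and is therefore labeled Standard usage here.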