Is there a better way to classify names based on uppercase and lowercase?

Time: 2018-08-21 11:28:20

Tags: python pandas dataframe nlp

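I want to classify free-text names into the categories below and then turn the result into a categorical variable.

  1. Only first: only the first letter is capitalized
  2. Standard usage: the first letter of every word is capitalized
  3. All capital: every letter is uppercase
  4. All small: every letter is lowercase
  5. Unidentified: not in any of the 4 categories above

Here is my data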

Id   Name
1    Donald trump
2    Barack Obama
3    Hillary ClintoN
4    BILL GATES
5    jeff bezoz
6    Mark Zuckerberg

What I want

Id   Name                 Category
1    Donald trump         Only first
2    Barack Obama         Standard usage
3    Hillary ClintoN      Unidentified
4    BILL GATES           All capital
5    jeff bezoz           All small
6    Mark Zuckerberg      Standard usage

What I did

df['Uppercase'] = df['Name'].str.findall(r'[A-Z]').str.len()
df['Lowercase'] = df['Name'].str.findall(r'[a-z]').str.len()
df['WordCount'] = df['Name'].str.count(' ') + 1

Then I applied some logic with the map function, for example:

`df['Lowercase'] == 0` for `All capital`
`df['Uppercase'] == 0` for `All small`
`df['Uppercase'] - df['WordCount'] == 0` for `Standard usage`
`df['Uppercase'] == 1` and `df['WordCount']` for `Only first`

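Anything that does not match any of these rules is labelled Unidentified.

However, naBih baWazir would be recorded as Standard usage under the rule above rather than Unidentified, so I think there must be a better way.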
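To make the problem concrete, here is a minimal sketch of the count-based `Standard usage` rule in plain Python. The helper name `count_rule_is_standard` is hypothetical and only mirrors the `Uppercase` and `WordCount` columns from the question; it is compared against the built-in `str.istitle`.

```python
import re

def count_rule_is_standard(name):
    """Count-based rule from the question: uppercase letters minus word count is 0."""
    uppercase = len(re.findall(r'[A-Z]', name))
    word_count = name.count(' ') + 1
    return uppercase - word_count == 0

for name in ['Barack Obama', 'naBih baWazir']:
    # Columns: name, result of the count rule, result of str.istitle()
    print(name, count_rule_is_standard(name), name.istitle())
```

The count rule accepts both names because it only checks how many uppercase letters there are, not where they sit, while `str.istitle()` rejects `naBih baWazir`.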

2 Answers:

Answer 0: (score: 2)

Use Series.str.islower, Series.str.isupper and Series.str.istitle to build boolean masks, then numpy.select to create the new column:

import numpy as np

# Only first: first character is uppercase, everything after it is lowercase
m1 = df['Name'].str[1:].str.islower() & df['Name'].str[0].str.isupper()
# Standard usage: every word starts with an uppercase letter
m2 = df['Name'].str.istitle()
# All small / All capital
m3 = df['Name'].str.islower()
m4 = df['Name'].str.isupper()

# The first matching condition wins; rows matching none get the default
df['Category'] = np.select([m1, m2, m3, m4], 
                           ['Only first','Standard usage','All small','All capital'], 
                           default='Unidentified')
print (df)
   Id             Name        Category
0   1     Donald trump      Only first
1   2     Barack Obama  Standard usage
2   3  Hillary ClintoN    Unidentified
3   4       BILL GATES     All capital
4   5       jeff bezoz       All small
5   6  Mark Zuckerberg  Standard usage

An idea from @Jon Clements, thanks:

m1 = df['Name'].str[1:].str.islower() & df['Name'].str[0].str.isupper()
df1 = df.Name.agg([str.istitle, str.islower, str.isupper])

df['Category'] = np.select(
    [m1, *df1.values.T], 
    ['Only first','Standard usage','All small','All capital'], 
    default='Unidentified'
)
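The question also asks for a categorical variable afterwards. The following is a minimal, self-contained sketch under some assumptions: it rebuilds the sample data, adds the `naBih baWazir` case mentioned in the question as an extra row, and stores the result with `pd.Categorical`. The `labels` list and the category order are illustrative choices, not part of the answer above.

```python
import numpy as np
import pandas as pd

# Sample data from the question, plus the 'naBih baWazir' edge case it mentions.
df = pd.DataFrame({'Name': ['Donald trump', 'Barack Obama', 'Hillary ClintoN',
                            'BILL GATES', 'jeff bezoz', 'Mark Zuckerberg',
                            'naBih baWazir']})

# Illustrative name: labels in the same order as the masks below.
labels = ['Only first', 'Standard usage', 'All small', 'All capital']

m1 = df['Name'].str[1:].str.islower() & df['Name'].str[0].str.isupper()
m2 = df['Name'].str.istitle()
m3 = df['Name'].str.islower()
m4 = df['Name'].str.isupper()

category = np.select([m1, m2, m3, m4], labels, default='Unidentified')

# Store the labels as a pandas categorical with a fixed set of categories.
df['Category'] = pd.Categorical(category, categories=labels + ['Unidentified'])
print(df)
```

Because none of the four masks matches `naBih baWazir`, it falls through to the `Unidentified` default, which is the behaviour the question was after.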

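Answer 1: (score: 2)

You may need to adapt this to your needs, but it should give you a rough idea of how to do it with Python's built-in string methods. You could use something like this: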

name_list = ['Donald trump','Barack Obama','Hillary Clinton','BILL GATES','jeff bezoz','Mark Zuckerberg']

for name in name_list:
    if name.isupper():
        print(name, 'All capital')
    elif name.islower():
        print(name, 'All small')
    elif name.istitle():
        print(name, 'Standard usage')
    elif (name[0].isupper() and name[1:].islower()):
        print(name, 'Only first')
    else:
        print(name, 'Unidentified')
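The same checks can be wrapped in a function that returns the label instead of printing it. This is only a sketch: `classify_name` is a hypothetical helper name, and slicing with `name[:1]` is an added guard so that an empty string does not raise an error.

```python
def classify_name(name):
    """Return the capitalisation category of a name (hypothetical helper)."""
    if name.isupper():
        return 'All capital'
    if name.islower():
        return 'All small'
    if name.istitle():
        return 'Standard usage'
    # name[:1] instead of name[0] so an empty string does not raise IndexError
    if name[:1].isupper() and name[1:].islower():
        return 'Only first'
    return 'Unidentified'

for name in ['Donald trump', 'Hillary ClintoN', 'naBih baWazir']:
    print(name, '->', classify_name(name))
```

With a DataFrame, something like `df['Category'] = df['Name'].apply(classify_name)` should fill the column. Note that the order of the checks matters: a single-word name such as `Donald` is reported as `Standard usage` here because `istitle` is tested first, whereas the `np.select` version in the other answer tests the `Only first` mask first.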