How to extract last names conditionally in pandas

Time: 2020-09-16 17:41:06

Tags: python pandas nlp

I have a list of author names in an irregular format, like this:

import pandas as pd
df = pd.DataFrame({'author': ['fox district judge', 'louise w flanagan united states district judge', 'amy berman jackson united states district judge', 'rhesa hawkins barksdale, circuit judge', 'kanne, circuit judge']})

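To parse the last name I used df['Last_Name'] = df['author'].apply(lambda x: x.split(',')[0].split(' ')[-1]), but this line only works for the last two authors, where a comma separates the name from the title. How can I extract last names such as fox and flanagan from the first rows, which have no comma?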
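To make the problem concrete, here is a small check of what that line returns on the sample data. This is only an illustrative sketch; the variable name `attempt` is not from the original post.

```python
import pandas as pd

df = pd.DataFrame({'author': [
    'fox district judge',
    'louise w flanagan united states district judge',
    'amy berman jackson united states district judge',
    'rhesa hawkins barksdale, circuit judge',
    'kanne, circuit judge',
]})

# Take the text before the first comma, then its last space-separated word.
attempt = df['author'].apply(lambda x: x.split(',')[0].split(' ')[-1])
print(attempt.tolist())
# ['judge', 'judge', 'judge', 'barksdale', 'kanne']
```

Rows without a comma keep the whole string, so the last word is the title "judge" rather than the name.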

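1 Answer:

Answer 0: (score: 1)

Similar to the last comment: if you want to automate this and cannot simply clean the data by hand, I would suggest removing everything from the string that is not part of the name (for example the words district, circuit, judge, united, states) along with any commas. That way only the name remains, and even if only the last name is left, you know the last name will always be at index -1: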

last_names = []
to_delete = ['united', 'states', 'district', 'circuit', 'judge']

strings = list(df['author'])
for author_string in strings:
    author_string = author_string.replace(',', "") # remove any commas
    lst = author_string.split(' ')
    temp = lst.copy() # create copy of lst so we can actually remove words
    for word in lst:
        if word in to_delete:
            temp.remove(word)
    last_names.append(temp[-1]) # since only names are left, last name is always the last index

df['Last_Name'] = last_names

It is not as pretty as the original approach, but it seemed to work when I tried it.
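The same idea can probably be written more compactly with a small helper passed to `Series.apply`. The sketch below is one possible variant, not the answer's own code: the names `last_name` and `TO_DELETE` are hypothetical, and returning `None` when nothing but title words remain is an added assumption to avoid an `IndexError`.

```python
import pandas as pd

# Title words to strip; mirrors the to_delete list above and would need
# extending for data with other titles (an assumption about the input).
TO_DELETE = {'united', 'states', 'district', 'circuit', 'judge'}

def last_name(author):
    """Drop commas and title words, then return the last remaining word."""
    words = [w for w in author.replace(',', ' ').split() if w not in TO_DELETE]
    return words[-1] if words else None  # None if only title words were present

df = pd.DataFrame({'author': [
    'fox district judge',
    'louise w flanagan united states district judge',
    'amy berman jackson united states district judge',
    'rhesa hawkins barksdale, circuit judge',
    'kanne, circuit judge',
]})

df['Last_Name'] = df['author'].apply(last_name)
print(df['Last_Name'].tolist())
# ['fox', 'flanagan', 'jackson', 'barksdale', 'kanne']
```

Note that this approach, like the loop above, would misfire on a judge whose own name matches one of the removed words, so the word list should be checked against the real data.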