I have a list of author names in an irregular format, like this:
import pandas as pd
df = pd.DataFrame({'author': ['fox district judge', 'louise w flanagan united states district judge', 'amy berman jackson united states district judge', 'rhesa hawkins barksdale, circuit judge', 'kanne, circuit judge']})
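To parse the last name, I used df['Last_Name'] = df['author'].apply(lambda x: x.split(',')[0].split(' ')[-1]), but this line only works for the last two authors, whose names are followed by a comma. How can I extract last names such as fox and flanagan from the first two rows?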
Answer 0: (score: 1)
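Similar to the last comment: if you want to automate this and cannot simply clean the data by hand, I would suggest removing everything from the string except the name (for example, the words district, circuit, judge, united, and states) along with any commas. That way only the name is left, and even when only a last name remains, you know the last name will always be at index -1: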
last_names = []
to_delete = ['united', 'states', 'district', 'circuit', 'judge']
strings = list(df['author'])
for author_string in strings:
    author_string = author_string.replace(',', "")  # remove any commas
    lst = author_string.split(' ')
    temp = lst.copy()  # copy lst so we can remove words while iterating over the original
    for word in lst:
        if word in to_delete:
            temp.remove(word)
    last_names.append(temp[-1])  # only names are left, so the last name is always the last item
df['Last_Name'] = last_names
It isn't as pretty as the original approach, but it seemed to work when I tried it.
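For what it's worth, the same filtering idea can probably be written more compactly. The sketch below is only an illustration: last_name is a hypothetical helper name, the title words are the same five used above, and returning None when nothing is left after filtering is an assumption (the loop version would raise an IndexError in that case).

```python
import pandas as pd

df = pd.DataFrame({'author': [
    'fox district judge',
    'louise w flanagan united states district judge',
    'amy berman jackson united states district judge',
    'rhesa hawkins barksdale, circuit judge',
    'kanne, circuit judge',
]})

# Same title words as in the answer above; extend this set for other titles.
to_delete = {'united', 'states', 'district', 'circuit', 'judge'}

def last_name(author):
    """Hypothetical helper: drop commas and title words, return the last remaining word."""
    words = [w for w in author.replace(',', '').split() if w not in to_delete]
    # Assumption: return None if the string contained nothing but title words.
    return words[-1] if words else None

df['Last_Name'] = df['author'].apply(last_name)
print(df['Last_Name'].tolist())
```

Like the loop version, this only holds up while the list of title words covers everything that can follow a name, and it would misfire on a judge whose last name happens to be one of those words.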