Question

I have a list of author names in irregular formats, as shown below:

import pandas as pd
df = pd.DataFrame({'author': ['fox district judge', 'louise w flanagan united states district judge', 'amy berman jackson united states district judge', 'rhesa hawkins barksdale, circuit judge', 'kanne, circuit judge']})

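To parse the last name I used df['Last_Name'] = df['author'].apply(lambda x: x.split(',')[0].split(' ')[-1]), but this line only works for the last two authors, whose names are followed by a comma. How can I extract last names such as fox and flanagan from the first two rows?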
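For reference, a minimal reproduction of that attempt shows where it breaks. In rows without a comma, the text before the "first comma" is the whole string, so the last token is the title word judge rather than a name:

```python
import pandas as pd

df = pd.DataFrame({'author': [
    'fox district judge',
    'louise w flanagan united states district judge',
    'amy berman jackson united states district judge',
    'rhesa hawkins barksdale, circuit judge',
    'kanne, circuit judge',
]})

# Original attempt: take the text before the first comma, then its last word.
df['Last_Name'] = df['author'].apply(lambda x: x.split(',')[0].split(' ')[-1])

print(df['Last_Name'].tolist())
# ['judge', 'judge', 'judge', 'barksdale', 'kanne']
```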

Answer 1

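Similar to the last comment: if you want to automate this and can't simply clean the data by hand, I would suggest removing everything from the string except the name (for example, the words district, circuit, judge, united, states) along with any commas. That way only the name is left, and even if only a last name remains, you know the last name will always be at index -1: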

last_names = []
to_delete = {'united', 'states', 'district', 'circuit', 'judge'}  # title words, not names

for author_string in df['author']:
    author_string = author_string.replace(',', '')  # remove any commas
    words = author_string.split()  # split on whitespace
    names = [word for word in words if word not in to_delete]  # keep only name words
    last_names.append(names[-1])  # only names are left, so the last name is the last item

df['Last_Name'] = last_names

It isn't as pretty as the original approach, but it seemed to work when I tried it.
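If you would rather avoid the explicit loop, the same idea can probably be expressed with pandas string methods. This is only a sketch under the same assumption as above, namely that the title words in to_delete are the only non-name words in the column; the names titles, pattern, and cleaned are just illustrative:

```python
import pandas as pd

df = pd.DataFrame({'author': [
    'fox district judge',
    'louise w flanagan united states district judge',
    'amy berman jackson united states district judge',
    'rhesa hawkins barksdale, circuit judge',
    'kanne, circuit judge',
]})

# Title words to strip (assumed to be the same list as to_delete above).
titles = ['united', 'states', 'district', 'circuit', 'judge']

# Match any title as a whole word, or a comma.
pattern = r'\b(?:' + '|'.join(titles) + r')\b|,'

# Replace titles and commas with a space, then trim the ends.
cleaned = df['author'].str.replace(pattern, ' ', regex=True).str.strip()

# Split on whitespace and keep the last remaining word of each row.
df['Last_Name'] = cleaned.str.split().str[-1]

print(df['Last_Name'].tolist())
# ['fox', 'flanagan', 'jackson', 'barksdale', 'kanne']
```

Note that both versions would also drop a real surname that happens to match one of the title words (a person actually named Judge, for example), so the word list may need adjusting for your data.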
