I have a DataFrame named "data" that looks like this:
id email_body
1 text_1
2 text_2
3 text_3
4 text_4
5 text_5
6 text_6
7 text_7
8 text_8
9 text_9
10 text_10
I am using the following code to extract the full names, first names, and last names contained in the "text_i" value of each row:
import nltk
import pandas as pd
from nameparser.parser import HumanName
from nltk.corpus import wordnet

def get_human_names(text):
    # Tokenize, POS-tag, and run NLTK's named-entity chunker on the text.
    tokens = nltk.tokenize.word_tokenize(text)
    pos = nltk.pos_tag(tokens)
    sentt = nltk.ne_chunk(pos, binary = False)
    person_list = []
    lastname = []
    firstname = []
    person = []
    name = ""
    for subtree in sentt.subtrees(filter=lambda t: t.label() == 'PERSON'):
        for leaf in subtree.leaves():
            person.append(leaf[0])
        if len(person) > 1: #avoid grabbing lone surnames
            for part in person:
                name += part + ' '
            if name[:-1] not in person_list:
                person_list.append(name[:-1])
            # Drop candidates that contain an ordinary dictionary word.
            for person in person_list:
                person_split = person.split(" ")
                for name in person_split:
                    if wordnet.synsets(name):
                        if(name in person):
                            person_list.remove(person)
                            break
            firstname = [i.split(' ')[0] for i in person_list]
            lastname = [i.split(' ')[1] for i in person_list]
            name = ''
        person = []
    return person_list, firstname, lastname

names = data.email_body.apply(get_human_names)
columns = ['names','firstname','lastname' ]
data_2 = pd.DataFrame([names[0],names[1],names[2]], columns = columns)
data_2
I get the following dataset:
id names firstname lastname
0 [Lesley Kirchman, Milap Majmundar, Segoe UI] [Lesley, Milap, Segoe] [Kirchman, Majmundar, UI]
1 [Gerrit Boerman, Lesley Kirchman, Segoe UI] [Gerrit, Lesley, Segoe] [Boerman, Kirchman, UI]
2 [Lesley Kirchman] [Lesley] [Kirchman]
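As you can see, I only get 3 rows. How can I apply the function to the whole initial DataFrame "data" so that the resulting DataFrame has 10 rows?
Regards
Answer 0 (score: 0)
I found a workaround: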
data_2 = pd.DataFrame.from_items(zip(names.index, names.values)).T
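A note for newer pandas versions: `DataFrame.from_items` was deprecated in pandas 0.23 and removed in 1.0, so the line above may fail on a current install. The original attempt returned only 3 rows because `[names[0], names[1], names[2]]` picks just the first three entries of the Series. One possible alternative is to pass every entry to the regular constructor, as in the minimal sketch below. Here `fake_get_human_names` and the sample `data` are hypothetical stand-ins (so the example runs without NLTK); the only assumption is that the real function returns three lists per row, as `get_human_names` does.

```python
import pandas as pd

# Hypothetical stand-in for get_human_names: returns three lists per text,
# mirroring the (person_list, firstname, lastname) return value.
def fake_get_human_names(text):
    full = [n.strip() for n in text.split(";") if n.strip()]
    first = [n.split(" ")[0] for n in full]
    last = [n.split(" ")[-1] for n in full]
    return full, first, last

# Made-up sample data, only for illustration.
data = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "email_body": ["Ann Lee; Bob Ray", "Cy Dole", "", "Di Fox; Ed Gow; Flo Hart"],
})

# apply() yields a Series holding one 3-tuple of lists per input row.
names = data["email_body"].apply(fake_get_human_names)

# Build one output row per input row instead of indexing names[0..2] by hand.
columns = ["names", "firstname", "lastname"]
data_2 = pd.DataFrame(names.tolist(), index=names.index, columns=columns)
print(data_2)
```

Because the index of `names` is reused, `data_2` has as many rows as `data` and can be joined back with `data.join(data_2)` if needed.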
Regards