How do I apply a function to an entire dataset in Python?

Time: 2019-05-08 12:19:51

Tags: python nlp

I have a data frame named "data" that looks like this:

id    email_body
1      text_1
2      text_2
3      text_3
4      text_4
5      text_5
6      text_6
7      text_7
8      text_8
9      text_9
10     text_10

I am using the following code to extract the full names, first names, and last names contained in the "text_i" value of each row:

import nltk
import pandas as pd
from nameparser.parser import HumanName
from nltk.corpus import wordnet

def get_human_names(text):
  tokens = nltk.tokenize.word_tokenize(text)
  pos = nltk.pos_tag(tokens)
  sentt = nltk.ne_chunk(pos, binary = False)
  person_list = []
  lastname = []
  firstname = []
  person = []
  name = ""
  for subtree in sentt.subtrees(filter=lambda t: t.label() == 'PERSON'):
      for leaf in subtree.leaves():
          person.append(leaf[0])
      if len(person) > 1: #avoid grabbing lone surnames
          for part in person:
              name += part + ' '
          if name[:-1] not in person_list:
              person_list.append(name[:-1])
              # Drop any candidate that contains a WordNet dictionary word.
              # Iterate over a copy so that removing from person_list is safe.
              for candidate in list(person_list):
                for word in candidate.split(" "):
                  if wordnet.synsets(word):
                    person_list.remove(candidate)
                    break
      firstname = [i.split(' ')[0] for i in person_list]
      lastname = [i.split(' ')[1] for i in person_list]            
      name = ''
      person = []

  return person_list, firstname, lastname


names = data.email_body.apply(get_human_names) 

columns = ['names','firstname','lastname' ]

data_2 = pd.DataFrame([names[0],names[1],names[2]], columns = columns)


data_2

I am getting the following dataset:

id          names                                                firstname                   lastname
0   [Lesley Kirchman, Milap Majmundar, Segoe UI]    [Lesley, Milap, Segoe]  [Kirchman, Majmundar, UI]
1   [Gerrit Boerman, Lesley Kirchman, Segoe UI] [Gerrit, Lesley, Segoe] [Boerman, Kirchman, UI]
2   [Lesley Kirchman]                                  [Lesley]    [Kirchman]

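As you can see, I only get 3 rows. How can I apply the function to the entire initial data frame "data" so that the resulting data frame has 10 rows?

Regards

1 answer:

Answer 0: (score: 0)

I found a workaround: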

   data_2 = pd.DataFrame.from_items(zip(names.index, names.values)).T
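
For context, the original attempt returned 3 rows because names[0], names[1] and names[2] are the result tuples for the first three emails only, so the DataFrame constructor never saw the other seven. Also note that DataFrame.from_items was deprecated in pandas 0.23 and removed in pandas 1.0, so the line above will not run on recent versions. Below is a minimal sketch of an alternative that should work on current pandas. It assumes that get_human_names returns a 3-tuple of lists, as in the question; fake_get_human_names is a hypothetical stand-in used only so the example runs without the NLTK models.

```python
import pandas as pd


def fake_get_human_names(text):
    """Hypothetical stand-in for get_human_names (not the real extractor).

    Returns (person_list, firstname, lastname), the same shape as in the question.
    """
    person_list = [text + " Smith"]
    firstname = [n.split(" ")[0] for n in person_list]
    lastname = [n.split(" ")[1] for n in person_list]
    return person_list, firstname, lastname


# Rebuild a data frame shaped like the one in the question
data = pd.DataFrame({
    "id": range(1, 11),
    "email_body": ["text_%d" % i for i in range(1, 11)],
})

# apply returns a Series holding one 3-tuple per row of data
names = data.email_body.apply(fake_get_human_names)

columns = ["names", "firstname", "lastname"]

# Turn every tuple into one row with three columns, keeping the original index
data_2 = pd.DataFrame(names.tolist(), index=names.index, columns=columns)

print(data_2.shape)
```

Because the index of names is kept, data_2 lines up row by row with data, so something like data.join(data_2) should attach the three new columns to the original frame.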

Regards