Move string data into a new column holding an arbitrary number of arbitrary values

Time: 2019-03-28 23:10:50

Tags: python pandas data-structures nlp

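I am extracting proper nouns from a column that contains string data. I want to move the extracted nouns into a new column as a list (or, alternatively, add one column per noun). Each entry I extract from contains an arbitrary (sometimes large) number of nouns.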

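I have finished the extraction and moved the values I am interested in into a list, but I can't figure out how to add them as a column to the DataFrame they were extracted from: the list I extract has a different length from the DataFrame, and each row's nouns need to stay with that single row.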

    import nltk

    data = []
    norm_data['words'] = []
    for sent in norm_data['gtd_summary']:
        sentences = nltk.sent_tokenize(sent)
        # POS-tag the tokens of this entry and append them to the running list
        data = data + nltk.pos_tag(nltk.word_tokenize(sent))
        for word in data:
            # Keep tokens tagged as proper nouns (NNP, NNPS)
            if 'NNP' in word[1]:
                nouns = list(word)[0]
                norm_data['words'].append(nouns)

Current data:

    X   Y
    1   Joe Montana walks over to the yard
    2   Steve Smith joins the Navy
    3   Anne Johnson wants to go to a club
    4   Billy is interested in Sally

What I want:

    X   Y                                       Z
    1   Joe Montana walks over to the yard      [Joe, Montana]
    2   Steve Smith joins the Navy              [Steve, Smith, Navy]
    3   Anne Johnson wants to go to a club      [Anne, Johnson]
    4   Billy is interested in Sally            [Billy, Sally]

Or this would also work:

    X   Y                                       Z      L            M
    1   Joe Montana walks over to the yard      Joe    Montana      NA
    2   Steve Smith joins the Navy              Steve  Smith        Navy
    3   Anne Johnson wants to go to a club      Anne   Johnson      NA
    4   Billy is interested in Sally            Billy  Sally        NA
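
The wide layout above can be derived from a list column once one exists. The following is only a minimal sketch under assumptions: the DataFrame already holds the lists in `Z` (values copied from the tables above), and the column names `noun_1`, `noun_2`, ... are hypothetical placeholders rather than the `Z`, `L`, `M` headers shown in the table.

```python
import pandas as pd

# Assumed input: the sample rows from the question, with the list column Z already filled in.
df = pd.DataFrame({
    'X': [1, 2, 3, 4],
    'Y': [
        'Joe Montana walks over to the yard',
        'Steve Smith joins the Navy',
        'Anne Johnson wants to go to a club',
        'Billy is interested in Sally',
    ],
    'Z': [
        ['Joe', 'Montana'],
        ['Steve', 'Smith', 'Navy'],
        ['Anne', 'Johnson'],
        ['Billy', 'Sally'],
    ],
})

# Building a DataFrame from lists of unequal length pads the shorter rows with missing values.
wide = pd.DataFrame(df['Z'].tolist(), index=df.index)

# Hypothetical column names for this sketch: noun_1, noun_2, ...
wide.columns = [f'noun_{i + 1}' for i in range(wide.shape[1])]

# Replace the list column with one column per noun.
result = df.drop(columns='Z').join(wide)
print(result)
```

Reusing `df.index` for the intermediate frame keeps each noun on the row it came from, so the join does not depend on row order.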

1 answer:

Answer 0 (score: 0)

You can build a Series that contains the lists. After the loop, add the Z column to the DataFrame (I assume your data is in a DataFrame?).

    import pandas as pd

    # Init before the loop
    noun_series = pd.Series(dtype=object)
    index = 0
    ...
        # Build up the series
        nouns = list(word)[0]
        noun_series.at[index] = nouns
        index += 1
    ...
    # After the loop - add the Z column
    df['Z'] = noun_series

You do need to set the index correctly, though, so that each entry lines up with the right row.
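
To make the index point concrete, here is a minimal sketch that collects one list per row and keeps it aligned with the DataFrame. It rests on several assumptions: `proper_nouns` is a hypothetical helper, and it uses a capitalized-word heuristic as a stand-in for POS tagging so the example runs without NLTK data; with NLTK, the filter would instead keep tokens from `nltk.pos_tag(nltk.word_tokenize(text))` whose tag starts with `NNP`. The non-default index values are made up purely to show the alignment.

```python
import pandas as pd

def proper_nouns(text):
    # Stand-in for POS tagging (an assumption of this sketch):
    # treat every capitalized token as a proper noun.
    return [token for token in text.split() if token[:1].isupper()]

# Sample rows from the question, with a made-up non-default index.
df = pd.DataFrame(
    {
        'X': [1, 2, 3, 4],
        'Y': [
            'Joe Montana walks over to the yard',
            'Steve Smith joins the Navy',
            'Anne Johnson wants to go to a club',
            'Billy is interested in Sally',
        ],
    },
    index=[10, 20, 30, 40],
)

# Option 1: apply returns a Series with the same index as df['Y'],
# so each list lands on the row it was extracted from.
df['Z'] = df['Y'].apply(proper_nouns)

# Option 2: build the lists in a loop, then give the Series the DataFrame's index.
rows = [proper_nouns(text) for text in df['Y']]
noun_series = pd.Series(rows, index=df.index)
df['Z_loop'] = noun_series

print(df)
```

Both options produce one list per row. Assigning a Series to a column aligns on index labels, so a Series built with a default `0..n-1` index would leave missing values wherever the DataFrame's labels differ.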