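I am extracting proper nouns from a column that contains string data. I want to put the extracted nouns into a new column as a list (or, alternatively, into one additional column per noun). Each entry I extract from contains an arbitrary (sometimes large) number of nouns.
I have done the extraction and moved the values I care about into a list, but I can't figure out how to add them as a column to the DataFrame they came from: the list I end up with has a different length from the DataFrame, while each group of nouns needs to correspond to a single row.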
import nltk
from nltk.tokenize import PunktSentenceTokenizer

data = []
norm_data['words'] = []
for sent in norm_data['gtd_summary']:
    sentences = nltk.sent_tokenize(sent)
    # Tag every token of this row; the tags accumulate in one flat list
    data = data + nltk.pos_tag(nltk.word_tokenize(sent))
    for word in data:
        # word is a (token, tag) pair; NNP / NNPS mark proper nouns
        if 'NNP' in word[1]:
            nouns = list(word)[0]
            norm_data['words'].append(nouns)
Current data
X Y
1 Joe Montana walks over to the yard
2 Steve Smith joins the Navy
3 Anne Johnson wants to go to a club
4 Billy is interested in Sally
What I want
X Y                                   Z
1 Joe Montana walks over to the yard  [Joe, Montana]
2 Steve Smith joins the Navy          [Steve, Smith, Navy]
3 Anne Johnson wants to go to a club  [Anne, Johnson]
4 Billy is interested in Sally        [Billy, Sally]
Or this would also work
X Y                                   Z      L        M
1 Joe Montana walks over to the yard  Joe    Montana  NA
2 Steve Smith joins the Navy          Steve  Smith    Navy
3 Anne Johnson wants to go to a club  Anne   Johnson  NA
4 Billy is interested in Sally        Billy  Sally    NA
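For reference, the two layouts are closely related: once a column of lists exists, it can be expanded into one column per noun. The following is only a sketch under assumptions: the DataFrame name `df` and the hard-coded lists are made up for illustration, and the column names Z, L, M are taken from the table above, so it only fits rows with at most three nouns.

```python
import pandas as pd

# Hypothetical frame that already holds one list of nouns per row
df = pd.DataFrame(
    {
        "Y": [
            "Joe Montana walks over to the yard",
            "Steve Smith joins the Navy",
            "Anne Johnson wants to go to a club",
            "Billy is interested in Sally",
        ],
        "Z": [
            ["Joe", "Montana"],
            ["Steve", "Smith", "Navy"],
            ["Anne", "Johnson"],
            ["Billy", "Sally"],
        ],
    },
    index=[1, 2, 3, 4],
)

# Building a DataFrame from ragged lists pads the short rows with missing values
wide = pd.DataFrame(df["Z"].tolist(), index=df.index)

# Assumes exactly three noun columns, as in the sample table
wide.columns = ["Z", "L", "M"]

# Join on the shared index so every noun stays on its own row
result = df[["Y"]].join(wide)
print(result)
```

Passing `index=df.index` is what keeps the expanded columns attached to the right rows when they are joined back.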
Answer 0 (score: 0)
You can build a Series that holds the lists. After the loop, add the Z column to the DataFrame (I assume your data is in a DataFrame?)
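# Init before the loop (object dtype so each entry can hold arbitrary values)
noun_series = pd.Series(dtype=object)
...
# Build up series; `index` must be the label of the row being processed
nouns = list(word)[0]
noun_series.at[index] = nouns
index += 1
...
# After the loop - add the Z column (pandas aligns it on the index)
df['Z'] = noun_series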
You will need to set the index correctly, though, so that each entry matches the right row.
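To make the index point concrete, here is a minimal sketch of one way it could look, not a definitive implementation. It collects one list per row and reuses the DataFrame's own index for the Series. Note the assumptions: `extract_proper_nouns` is a hypothetical helper, and it uses a crude "capitalised token" rule as a stand-in for the `nltk.pos_tag` step so that the example runs without NLTK data; the names `df` and `Y` follow the sample table in the question.

```python
import pandas as pd


def extract_proper_nouns(text):
    # Stand-in for the NLTK tagging step (assumption): treat every
    # capitalised token as a proper noun. Swap in pos_tag for real data.
    return [tok for tok in text.split() if tok[:1].isupper()]


df = pd.DataFrame(
    {
        "Y": [
            "Joe Montana walks over to the yard",
            "Steve Smith joins the Navy",
            "Anne Johnson wants to go to a club",
            "Billy is interested in Sally",
        ]
    },
    index=[1, 2, 3, 4],
)

# One list per row, in row order, instead of one flat list for the whole column
rows = [extract_proper_nouns(text) for text in df["Y"]]

# Reuse the DataFrame's index; a default 0..n-1 index would be aligned
# against the labels 1..4 and shift or drop entries
noun_series = pd.Series(rows, index=df.index)

df["Z"] = noun_series
print(df)
```

Because the Series has exactly one entry per row and carries the same index labels, the length mismatch from the question does not arise.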