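I am using SpaCy together with Pandas to export the tokens of a sentence, along with their part-of-speech (POS) tags, to Excel. The code is as follows: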
import spacy
import xlsxwriter
import pandas as pd
nlp = spacy.load('en_core_web_sm')
text ="""He is a good boy."""
doc = nlp(text)
for token in doc:
    x = [token.text, token.lemma_, token.pos_, token.tag_, token.dep_, token.shape_, token.is_alpha, token.is_stop]
    print(x)
When I print(x), I get the following:
['He', '-PRON-', 'PRON', 'PRP', 'nsubj', 'Xx', True, False]
['is', 'be', 'VERB', 'VBZ', 'ROOT', 'xx', True, True]
['a', 'a', 'DET', 'DT', 'det', 'x', True, True]
['good', 'good', 'ADJ', 'JJ', 'amod', 'xxxx', True, False]
['boy', 'boy', 'NOUN', 'NN', 'attr', 'xxx', True, False]
['.', '.', 'PUNCT', '.', 'punct', '.', False, False]
Inside the token loop, I added a DataFrame, as shown here:
for token in doc:
    x = [token.text, token.lemma_, token.pos_, token.tag_, token.dep_, token.shape_, token.is_alpha, token.is_stop]
    df = pd.DataFrame(x)
    print(df)
Now I get output in the following format:
0
0 He
1 -PRON-
2 PRON
3 PRP
4 nsubj
5 Xx
6 True
7 False
........
........
However, when I try to export the output (df) to Excel with Pandas, using the code below, the sheet only shows the last iteration of x in the column:
df=pd.DataFrame(x)
writer = pd.ExcelWriter('pandas_simple.xlsx', engine='xlsxwriter')
df.to_excel(writer, sheet_name='Sheet1')
writer.close()  # closing the writer is what writes the file to disk
Output (in the Excel sheet):
0
0 .
1 .
2 PUNCT
3 .
4 punct
5 .
6 False
7 False
In this scenario, how can I get every iteration written out in turn, each in a new column?
0 He is ….
1 -PRON- be ….
2 PRON VERB ….
3 PRP VBZ ….
4 nsubj ROOT ….
5 Xx xx ….
6 True True ….
7 False True ….
Answer 0 (score: 0)
In case you do not have a working version yet:
import pandas as pd
rows =[
['He', '-PRON-', 'PRON', 'PRP', 'nsubj', 'Xx', True, False],
['is', 'be', 'VERB', 'VBZ', 'ROOT', 'xx', True, True],
['a', 'a', 'DET', 'DT', 'det', 'x', True, True],
['good', 'good', 'ADJ', 'JJ', 'amod', 'xxxx', True, False],
['boy', 'boy', 'NOUN', 'NN', 'attr', 'xxx', True, False],
['.', '.', 'PUNCT', '.', 'punct', '.', False, False],
]
headers = ['text', 'lemma', 'pos', 'tag', 'dep',
'shape', 'is_alpha', 'is_stop']
# example 1: list of dicts
# following https://stackoverflow.com/a/28058264/1758363
d = []
for row in rows:
    dict_ = {k: v for k, v in zip(headers, row)}
    d.append(dict_)
df = pd.DataFrame(d)[headers]
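# example 2: one single-row DataFrame per dict, joined with pd.concat
# (DataFrame.append was removed in pandas 2.0)
frames = []
for row in rows:
    dict_ = {k: v for k, v in zip(headers, row)}
    frames.append(pd.DataFrame([dict_]))
df2 = pd.concat(frames, ignore_index=True)[headers]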
# example 3: list of dicts created with the map() function
def as_dict(row):
    return {k: v for k, v in zip(headers, row)}
df3 = pd.DataFrame(list(map(as_dict, rows)))[headers]
def is_equal(df_a, df_b):
    """Substitute for pd.DataFrame.equals()"""
    return (df_a == df_b).all().all()
assert is_equal(df, df2)
assert is_equal(df2, df3)
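The reason the sheet in the question only contained the last token is that x is rebound on every pass through the loop, so whatever is built from x after the loop sees only the final list. A minimal plain-Python sketch of the difference, assuming made-up (text, pos) pairs in place of real SpaCy tokens; the names tokens, collected and columns are illustrative and not from the question:

```python
# Stand-in data: (text, pos) pairs instead of real SpaCy tokens (assumption).
tokens = [("He", "PRON"), ("is", "VERB"), ("a", "DET")]

# Pattern from the question: x is overwritten on each pass through the loop.
for text, pos in tokens:
    x = [text, pos]
last_only = x  # after the loop, only the final token's values remain

# Accumulating instead: keep one list per token.
collected = []
for text, pos in tokens:
    collected.append([text, pos])

# zip(*collected) transposes the rows, giving one sequence per attribute
# with one entry per token, which matches the layout asked for.
columns = [list(col) for col in zip(*collected)]

print(last_only)
print(columns)
```

Once every row is kept, the full list can be handed to pd.DataFrame in one call, as the examples above do.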
Answer 1 (score: 0)
Some shorter code:
import spacy
import pandas as pd
nlp = spacy.load('en_core_web_sm')
text ="""He is a good boy."""
param = [[token.text, token.lemma_, token.pos_,
          token.tag_, token.dep_, token.shape_,
          token.is_alpha, token.is_stop] for token in nlp(text)]
df=pd.DataFrame(param)
headers = ['text', 'lemma', 'pos', 'tag', 'dep',
'shape', 'is_alpha', 'is_stop']
df.columns = headers
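This gives one row per token, whereas the question asked for one column per token. A hedged sketch of the remaining steps, assuming the rows are already collected (two tokens are hard-coded here from the output printed in the question, so SpaCy is not needed): transpose with DataFrame.T, then write with DataFrame.to_excel. The helper name export_tokens is hypothetical, and to_excel needs an Excel engine such as openpyxl or xlsxwriter to be installed.

```python
import pandas as pd

headers = ['text', 'lemma', 'pos', 'tag', 'dep',
           'shape', 'is_alpha', 'is_stop']

# First two tokens, copied from the output printed in the question.
rows = [
    ['He', '-PRON-', 'PRON', 'PRP', 'nsubj', 'Xx', True, False],
    ['is', 'be', 'VERB', 'VBZ', 'ROOT', 'xx', True, True],
]

df = pd.DataFrame(rows, columns=headers)  # one row per token
wide = df.T  # one column per token; attribute names become the index


def export_tokens(frame, path='pandas_simple.xlsx'):
    """Write a frame to an Excel file (hypothetical helper, not called here)."""
    frame.to_excel(path, sheet_name='Sheet1')
```

Calling export_tokens(wide) would then write the transposed table, with the attribute names in the first column of the sheet and one further column per token.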