How to count the pronouns, nouns, and verbs in the sentences of a CSV file's rows

Date: 2019-10-13 14:47:41

Tags: python-3.x pandas csv nltk

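I want to count how many pronouns, nouns, and verbs appear in each row of a CSV file, and save the counts back to the same file as new columns. I am fairly new to NLTK.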

Here is my truncated clues.csv:

id  STORY
0   Sitting for long?
1   In his talk titled, 'Is Sitting the New Smoking?'
2   Prolonged sitting is an independent risk factor, even if you exercise regularly.

import pandas as pd
import csv
import nltk
from nltk.tag import pos_tag
from nltk import word_tokenize
from collections import Counter

news = pd.read_csv("clues.csv")

news['token'] = news.apply(lambda row: nltk.word_tokenize(row['STORY']), axis=1)
news['pos_tags'] = news.apply(lambda row: nltk.pos_tag(row['token']), axis=1)

tag_count_df = pd.DataFrame(news['pos_tags'].map(lambda x: Counter(tag[1] for tag in x)).to_list())

news['count']=pd.concat([news, tag_count_df], axis=1).fillna(0).drop(['pos_tags', 'token'], axis=1)


news.to_csv("clues.csv")

I keep getting ValueError: Wrong number of items passed 50, placement implies 1
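
This message appears to come from the line that assigns to news['count']: the right-hand side is the whole concatenated DataFrame (50 columns, judging by the message), while the left-hand side names a single column. The sketch below reproduces the situation with small made-up frames, not the real clues.csv data; the column names NN and VBG and the sample sentences are illustrative assumptions, and the exact wording of the error differs between pandas versions.

```python
import pandas as pd

# Made-up stand-ins for the question's frames (not the real clues.csv data).
news = pd.DataFrame({"id": [0, 1], "STORY": ["Sitting for long?", "He sat."]})
tag_count_df = pd.DataFrame({"NN": [0, 1], "VBG": [1, 0]})

combined = pd.concat([news, tag_count_df], axis=1).fillna(0)

# Assigning a multi-column frame to one column label raises ValueError.
try:
    news["count"] = combined
    error = None
except ValueError as exc:
    error = exc  # message wording depends on the pandas version

# Keeping the concatenated frame as a whole avoids the problem.
news = combined
print(list(news.columns))
```

Rebinding the name to the concatenated frame, rather than storing it in one column, is also what makes a later to_csv call write every tag count as its own column.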

1 answer:

Answer 0: (score: 1)

Like this?

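import pandas as pd
import nltk
from collections import Counter

# Read the sample table from the clipboard; columns are separated by two or more spaces.
df = pd.read_clipboard(sep=r'\s\s+')

# Tokenize each story, then tag every token with its part of speech.
df['token'] = df.apply(lambda row: nltk.word_tokenize(row['STORY']), axis=1)
df['pos_tags'] = df.apply(lambda row: nltk.pos_tag(row['token']), axis=1)

# Build one Counter of tags per row and expand it into one column per tag.
tag_count_df = pd.DataFrame(df['pos_tags'].map(lambda x: Counter(tag[1] for tag in x)).to_list())

# Append the tag counts to the original columns and drop the helper columns.
pd.concat([df, tag_count_df], axis=1).fillna(0).drop(['pos_tags', 'token'], axis=1)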

Out:

   id                                              STORY   ''    ,  .   CD  \
0   0                                  Sitting for long?  0.0  0.0  1  0.0   
1   1  In his talk titled, 'Is Sitting the New Smoking?'  1.0  1.0  1  1.0   
2   2  Prolonged sitting is an independent risk facto...  0.0  1.0  1  0.0   

    DT  IN   JJ   NN  NNP  PRP  PRP$   RB  VBG  VBN  VBP  VBZ  
0  0.0   1  0.0  0.0  0.0  0.0   0.0  1.0  1.0  0.0  0.0  0.0  
1  1.0   1  0.0  1.0  2.0  0.0   1.0  0.0  1.0  1.0  0.0  0.0  
2  1.0   1  2.0  3.0  0.0  1.0   0.0  2.0  0.0  0.0  1.0  1.0
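
The output above has one column per Penn Treebank tag, whereas the question asks for pronouns, nouns, and verbs. One possible way to collapse the tags is to group them by prefix, as in the minimal sketch below. The function name count_categories, the prefix mapping, and the sample sentence are assumptions made for illustration and are not part of the original answer. The mapping treats PRP and PRP$ as pronouns, any NN* tag as a noun, and any VB* tag as a verb; wh-pronouns (WP, WP$) are deliberately left out and could be added if wanted.

```python
# Assumed mapping from Penn Treebank tag prefixes to coarse categories.
CATEGORY_PREFIXES = {"pronoun": "PRP", "noun": "NN", "verb": "VB"}


def count_categories(tagged_tokens):
    """Count pronouns, nouns, and verbs in a list of (word, tag) pairs."""
    counts = {name: 0 for name in CATEGORY_PREFIXES}
    for _word, tag in tagged_tokens:
        for name, prefix in CATEGORY_PREFIXES.items():
            if tag.startswith(prefix):
                counts[name] += 1
    return counts


# Hand-written tags standing in for nltk.pos_tag output (hypothetical example).
sample = [("He", "PRP"), ("likes", "VBZ"), ("his", "PRP$"), ("dogs", "NNS"), (".", ".")]
result = count_categories(sample)
print(result)  # {'pronoun': 2, 'noun': 1, 'verb': 1}
```

Applied to the frame above, something like pd.DataFrame(df['pos_tags'].map(count_categories).to_list()) should give three columns (pronoun, noun, verb) that can be concatenated to df in the same way as tag_count_df.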