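I want to count how many pronouns, nouns, and verbs occur in each row of a CSV file, and save the counts in the file itself as new columns. I am fairly new to NLTK.
Here is my truncated clues.csv: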
id STORY
0 Sitting for long?
1 In his talk titled, 'Is Sitting the New Smoking?'
2 Prolonged sitting is an independent risk factor, even if you exercise regularly.
import pandas as pd
import csv
import nltk
from nltk.tag import pos_tag
from nltk import word_tokenize
from collections import Counter
news = pd.read_csv("clues.csv")
news['token'] = news.apply(lambda row: nltk.word_tokenize(row['STORY']), axis=1)
news['pos_tags'] = news.apply(lambda row: nltk.pos_tag(row['token']), axis=1)
tag_count_df = pd.DataFrame(news['pos_tags'].map(lambda x: Counter(tag[1] for tag in x)).to_list())
news['count']=pd.concat([news, tag_count_df], axis=1).fillna(0).drop(['pos_tags', 'token'], axis=1)
news.to_csv("clues.csv")
I keep getting ValueError: Wrong number of items passed 50, placement implies 1
Answer 0 (score: 1)
Like this?
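# Copy the sample table to the clipboard first; columns are split on runs of two or more spaces.
df = pd.read_clipboard(sep=r'\s\s+')
df['token'] = df.apply(lambda row: nltk.word_tokenize(row['STORY']), axis=1)
df['pos_tags'] = df.apply(lambda row: nltk.pos_tag(row['token']), axis=1)
# One Counter of POS tags per row, expanded into one column per tag.
tag_count_df = pd.DataFrame(df['pos_tags'].map(lambda x: Counter(tag[1] for tag in x)).to_list())
# Append the tag-count columns to the original frame and drop the intermediate columns.
pd.concat([df, tag_count_df], axis=1).fillna(0).drop(['pos_tags', 'token'], axis=1)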
Out:
   id                                              STORY   ''    ,  .   CD  \
0   0                                  Sitting for long?  0.0  0.0  1  0.0
1   1  In his talk titled, 'Is Sitting the New Smoking?'  1.0  1.0  1  1.0
2   2  Prolonged sitting is an independent risk facto...  0.0  1.0  1  0.0

    DT  IN   JJ   NN  NNP  PRP  PRP$   RB  VBG  VBN  VBP  VBZ
0  0.0   1  0.0  0.0  0.0  0.0   0.0  1.0  1.0  0.0  0.0  0.0
1  1.0   1  0.0  1.0  2.0  0.0   1.0  0.0  1.0  1.0  0.0  0.0
2  1.0   1  2.0  3.0  0.0  1.0   0.0  2.0  0.0  0.0  1.0  1.0
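The ValueError in the question most likely comes from assigning the whole multi-column result of pd.concat to the single column news['count']; the answer's code avoids it by not assigning to a column at all. Since the question asks specifically for pronoun, noun, and verb counts, the per-tag columns could also be collapsed into three. The following is only a sketch under assumptions: the grouping of Penn Treebank tags by prefix, the helper names count_groups and add_group_counts, and the output file name are illustrative and not from the original post. The tagged rows are written by hand so the example runs without NLTK data; in practice the pos_tags column would come from nltk.pos_tag as above.

```python
from collections import Counter

import pandas as pd

# Assumed grouping of Penn Treebank tags by prefix (an illustration, not from the post).
GROUPS = {
    "pronouns": ("PRP", "WP"),  # PRP, PRP$, WP, WP$
    "nouns": ("NN",),           # NN, NNS, NNP, NNPS
    "verbs": ("VB",),           # VB, VBD, VBG, VBN, VBP, VBZ
}


def count_groups(tagged_tokens):
    """Count pronouns, nouns and verbs in a list of (word, tag) pairs."""
    tag_counts = Counter(tag for _, tag in tagged_tokens)
    return {
        group: sum(n for tag, n in tag_counts.items() if tag.startswith(prefixes))
        for group, prefixes in GROUPS.items()
    }


def add_group_counts(df, tagged_column="pos_tags"):
    """Return a new frame with one integer count column per group."""
    counts = pd.DataFrame(df[tagged_column].map(count_groups).tolist(), index=df.index)
    # join returns a new DataFrame; keep the whole frame, not a single column of it.
    return df.join(counts)


# Hand-tagged stand-in for nltk.pos_tag output, so the sketch runs without NLTK data.
example = pd.DataFrame({
    "id": [0, 1],
    "pos_tags": [
        [("She", "PRP"), ("runs", "VBZ"), (".", ".")],
        [("His", "PRP$"), ("dog", "NN"), ("chased", "VBD"), ("cats", "NNS")],
    ],
})
result = add_group_counts(example).drop(columns=["pos_tags"])

# Hypothetical output name; index=False keeps pandas from adding an extra index column.
# result.to_csv("clues_with_counts.csv", index=False)
```

With the question's data, the same helper could be applied to news after the pos_tags column has been built, and the returned frame written back with to_csv.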