I have a pandas DataFrame, and I am trying to tokenize the contents of each row.
import pandas as pd
import nltk as nk
from nltk import word_tokenize
TextData = pd.read_csv('TextData.csv')
TextData['tokenized_summary'] = TextData.apply(lambda row: nk.word_tokenize(row['Summary']), axis=1)
When I run it, I get an error at row 67:
TypeError: ('expected string or buffer', u'occurred at index 67')
I think I am getting this because the 'Summary' value at iloc[67] is an NA value.
TextData.Summary.iloc[67]
Out[45]: nan
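Assuming the NA value is what causes this, is there a way to make word_tokenize or pandas skip NA values when they are encountered?
If not, what else could be causing this?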
Answer 0 (score: 1)
You can use fillna() to replace NaN with a specified value:
import pandas as pd
import nltk as nk
from nltk import word_tokenize
TextData = pd.read_csv('TextData.csv')
TextData['Summary'] = TextData['Summary'].fillna('some value')  # fillna returns a new object, so assign the result back
TextData['tokenized_summary'] = TextData.apply(lambda row: nk.word_tokenize(row['Summary']), axis=1)
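One detail worth checking with this approach: by default fillna() returns a new object and leaves the original untouched, so the result has to be assigned back. A minimal sketch, assuming a small hypothetical frame named demo in place of TextData.csv:

```python
import pandas as pd

# Hypothetical two-row frame standing in for TextData.csv
demo = pd.DataFrame({'Summary': ['some text', float('nan')]})

# Calling fillna without keeping the result leaves the column unchanged
demo['Summary'].fillna('')
still_missing = demo['Summary'].isna().sum()

# Assigning the result back actually replaces the NaN
demo['Summary'] = demo['Summary'].fillna('')
now_missing = demo['Summary'].isna().sum()

print(still_missing, now_missing)
```

Filling with an empty string instead of 'some value' may be preferable here, since a tokenizer would then produce an empty list for those rows rather than tokens from placeholder text.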
Alternatively, you can simply drop the records where that value is null:
TextData = TextData[TextData['Summary'].notnull()]
so that the final result looks like:
import pandas as pd
import nltk as nk
from nltk import word_tokenize
TextData = pd.read_csv('TextData.csv')
TextData = TextData[TextData['Summary'].notnull()]  # filter on 'Summary'; 'tokenized_summary' does not exist yet
TextData['tokenized_summary'] = TextData.apply(lambda row: nk.word_tokenize(row['Summary']), axis=1)
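If you would rather keep the NA rows and have them skipped, one option is to tokenize only the values that are actually strings. The following is a rough sketch, not the only way to do it: tokenize_non_null is a hypothetical helper, the small frame stands in for TextData.csv, and str.split is used as a stand-in tokenizer so the snippet runs without NLTK data. Passing nk.word_tokenize as the tokenizer should behave the same way on real text.

```python
import pandas as pd

def tokenize_non_null(series, tokenizer=str.split):
    """Apply tokenizer to string entries only; return other values (e.g. NaN) unchanged."""
    return series.apply(lambda text: tokenizer(text) if isinstance(text, str) else text)

# Hypothetical data standing in for TextData.csv
TextData = pd.DataFrame({'Summary': ['a short summary', float('nan'), 'another one']})

# Rows with a missing Summary stay NaN in the new column instead of raising TypeError
TextData['tokenized_summary'] = tokenize_non_null(TextData['Summary'])
print(TextData)
```

The isinstance check also guards against other non-string values (numbers, for example), which would trigger the same "expected string or buffer" error.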