I want to tokenize a text corpus using the NLTK library.
My corpus looks like this:
['Did you hear about the Native American man that drank 200 cups of tea?',
"What's the best anti diarrheal prescription?",
'What do you call a person who is outside a door and has no arms nor legs?',
'Which Star Trek character is a member of the magic circle?',
"What's the difference between a bullet and a human?",
I tried:
tok_corp = [nltk.word_tokenize(sent.decode('utf-8')) for sent in corpus]
which raises:
AttributeError: 'str' object has no attribute 'decode'
Any help would be appreciated. Thanks.
Answer 0: (score: 1)
The error is right there: sent has no decode attribute. You only need .decode() on objects that were encoded in the first place, i.e. bytes objects rather than str objects. Remove it and you should be fine.
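If your corpus might mix str and bytes entries, a small guard like the sketch below decodes only when needed (the tokenize_any helper name is hypothetical, and it assumes any bytes are UTF-8 encoded):

import nltk
# nltk.download('punkt')  # word_tokenize needs the punkt tokenizer models

def tokenize_any(sent):
    # Only bytes objects need decoding; str objects pass through unchanged
    if isinstance(sent, bytes):
        sent = sent.decode('utf-8')
    return nltk.word_tokenize(sent)

tok_corp = [tokenize_any(sent) for sent in corpus]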
Answer 1: (score: 1)
this page suggests that the word_tokenize method expects a string as its argument, so just try
tok_corp = [nltk.word_tokenize(sent) for sent in corpus]
Edit: with the code below I was able to get the tokenized corpus.
Code:
import pandas as pd
from nltk import word_tokenize
corpus = ['Did you hear about the Native American man that drank 200 cups of tea?',
"What's the best anti diarrheal prescription?",
'What do you call a person who is outside a door and has no arms nor legs?',
'Which Star Trek character is a member of the magic circle?',
"What's the difference between a bullet and a human?"]
# Tokenize each sentence; pandas pads shorter rows with None
tok_corp = pd.DataFrame([word_tokenize(sent) for sent in corpus])
Output:
0 1 2 3 4 ... 13 14 15 16 17
0 Did you hear about the ... tea ? None None None
1 What 's the best anti ... None None None None None
2 What do you call a ... no arms nor legs ?
3 Which Star Trek character is ... None None None None None
4 What 's the difference between ... None None None None None
I think some non-string (or non-bytes-like) object has crept into your corpus. I'd recommend double-checking it.
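As a quick sanity check (a minimal sketch, assuming corpus is the list defined above), you can list every element that is neither str nor bytes:

# Report the index and type of anything that is not a plain string or bytes object
bad = [(i, type(x)) for i, x in enumerate(corpus) if not isinstance(x, (str, bytes))]
print(bad)  # an empty list means the corpus is clean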