Question

I want to tokenize a text corpus using the NLTK library.

My corpus looks like this:

['Did you hear about the Native American man that drank 200 cups of tea?',
 "What's the best anti diarrheal prescription?",
 'What do you call a person who is outside a door and has no arms nor legs?',
 'Which Star Trek character is a member of the magic circle?',
 "What's the difference between a bullet and a human?",

I tried:

tok_corp = [nltk.word_tokenize(sent.decode('utf-8')) for sent in corpus]

which raises:

AttributeError: 'str' object has no attribute 'decode'

Any help would be appreciated. Thanks.

Answer 1

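The error says it all: sent has no decode attribute. You only need to call .decode() on objects that were encoded in the first place, i.e. bytes objects, not str objects. In Python 3 the items in your list are already str, so remove the .decode('utf-8') call and it should work.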
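To make the bytes versus str distinction concrete, here is a minimal sketch. The helper name to_text is hypothetical (it is not part of NLTK or the original post); it decodes only when the input is actually bytes, so the same call works on either kind of input.

```python
def to_text(sent, encoding="utf-8"):
    """Return sent as str, decoding only when it is a bytes-like object.

    Hypothetical helper for illustration; not part of NLTK.
    """
    if isinstance(sent, (bytes, bytearray)):
        return sent.decode(encoding)
    return sent


raw = b"What's the best anti diarrheal prescription?"   # bytes: has .decode()
text = "What's the best anti diarrheal prescription?"   # str: no .decode() in Python 3

print(hasattr(raw, "decode"))    # bytes objects can be decoded
print(hasattr(text, "decode"))   # str objects cannot, hence the AttributeError
print(to_text(raw) == to_text(text))
```

If the corpus is known to contain only str, the helper is unnecessary and the strings can be passed to the tokenizer directly.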

Answer 2

This page suggests that word_tokenize expects a string as its argument, so just try:

tok_corp = [nltk.word_tokenize(sent) for sent in corpus]

Edit: with the following code I can get the tokenized corpus.

Code:

import pandas as pd
from nltk import word_tokenize

corpus = ['Did you hear about the Native American man that drank 200 cups of tea?',
 "What's the best anti diarrheal prescription?",
 'What do you call a person who is outside a door and has no arms nor legs?',
 'Which Star Trek character is a member of the magic circle?',
 "What's the difference between a bullet and a human?"]


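# One row per sentence, one column per token; shorter sentences are padded with None
tok_corp = pd.DataFrame([word_tokenize(sent) for sent in corpus])
print(tok_corp)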

Output:

      0     1     2           3        4   ...    13    14    15    16    17
0    Did   you  hear       about      the  ...   tea     ?  None  None  None
1   What    's   the        best     anti  ...  None  None  None  None  None
2   What    do   you        call        a  ...    no  arms   nor  legs     ?
3  Which  Star  Trek   character       is  ...  None  None  None  None  None
4   What    's   the  difference  between  ...  None  None  None  None  None

I think some objects that are neither strings nor bytes-like have slipped into your corpus. I suggest you check it again.
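One way to run that check is sketched below. The function name find_non_text and the sample list are made up for illustration; the idea is simply to report the position and type of every item that is neither str nor bytes before handing the corpus to the tokenizer.

```python
def find_non_text(corpus):
    """Return (index, type name) for every item that is neither str nor bytes.

    Hypothetical helper for illustration; not part of NLTK.
    """
    return [
        (i, type(item).__name__)
        for i, item in enumerate(corpus)
        if not isinstance(item, (str, bytes))
    ]


# Made-up sample with two offending entries (None and a float)
sample = [
    "Which Star Trek character is a member of the magic circle?",
    None,
    3.14,
    b"bytes are fine once decoded",
]

print(find_non_text(sample))  # [(1, 'NoneType'), (2, 'float')]
```

An empty result means every item is text, and any remaining error would have to come from somewhere else.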

How do I tokenize a text corpus?
