How to tokenize a text corpus?

Time: 2019-08-06 19:15:42

Tags: python pandas numpy recommendation-engine

I want to tokenize a text corpus using the NLTK library.

My corpus looks like this:

['Did you hear about the Native American man that drank 200 cups of tea?',
 "What's the best anti diarrheal prescription?",
 'What do you call a person who is outside a door and has no arms nor legs?',
 'Which Star Trek character is a member of the magic circle?',
 "What's the difference between a bullet and a human?",

I tried:

tok_corp = [nltk.word_tokenize(sent.decode('utf-8')) for sent in corpus]

which raises:

AttributeError: 'str' object has no attribute 'decode'

Any help would be appreciated. Thanks.

2 answers:

Answer 0: (score: 1)

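The error message says it all: sent has no decode attribute. You only need .decode() on objects that were encoded first, i.e. bytes objects, not str objects. In Python 3 the items in your corpus are already str, so remove the .decode('utf-8') call and it should work.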
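To illustrate the distinction, here is a minimal sketch, assuming Python 3. The helper name to_text is hypothetical and not part of NLTK; it decodes an item only when it is a bytes object and passes str items through unchanged.

```python
def to_text(item, encoding="utf-8"):
    """Return item as str, decoding only when it is a bytes object.

    to_text is a hypothetical helper for illustration, not an NLTK function.
    """
    if isinstance(item, bytes):
        return item.decode(encoding)
    return item


raw = b"What's the best anti diarrheal prescription?"   # bytes: has .decode()
text = "What's the best anti diarrheal prescription?"   # str: no .decode() in Python 3

print(hasattr(raw, "decode"))    # True
print(hasattr(text, "decode"))   # False
print(to_text(raw) == text)      # True
```

A corpus that mixes bytes and str could be normalized with [to_text(sent) for sent in corpus] before it is passed to the tokenizer.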

Answer 1: (score: 1)

This page suggests that the word_tokenize method expects a string as its argument, so just try

tok_corp = [nltk.word_tokenize(sent) for sent in corpus]

Edit: with the following code I can get the tokenized corpus.

Code:

import pandas as pd
from nltk import word_tokenize

corpus = ['Did you hear about the Native American man that drank 200 cups of tea?',
 "What's the best anti diarrheal prescription?",
 'What do you call a person who is outside a door and has no arms nor legs?',
 'Which Star Trek character is a member of the magic circle?',
 "What's the difference between a bullet and a human?"]


tok_corp = pd.DataFrame([word_tokenize(sent) for sent in corpus])

Output:

      0     1     2           3        4   ...    13    14    15    16    17
0    Did   you  hear       about      the  ...   tea     ?  None  None  None
1   What    's   the        best     anti  ...  None  None  None  None  None
2   What    do   you        call        a  ...    no  arms   nor  legs     ?
3  Which  Star  Trek   character       is  ...  None  None  None  None  None
4   What    's   the  difference  between  ...  None  None  None  None  None

I think some objects that are neither strings nor bytes-like have crept into your corpus. I suggest you check it again.
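One way to run that check is sketched below. The function name find_non_text and the sample list are assumptions made for illustration; the function reports the position and type name of every item that is neither str nor bytes.

```python
def find_non_text(corpus):
    """Return (index, type name) pairs for items that are neither str nor bytes.

    find_non_text is a hypothetical helper for illustration.
    """
    return [
        (i, type(item).__name__)
        for i, item in enumerate(corpus)
        if not isinstance(item, (str, bytes))
    ]


# Hypothetical sample with two stray entries mixed in.
sample = [
    "Which Star Trek character is a member of the magic circle?",
    None,
    3.14,
    b"bytes are reported as fine here",
]

print(find_non_text(sample))  # [(1, 'NoneType'), (2, 'float')]
```

An empty result means every item is a str or bytes object, so the problem lies elsewhere, for example in calling .decode() on a str.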