I am trying to sentence-tokenize the MPQA political debates corpus like this:
import nltk
from sklearn.datasets import load_files
categories=['abortion', 'creation', 'gayRights', 'god', 'guns', 'healthcare']
dataset = load_files(r'C:\Users\kahnl\svm tutorial\SomasundaranWiebe-politicalDebates', categories=categories)
from nltk.tokenize import sent_tokenize
sentences = sent_tokenize(dataset)
which gives the error:
Traceback (most recent call last):
File "<pyshell#5>", line 1, in <module>
sentences= sent_tokenize(dataset)
File "C:\Users\kahnl\AppData\Local\Programs\Python\Python36\lib\site-packages\nltk\tokenize\__init__.py", line 95, in sent_tokenize
return tokenizer.tokenize(text)
File "C:\Users\kahnl\AppData\Local\Programs\Python\Python36\lib\site-packages\nltk\tokenize\punkt.py", line 1237, in tokenize
return list(self.sentences_from_text(text, realign_boundaries))
File "C:\Users\kahnl\AppData\Local\Programs\Python\Python36\lib\site-packages\nltk\tokenize\punkt.py", line 1285, in sentences_from_text
return [text[s:e] for s, e in self.span_tokenize(text, realign_boundaries)]
File "C:\Users\kahnl\AppData\Local\Programs\Python\Python36\lib\site-packages\nltk\tokenize\punkt.py", line 1276, in span_tokenize
return [(sl.start, sl.stop) for sl in slices]
File "C:\Users\kahnl\AppData\Local\Programs\Python\Python36\lib\site-packages\nltk\tokenize\punkt.py", line 1276, in <listcomp>
return [(sl.start, sl.stop) for sl in slices]
File "C:\Users\kahnl\AppData\Local\Programs\Python\Python36\lib\site-packages\nltk\tokenize\punkt.py", line 1316, in _realign_boundaries
for sl1, sl2 in _pair_iter(slices):
File "C:\Users\kahnl\AppData\Local\Programs\Python\Python36\lib\site-packages\nltk\tokenize\punkt.py", line 312, in _pair_iter
prev = next(it)
File "C:\Users\kahnl\AppData\Local\Programs\Python\Python36\lib\site-packages\nltk\tokenize\punkt.py", line 1289, in _slices_from_text
for match in self._lang_vars.period_context_re().finditer(text):
TypeError: expected string or bytes-like object
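
From the last frame of the traceback it looks like sent_tokenize expects a single string, while load_files returns a Bunch object rather than text. Is the fix simply to decode the files and tokenize each document in dataset.data separately, roughly like the sketch below? (The encoding='utf-8' and decode_error='replace' arguments are my assumption about how the corpus files are encoded.)

import nltk
from nltk.tokenize import sent_tokenize
from sklearn.datasets import load_files

nltk.download('punkt')  # Punkt sentence model used by sent_tokenize

categories = ['abortion', 'creation', 'gayRights', 'god', 'guns', 'healthcare']
dataset = load_files(
    r'C:\Users\kahnl\svm tutorial\SomasundaranWiebe-politicalDebates',
    categories=categories,
    encoding='utf-8',        # decode raw bytes to str; sent_tokenize needs str, not bytes
    decode_error='replace',  # tolerate any stray non-UTF-8 bytes in the corpus
)

# dataset.data is a list of document strings, one per file; tokenize each one
sentences = [sent_tokenize(doc) for doc in dataset.data]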