NLTK Python tokenizing a CSV file

Asked: 2015-06-01 10:58:23

Tags: python-2.7 csv nltk tokenize

I've started experimenting with Python and NLTK. I'm running into a lengthy error message that I can't find a solution for, and I'd appreciate any insight you might have.

import nltk,csv,numpy 
from nltk import sent_tokenize, word_tokenize, pos_tag
reader = csv.reader(open('Medium_Edited.csv', 'rU'), delimiter= ",",quotechar='|')
tokenData = nltk.word_tokenize(reader)

I'm running Python 2.7 and the latest nltk package on OS X Yosemite. These are the two lines of code I also tried, with no difference in the result:

with open("Medium_Edited.csv", "rU") as csvfile:
    tokenData = nltk.word_tokenize(reader)

These are the error messages I see:

Traceback (most recent call last):
  File "nltk_text.py", line 11, in <module>
    tokenData = nltk.word_tokenize(reader)
  File "/Library/Python/2.7/site-packages/nltk/tokenize/__init__.py", line 101, in word_tokenize
    return [token for sent in sent_tokenize(text, language)
  File "/Library/Python/2.7/site-packages/nltk/tokenize/__init__.py", line 86, in sent_tokenize
    return tokenizer.tokenize(text)
  File "/Library/Python/2.7/site-packages/nltk/tokenize/punkt.py", line 1226, in tokenize
    return list(self.sentences_from_text(text, realign_boundaries))
  File "/Library/Python/2.7/site-packages/nltk/tokenize/punkt.py", line 1274, in sentences_from_text
    return [text[s:e] for s, e in self.span_tokenize(text, realign_boundaries)]
  File "/Library/Python/2.7/site-packages/nltk/tokenize/punkt.py", line 1265, in span_tokenize
    return [(sl.start, sl.stop) for sl in slices]
  File "/Library/Python/2.7/site-packages/nltk/tokenize/punkt.py", line 1304, in _realign_boundaries
    for sl1, sl2 in _pair_iter(slices):
  File "/Library/Python/2.7/site-packages/nltk/tokenize/punkt.py", line 310, in _pair_iter
    prev = next(it)
  File "/Library/Python/2.7/site-packages/nltk/tokenize/punkt.py", line 1278, in _slices_from_text
    for match in self._lang_vars.period_context_re().finditer(text):
TypeError: expected string or buffer

Thanks in advance

2 answers:

Answer 0 (score: 4)

As you can read in the Python csv documentation, csv.reader "returns a reader object which will iterate over lines in the given csvfile". In other words, if you want to tokenize the text in the CSV file, you have to iterate over those lines and over the fields within them:

for line in reader:
    for field in line:
        tokens = word_tokenize(field)

Also, since you import word_tokenize at the beginning of your script, you should call it as word_tokenize, not nltk.word_tokenize. That also means you can drop the import nltk statement.
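
Putting both points together, a minimal sketch of the whole script could look like the following (the filename, delimiter, and quotechar come from the question; collecting everything into a single all_tokens list is just one possible way to store the results):

import csv
from nltk import word_tokenize

all_tokens = []
with open('Medium_Edited.csv', 'rU') as csvfile:
    reader = csv.reader(csvfile, delimiter=',', quotechar='|')
    # Iterate over rows, then over the fields in each row,
    # and tokenize each field string individually.
    for line in reader:
        for field in line:
            all_tokens.extend(word_tokenize(field))

The key point is that word_tokenize is only ever handed a string (each field), never the reader object itself, which is what triggered the TypeError in the question.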

Answer 1 (score: 1)

It gives the error "expected string or buffer" because you forgot to add str, as in

tokenData = nltk.word_tokenize(str(reader))