I am trying to tokenize my data using sent_tokenize and word_tokenize.
Below is my dummy data:
**text**: Hello world, how are you I am fine, thank you!
I am trying to tokenize it with the following code:
import pandas as pd
from nltk.tokenize import word_tokenize, sent_tokenize
Corpus=pd.read_csv(r"C:\Users\Desktop\NLP\corpus.csv",encoding='utf-8')
Corpus['text']=Corpus['text'].apply(sent_tokenize)
Corpus['text_new']=Corpus['text'].apply(word_tokenize)
But it gives the following error:
Traceback (most recent call last):
File "C:/Users/gunjit.bedi/Desktop/NLP Project/Topic Classification.py", line 24, in <module>
Corpus['text_new']=Corpus['text'].apply(word_tokenize)
File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\series.py", line 3192, in apply
mapped = lib.map_infer(values, f, convert=convert_dtype)
File "pandas/_libs/src\inference.pyx", line 1472, in pandas._libs.lib.map_infer
File "C:\ProgramData\Anaconda3\lib\site-packages\nltk\tokenize\__init__.py", line 128, in word_tokenize
sentences = [text] if preserve_line else sent_tokenize(text, language)
File "C:\ProgramData\Anaconda3\lib\site-packages\nltk\tokenize\__init__.py", line 95, in sent_tokenize
return tokenizer.tokenize(text)
File "C:\ProgramData\Anaconda3\lib\site-packages\nltk\tokenize\punkt.py", line 1241, in tokenize
return list(self.sentences_from_text(text, realign_boundaries))
File "C:\ProgramData\Anaconda3\lib\site-packages\nltk\tokenize\punkt.py", line 1291, in sentences_from_text
return [text[s:e] for s, e in self.span_tokenize(text, realign_boundaries)]
File "C:\ProgramData\Anaconda3\lib\site-packages\nltk\tokenize\punkt.py", line 1291, in <listcomp>
return [text[s:e] for s, e in self.span_tokenize(text, realign_boundaries)]
File "C:\ProgramData\Anaconda3\lib\site-packages\nltk\tokenize\punkt.py", line 1281, in span_tokenize
for sl in slices:
File "C:\ProgramData\Anaconda3\lib\site-packages\nltk\tokenize\punkt.py", line 1322, in _realign_boundaries
for sl1, sl2 in _pair_iter(slices):
File "C:\ProgramData\Anaconda3\lib\site-packages\nltk\tokenize\punkt.py", line 313, in _pair_iter
prev = next(it)
File "C:\ProgramData\Anaconda3\lib\site-packages\nltk\tokenize\punkt.py", line 1295, in _slices_from_text
for match in self._lang_vars.period_context_re().finditer(text):
TypeError: expected string or bytes-like object
I have tried a lot of things; for example, if I comment out the sent_tokenize line, word_tokenize works on its own, but the two do not work together.
Answer 0 (score: 0)
The error occurs because nltk.word_tokenize expects its input to be a string, while applying nltk.sent_tokenize to the text converts each value into a list.
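A minimal sketch of the mismatch, using the same sample strings as the example below: word_tokenize works on a single string, but raises the same TypeError once it is handed the list that sent_tokenize returns.

from nltk.tokenize import sent_tokenize, word_tokenize

sentences = sent_tokenize('Hey. Hello')   # returns a list: ['Hey.', 'Hello']
print(word_tokenize('Hey. Hello'))        # works: the input is a string
word_tokenize(sentences)                  # TypeError: expected string or bytes-like object

The same thing happens column-wise once sent_tokenize has been applied to the DataFrame: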
text = ['Hey. Hello', 'hello world!! I am akshay', 'I m fine']
df = pd.DataFrame({'text': text})
df['text'] = df['text'].apply(sent_tokenize)
print(df['text'])
Output:
text
0 [Hey., Hello]
1 [hello world!!, I am akshay]
2 [I m fine]
Try this:
df['sent'] = df['text'].apply(lambda x: sent_tokenize(str(x)))
df['text_new'] = [word_tokenize(str(i)) for i in df['sent']]
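Note that word_tokenize(str(i)) tokenizes the string representation of the whole list, so the brackets, commas and quote characters also end up as tokens. If the goal is word tokens per sentence, a possible alternative sketch (assuming the text column still holds the raw strings rather than the output of sent_tokenize) is:

import pandas as pd
from nltk.tokenize import sent_tokenize, word_tokenize

df = pd.DataFrame({'text': ['Hey. Hello', 'hello world!! I am akshay', 'I m fine']})
df['sent'] = df['text'].apply(sent_tokenize)                                        # list of sentences per row
df['text_new'] = df['sent'].apply(lambda sents: [word_tokenize(s) for s in sents])  # nested token lists per sentence
print(df['text_new'])                                                               # e.g. row 0 -> [['Hey', '.'], ['Hello']]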