Here's an example. Say I have a paragraph of text:
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.stem import WordNetLemmatizer
paragraph = "There was a steaming mist in all the hollows, and it had roamed in its forlornness up the hill, like an evil spirit, seeking rest and finding none. A clammy and intensely cold mist, it made its slow way through the air in ripples that visibly followed and overspread one another, as the waves of an unwholesome sea might do. It was dense enough to shut out everything from the light of the coach-lamps but these its own workings, and a few yards of road; and the reek of the labouring horses steamed into it, as if they had made it all."
sent_tokenize(paragraph)
The output is:
['There was a steaming mist in all the hollows, and it had roamed in its forlornness up the hill, like an evil spirit, seeking rest and finding none.',
'A clammy and intensely cold mist, it made its slow way through the air in ripples that visibly followed and overspread one another, as the waves of an unwholesome sea might do.',
'It was dense enough to shut out everything from the light of the coach-lamps but these its own workings, and a few yards of road; and the reek of the labouring horses steamed into it, as if they had made it all.']
If I tokenize the paragraph into words, I can apply all kinds of text preprocessing:
words = word_tokenize(paragraph)
# lowercase each token
lowered = [word.lower() for word in words]
# lemmatize each token (instantiate the lemmatizer once, then apply it per word)
lemmatizer = WordNetLemmatizer()
lemmas = [lemmatizer.lemmatize(word) for word in words]
and so on.
But these preprocessing methods typically take a single word (a string) as input.
How can I preprocess the sentences while keeping them in that list-of-sentences data structure?
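To make the question concrete, here is a rough sketch of the kind of result I'm after (the names processed and processed_sentences are placeholders I made up), assuming I want to lowercase and lemmatize every word while keeping one entry per sentence:

# `paragraph` is the string defined above
lemmatizer = WordNetLemmatizer()

# Preprocess each sentence separately; the outer list keeps the
# one-entry-per-sentence structure that sent_tokenize produced.
processed = [
    [lemmatizer.lemmatize(word.lower()) for word in word_tokenize(sentence)]
    for sentence in sent_tokenize(paragraph)
]

# Optionally join the tokens back so each entry is a sentence string again
# (a plain " ".join is lossy around punctuation, so this is only a sketch).
processed_sentences = [" ".join(tokens) for tokens in processed]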