In this code, text is a list of length 10 containing strings.
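For illustration, a hypothetical stand-in of that shape (the real contents are not shown in the question):

text = ["This is sentence 1, with punctuation!",
        "Another line: numbers 2 and 3, more dots..."] * 5  # 10 strings in total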
import re
from nltk.tokenize import RegexpTokenizer

output = []
words = []
words1 = []
for i in range(len(text)):
    # strip digits
    output.append(re.sub(r'\d+', '', text[i]))
    words1.append(output[i].split())
    # convert to lower case
    words.append([word.lower() for word in words1[i]])

# remove punctuation (outside the loop, on a single element)
tokenizer = RegexpTokenizer(r'\w+')
tokens = tokenizer.tokenize(str(words[0]))  # words[0]: text has 10 elements, so an index like 12 would be out of range
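For reference, a standalone check of what RegexpTokenizer(r'\w+') returns (the sample string is made up):

from nltk.tokenize import RegexpTokenizer

tokenizer = RegexpTokenizer(r'\w+')
print(tokenizer.tokenize("hello, world! it's 1 'fine' line."))
# keeps only \w+ runs, so punctuation and apostrophes are dropped:
# ['hello', 'world', 'it', 's', '1', 'fine', 'line']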
This code removes the punctuation from the text. However, if I write the tokenizer part inside the loop, the punctuation still shows up in the text:
import re
from nltk.tokenize import RegexpTokenizer
from nltk.tokenize import sent_tokenize

output = []
words = []
words1 = []
tokenizer = []
for i in range(len(text1)):
    output.append(re.sub(r'\d+', '', text1[i]))
    words1.append(output[i].split())
    # convert to lower case
    words.append([word.lower() for word in words1[i]])
    # remove punctuation
    tokenizer.append(RegexpTokenizer(r'\w+'))
    tokenizer.append(sent_tokenize(str(words[i])))
Is the appending the problem?
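In the loop version, the RegexpTokenizer objects are only appended to the list and .tokenize() is never called on them, while sent_tokenize does not strip punctuation, which would explain the behaviour. A minimal sketch that actually applies the tokenizer to each element (assuming text1 is a list of strings; tokens is a name introduced here for illustration):

import re
from nltk.tokenize import RegexpTokenizer

tokenizer = RegexpTokenizer(r'\w+')  # one tokenizer instance is enough for the whole loop
tokens = []
for line in text1:
    no_digits = re.sub(r'\d+', '', line)        # strip digits
    lowered = no_digits.lower()                 # lower-case before tokenizing
    tokens.append(tokenizer.tokenize(lowered))  # tokenize() is the call that drops the punctuation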