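I am trying to use NLTK functions to convert text data into a numeric form for scikit-learn. The data I am working with is mostly short text strings.
Input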
NO 6 JALAN ASTAKA U8/82 SEKSYEN U8 BUKIT JELUTONG
MST GOLF PLAZA NO 8 JALAN SS13/5
Expected output
no jalan astaka u seksyen u bukit jelutong
mst golf plaza no jalan ss
My code
import re
import string

import nltk

user_defined_stop_words = ['kwun','tong']
i = nltk.corpus.stopwords.words('english')
j = list(string.punctuation) + user_defined_stop_words
newstopwords = set(i).union(j)

def preprocess(x):
    x = re.sub('[^a-z\s]', '', x.lower()) # get rid of noise
    x = [w for w in x.split() if w not in set(newstopwords)] # remove stopwords
    return ' '.join(x)

data['Clean_addr'] = data['Adj_Addr'].apply(preprocess)
Error
~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\series.py in apply(self, func, convert_dtype, args, **kwds)
2353 else:
2354 values = self.asobject
-> 2355 mapped = lib.map_infer(values, f, convert=convert_dtype)
2356
2357 if len(mapped) and isinstance(mapped[0], Series):
pandas/_libs/src\inference.pyx in pandas._libs.lib.map_infer()
<ipython-input-55-3e3b1d8472ed> in preprocess(x)
5
6 def preprocess(x):
----> 7 x = re.sub('[^a-z\s]', '', x.lower()) # get rid of noise
8 x = [w for w in x.split() if w not in set(newstopwords)] # remove stopwords
9 return ' '.join(x)
AttributeError: 'float' object has no attribute 'lower'
How can I fix this?
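
One likely cause, assuming the Adj_Addr column contains missing values: pandas represents a missing entry as a float NaN, and a float has no .lower() method, which would produce exactly this AttributeError. Below is a minimal sketch of a guarded preprocess that returns an empty string for any non-string value. The small stop_words set and the sample list are hypothetical stand-ins for the NLTK-based list and the real column, so that the example runs on its own.

```python
import re

# Hypothetical stop-word set for illustration; the code above builds its own
# from NLTK's English stop words, string.punctuation and user-defined words.
stop_words = {'kwun', 'tong'}

def preprocess(x):
    # A missing value reaches this function as float NaN, which has no
    # .lower(); return an empty string for anything that is not a str.
    if not isinstance(x, str):
        return ''
    x = re.sub(r'[^a-z\s]', '', x.lower())  # keep only lowercase letters and whitespace
    return ' '.join(w for w in x.split() if w not in stop_words)

# Stand-in for the column values: the two addresses above plus one missing value.
values = [
    'NO 6 JALAN ASTAKA U8/82 SEKSYEN U8 BUKIT JELUTONG',
    'MST GOLF PLAZA NO 8 JALAN SS13/5',
    float('nan'),
]
cleaned = [preprocess(v) for v in values]
print(cleaned)

# With the DataFrame from the question, the call itself stays unchanged:
# data['Clean_addr'] = data['Adj_Addr'].apply(preprocess)
```

The guard also maps genuinely numeric entries to an empty string. If those should be kept as text instead, an alternative is to convert the column first, for example with data['Adj_Addr'].fillna('').astype(str), and leave preprocess as it was. Separately, NLTK's English stop-word list includes "no", so with the full list from the question the "no" shown in the expected output would be removed as well.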