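I am trying to clean up some text. I want to keep only letters and digits, but the cleaned text still contains other characters.
Here is my function: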
import re

from bs4 import BeautifulSoup
from nltk.corpus import stopwords


def review_to_wordlist(review, remove_stopwords=False, remove_numbers=False):
    # Function to convert a document to a sequence of words,
    # optionally removing stop words and numbers. Returns a list of words.
    #
    # 1. Remove HTML (the parser is named explicitly to avoid a bs4 warning)
    review_text = BeautifulSoup(review, "html.parser").get_text()
    #
    # 2. Remove non-letters (digits are kept unless remove_numbers is set)
    pattern = "[^a-zA-Z]" if remove_numbers else "[^a-zA-Z0-9]"
    review_text = re.sub(pattern, " ", review_text)
    #
    # 3. Convert words to lower case and split them
    words = review_text.lower().split()
    #
    # 4. Optionally remove stop words (false by default)
    if remove_stopwords:
        stops = set(stopwords.words("english"))
        words = [w for w in words if w not in stops]
    #
    # 5. Return a list of words
    return words
Here is one of the results I get:
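NuTone Central Vacuum System 45 Ell Ohio Steel Tandem Natural and Synthetic Lawn Sweeper System Unique Home Designs 36 in. x 80 in. Su Casa Black Surface Mount Outswing Steel Security Door with Expanded Metal Screen Unique Home Designs 36 in. x 80 in. Su Casa Black Surface Mount Outswing Steel Security Door with Expanded Metal Screen Unique Home Designs 36 in. x 80 in. Su Casa Black Surface Mount Outswing Steel Security Door with Expanded Metal Screen MP Global Best 400 in. x 36 in. x 1/8 in. Acoustical Recycled Fiber Underlayment with Film for Laminate Wood MP Global Best 400 in. x 36 in. x 1/8 in. Acoustical Recycled Fiber Underlayment for Laminate Wood Clamp #10-1/4 in. x 2-1/2 in. 8 Bright Steel Ring-Shank Common Nails (1 lb. Pack)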
The error I get is:
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 5-6: unexpected end of data
676
Husky Pneumatic 3-1/2 in. 21� Full-Head Strip Framing Nailer
5157
RIDGID 3-1/2 in. 21� Round-Head Nailer
5158
RIDGID 3-1/2 in. 21� Round-Head Nailer
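
A minimal sketch of one way to sidestep this kind of error, under the assumption that the raw data arrives as bytes in a single-byte encoding such as Latin-1. That is only a guess, based on the � appearing where a degree sign would be expected; the real encoding of the data is not known. The helper name `decode_and_clean` and its `encoding` parameter are hypothetical and not part of the function above. The idea is to decode explicitly before applying the regular expression, so the pattern works on characters rather than on raw bytes:

```python
import re


def decode_and_clean(raw, encoding="latin-1"):
    """Hypothetical helper: decode bytes, then keep only ASCII letters and digits."""
    # Decode explicitly. The encoding is an assumption; errors="replace"
    # substitutes U+FFFD for undecodable bytes instead of raising an error.
    if isinstance(raw, bytes):
        text = raw.decode(encoding, errors="replace")
    else:
        text = raw
    # Replace every character that is not an ASCII letter or digit with a space.
    text = re.sub(r"[^a-zA-Z0-9]", " ", text)
    # Lower-case and split on whitespace, as in the original function.
    return text.lower().split()


# 0xB0 is the degree sign in Latin-1 and is not valid on its own in UTF-8.
raw = b"RIDGID 3-1/2 in. 21\xb0 Round-Head Nailer"
print(decode_and_clean(raw))
# ['ridgid', '3', '1', '2', 'in', '21', 'round', 'head', 'nailer']
```

If the data turns out to be in a different encoding, only the `encoding` argument would need to change; the cleaning step stays the same.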