import bs4 as bs
import urllib.request
import re
import os
from colorama import Fore, Back, Style, init

init()

# keywords (words to highlight) and newurls (list of URLs) are defined elsewhere, see below

def highlight(word):
    # colour a word red if it is one of the keywords
    if word in keywords:
        return Fore.RED + str(word) + Fore.RESET
    else:
        return str(word)

for newurl in newurls:
    url = urllib.request.urlopen(newurl)
    soup1 = bs.BeautifulSoup(url, 'lxml')
    paragraphs = soup1.findAll('p')
    print(Fore.GREEN + soup1.h2.text + Fore.RESET)
    print('')
    for paragraph in paragraphs:
        if paragraph is not None:
            textpara = paragraph.text.strip().split(' ')
            colored_words = list(map(highlight, textpara))
            print(" ".join(colored_words).encode("utf-8"))  # encode("utf-8")
        else:
            pass
I list out the keywords and the list of URLs separately. After running a few keywords over the URLs, I get output like this:
b'\x1b[31mthe desired \x1b[31mmystery corners \x1b[31mthe differential . \x1b[31mthe back \x1b[31mpretends to be \x1b[31mthe'
I removed the encode("utf-8") and got an encoding error:
Traceback (most recent call last):
  File "C:\Users\resea\Desktop\Python Projects\Try 3.py", line 52, in <module>
    print(" ".join(colored_words)) #encode("utf-8")
  File "C:\Python34\lib\site-packages\colorama\ansitowin32.py", line 41, in write
    self.__convertor.write(text)
  File "C:\Python34\lib\site-packages\colorama\ansitowin32.py", line 162, in write
    self.write_and_convert(text)
  File "C:\Python34\lib\site-packages\colorama\ansitowin32.py", line 190, in write_and_convert
    self.write_plain_text(text, cursor, len(text))
  File "C:\Python34\lib\site-packages\colorama\ansitowin32.py", line 195, in write_plain_text
    self.wrapped.write(text[start:end])
  File "C:\Python34\lib\encodings\cp850.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u2019' in position 23: character maps to <undefined>
Where am I going wrong?
Answer 0 (score: 0)
I know that what I am about to suggest is less a "solution" than a workaround, but I have been frustrated time and again by having to wrangle all sorts of odd characters into "this encoding" or "that encoding", sometimes with success and often without.
Depending on the kind of text found at your newurl pages, the range of problem characters may be quite limited, so I deal with them case by case. Whenever I hit one of these errors, I do this:
import unicodedata
unicodedata.name('\u2019')
In your case, that gives you:
'RIGHT SINGLE QUOTATION MARK'
The good old, pesky right single quotation mark... So, as suggested here, I simply replace that pesky character with one that looks like it but does not raise the error; in your case
colored_words = [highlight(word.replace(u"\u2019", "'")) for word in textpara]  # or some other replacement character
should work. Then you rinse and repeat each time this error appears. Admittedly it is not the most elegant solution, but after a while all the odd characters that can occur in your newurl pages have been caught and the errors stop.
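Once a few of these show up, it can be tidier to collect the substitutions in a single table instead of chaining .replace() calls. Below is a minimal sketch of that idea; the REPLACEMENTS dict and sanitize helper are hypothetical names I am introducing here, while highlight and textpara are the names from the question.

# Hypothetical replacement table, grown "rinse and repeat" style each time
# a new UnicodeEncodeError points at another character.
REPLACEMENTS = {
    "\u2019": "'",   # RIGHT SINGLE QUOTATION MARK
    "\u2018": "'",   # LEFT SINGLE QUOTATION MARK
    "\u201c": '"',   # LEFT DOUBLE QUOTATION MARK
    "\u201d": '"',   # RIGHT DOUBLE QUOTATION MARK
    "\u2013": "-",   # EN DASH
}

def sanitize(word):
    # swap each troublesome character for a printable look-alike
    for bad, good in REPLACEMENTS.items():
        word = word.replace(bad, good)
    return word

colored_words = [highlight(sanitize(word)) for word in textpara]
print(" ".join(colored_words))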