I'm scraping a Cyrillic website with BeautifulSoup in Python, but I've run into a problem: every word comes out like this:
СилÑанÐÐÐÐÐÐÐÐÐÐавÐÐÐÐÐÐÐÐÐÐÐÐÐÐÐÐÐÐÐÐ
I tried a few other Cyrillic websites as well, and they work fine.
My code looks like this:
from bs4 import BeautifulSoup
import requests
source = requests.get('https://time.mk/').text
soup = BeautifulSoup(source, 'lxml')
print(soup.prettify())
How can I fix this?
Answer 0 (score: 2)
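requests does not detect this page as utf-8. It reports ISO-8859-1, which is its fallback for text responses whose headers declare no charset, so .text decodes the UTF-8 bytes with the wrong codec. Override the encoding before reading .text: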
from bs4 import BeautifulSoup
import requests
source = requests.get('https://time.mk/') # don't convert to text just yet
# print(source.encoding)
# prints out ISO-8859-1
source.encoding = 'utf-8' # override encoding manually
soup = BeautifulSoup(source.text, 'lxml') # this will now decode utf-8 correctly
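To see why the output looks garbled, the mismatch can be reproduced offline. This is a minimal sketch using only the standard library; the sample word is an illustrative assumption, not text taken from time.mk. It encodes Cyrillic text as UTF-8, decodes the bytes as ISO-8859-1 (what happens when the wrong encoding is assumed), and then reverses the mistake.

```python
# Offline sketch: reproduce the garbling without any network access.
# The sample string is a made-up example, not actual page content.
original = "Силјан"

# The server sends UTF-8 bytes.
raw_bytes = original.encode("utf-8")

# Decoding those bytes as ISO-8859-1 never fails (every byte maps to a
# character), but each Cyrillic letter turns into two unrelated characters.
garbled = raw_bytes.decode("iso-8859-1")

# Reversing the wrong decode and applying the right one restores the text.
repaired = garbled.encode("iso-8859-1").decode("utf-8")

print(garbled)
print(repaired)
```

Setting source.encoding = 'utf-8' avoids the round trip entirely, because .text is then decoded correctly the first time.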