How to fix Cyrillic characters when web scraping with Python

Time: 2019-04-22 21:10:51

Tags: python web-scraping beautifulsoup character-encoding cyrillic

I am scraping a Cyrillic website with Python using BeautifulSoup, but I am having some trouble: every word is displayed like this:

  

СилÑанÐÐÐÐÐÐÐÐÐÐавÐÐÐÐÐÐÐÐÐÐÐÐÐÐÐÐÐÐÐÐ

I also tried some other Cyrillic websites, and they work fine.

My code is this:

from bs4 import BeautifulSoup
import requests

source = requests.get('https://time.mk/').text

soup = BeautifulSoup(source, 'lxml')

print(soup.prettify())

How should I fix this?

1 answer:

Answer 0: (score: 2)

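requests fails to detect the response as UTF-8. This typically happens when the Content-Type header does not name a charset, in which case requests falls back to ISO-8859-1 for text responses. Set the encoding yourself before reading .text: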

from bs4 import BeautifulSoup
import requests

source = requests.get('https://time.mk/')  # don't convert to text just yet

# print(source.encoding)
# prints out ISO-8859-1

source.encoding = 'utf-8'  # override encoding manually

soup = BeautifulSoup(source.text, 'lxml')  # this will now decode utf-8 correctly
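
To see why the output looks like that, here is a minimal sketch that reproduces the effect without any network access. The sample string is just an assumption for illustration (any Cyrillic text behaves the same way): UTF-8 bytes decoded as ISO-8859-1 turn each Cyrillic letter into two Latin-1 characters, and reversing the wrong decode recovers the original text.

```python
# Hypothetical sample text, not taken from the site in the question.
original = "Скопје"

# Bytes as a server would send them for a UTF-8 page.
raw = original.encode("utf-8")

# What you get when those bytes are decoded with the wrong guess (ISO-8859-1).
garbled = raw.decode("iso-8859-1")

# Undo the wrong decode, then decode properly as UTF-8.
repaired = garbled.encode("iso-8859-1").decode("utf-8")

print(len(original), len(garbled))  # each Cyrillic letter became two characters
print(repaired == original)
```

The same round trip can rescue text that was already decoded incorrectly, but setting source.encoding before reading source.text, as above, avoids the problem in the first place.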