Cannot scrape a web page correctly because of an encoding problem

Asked: 2020-11-11 14:15:12

Tags: python python-3.x beautifulsoup encoding request

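Although I set the encoding so that Turkish characters are handled, this web page is not captured and displayed correctly. The same code works for every other page that has the same layout and is served under the same charset and domain. I don't understand why this one page behaves differently. Any ideas? Thanks in advance!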

For example, I get:

BilgisayarMühendisliÄŸiBölüm¼

instead of:

BilgisayarMühendisliğiBölümü

# -*- coding: utf-8 -*-
import requests
from bs4 import BeautifulSoup

url = "http://bmb.osmaniye.edu.tr/personel-akademik"
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser', from_encoding="utf-8")

print(soup.original_encoding)
print(soup)

Output:

windows-1252
<!DOCTYPE html>

<html lang="en"><head>
<title>Osmaniye Korkut Ata Üniversitesi - Bilgisayar Mühendisliği Bölümü</title>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<!-------------<meta http-equiv="Content-Type" content="text/html; charset=windows-1254" />---------------->
<meta content="width=device-width, initial-scale=1.0" name="viewport"/>
<meta content="" name="google-site-verification">
<meta content=",Bilgisayar Mühendisliği Bölümü" name="keywords"/>

1 Answer:

Answer 0 (score: 0)

For your future web-scraping work, you may want to try this first:

page.encoding = page.apparent_encoding
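To show where that one-liner fits, here is a minimal sketch. The helper names `fetch_text` and `decode_with_detected_encoding` and the `timeout` value are assumptions made for illustration; only `requests.get`, `Response.encoding`, `Response.apparent_encoding`, and `Response.text` come from the requests library itself. Note that `apparent_encoding` is a guess made from the body bytes, so it can be wrong on short or ambiguous pages.

```python
import requests


def decode_with_detected_encoding(resp: requests.Response) -> str:
    """Return the body decoded with the encoding guessed from its bytes."""
    # resp.encoding is taken from the Content-Type header. For text/*
    # responses without a charset, requests falls back to ISO-8859-1,
    # which garbles UTF-8 pages.
    detected = resp.apparent_encoding  # guess based on the raw bytes
    if detected:
        resp.encoding = detected
    # resp.text decodes resp.content using resp.encoding.
    return resp.text


def fetch_text(url: str, timeout: float = 10.0) -> str:
    """Hypothetical convenience wrapper: download, then decode as above."""
    resp = requests.get(url, timeout=timeout)
    resp.raise_for_status()
    return decode_with_detected_encoding(resp)
```

The returned `str` can then be passed to `BeautifulSoup(..., "html.parser")` directly, so that BeautifulSoup does not have to guess the encoding again.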

Alternatively, as suggested, decode the raw bytes yourself with the backslashreplace error handler.

For example:

import requests
from bs4 import BeautifulSoup

page = requests.get("http://bmb.osmaniye.edu.tr/personel-akademik")
# Decode the raw bytes as UTF-8; invalid bytes become visible \xNN escapes instead of raising.
html = page.content.decode("utf-8", "backslashreplace")
title = BeautifulSoup(html, "html.parser").find("title").get_text(strip=True)
print(title)

This gives you:

Osmaniye Korkut Ata Üniversitesi - Bilgisayar Mühendisliği Bölümü
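
The garbled characters in the question are what you would expect when UTF-8 bytes are decoded as windows-1252. The standard-library sketch below reproduces that effect on a sample string and also shows what the backslashreplace handler does with bytes that are not valid UTF-8. The sample strings are illustrative and are not taken from the actual page bytes.

```python
original = "Bilgisayar Mühendisliği Bölümü"
utf8_bytes = original.encode("utf-8")

# Decoding UTF-8 bytes with the wrong single-byte codec produces mojibake:
# each two-byte UTF-8 sequence turns into two unrelated characters.
garbled = utf8_bytes.decode("windows-1252")
print(garbled)  # Bilgisayar MÃ¼hendisliÄŸi BÃ¶lÃ¼mÃ¼

# Undoing the wrong step recovers the text, provided no bytes were lost.
repaired = garbled.encode("windows-1252").decode("utf-8")
print(repaired == original)  # True

# backslashreplace never raises: bytes that are not valid UTF-8
# (here, ISO-8859-9 encoded Turkish letters) are kept as \xNN escapes.
latin5_bytes = "Bölümü".encode("iso-8859-9")
escaped = latin5_bytes.decode("utf-8", "backslashreplace")
print(escaped)  # B\xf6l\xfcm\xfc
```

If the escapes from the last step show up in your scraped text, the page was probably not UTF-8 in the first place, and detecting the encoding is the safer route.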