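Although I set the encoding so that Turkish characters are detected, this web page is not parsed and displayed correctly. The same code works for every other page I tried on the same domain that uses the same charset. I don't understand why this happens. Any ideas? Thanks in advance!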
For example, I get:
Bilgisayar MÃ¼hendisliÄŸi BÃ¶lÃ¼mÃ¼
instead of:
Bilgisayar Mühendisliği Bölümü
# -*- coding: utf-8 -*-
import requests
from bs4 import BeautifulSoup
url = "http://bmb.osmaniye.edu.tr/personel-akademik"
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser', from_encoding="utf-8")
print(soup.original_encoding)
print(soup)
Output:
windows-1252
<!DOCTYPE html>
<html lang="en"><head>
<title>Osmaniye Korkut Ata Üniversitesi - Bilgisayar Mühendisliği Bölümü</title>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<!-------------<meta http-equiv="Content-Type" content="text/html; charset=windows-1254" />---------------->
<meta content="width=device-width, initial-scale=1.0" name="viewport"/>
<meta content="" name="google-site-verification">
<meta content=",Bilgisayar Mühendisliği Bölümü" name="keywords"/>
Answer 0 (score: 0)
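For your future web-scraping work, you may want to try this first, so that requests decodes page.text using the encoding it detects from the response body: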
page.encoding = page.apparent_encoding
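To see why the choice of encoding matters here, the following is a minimal stdlib-only sketch (it does not use requests or touch the network, and the variable names are mine). It assumes the page body is UTF-8, as its meta tag declares, and shows that decoding those bytes as windows-1252 produces the kind of garbling reported in the question.

```python
# Stdlib-only illustration; no network access and no requests/bs4 needed.
title = "Bilgisayar Mühendisliği Bölümü"

# What a UTF-8 page contains on the wire: the title as UTF-8 bytes.
raw = title.encode("utf-8")

# Decoding those bytes with the wrong codec produces the garbled form.
garbled = raw.decode("windows-1252")

# Decoding with the codec the bytes were written in restores the text.
correct = raw.decode("utf-8")

print(garbled)
print(correct)
```

The first line printed is the garbled title and the second is the correct one, so the bytes themselves are fine and only the decoding step differs.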
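Alternatively, as suggested, decode the raw bytes yourself as UTF-8 and use backslash replacement for any bytes that are not valid UTF-8.
For example: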
import requests
from bs4 import BeautifulSoup

page = requests.get("http://bmb.osmaniye.edu.tr/personel-akademik")
# Decode the bytes as UTF-8; invalid bytes become \xNN escapes instead of raising.
html = page.content.decode("utf-8", "backslashreplace")
title = BeautifulSoup(html, 'html.parser').find("title").get_text(strip=True)
print(title)
This gives you:
Osmaniye Korkut Ata Üniversitesi - Bilgisayar Mühendisliği Bölümü
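If you are curious what "backslashreplace" actually does, here is a small stdlib-only sketch. The byte string is hypothetical: valid UTF-8 text with one stray 0xFF byte appended to stand in for whatever invalid byte a real page might contain. I have not checked which bytes on this particular page are invalid, if any.

```python
# Stdlib-only illustration of the "backslashreplace" error handler.
# Hypothetical input: valid UTF-8 text plus one stray, invalid 0xFF byte.
raw = "Bölümü".encode("utf-8") + b"\xff"

# Strict decoding (the default) raises on the invalid byte.
try:
    raw.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# backslashreplace keeps the valid text and renders the bad byte as an escape.
lenient = raw.decode("utf-8", "backslashreplace")

print(strict_failed)
print(lenient)
```

The valid Turkish characters come through unchanged and only the bad byte is shown as a literal \xff, which is why this approach is a reasonable fallback when a page is mostly, but not entirely, valid UTF-8.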