Question

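I am developing a crawler that collects URLs broadly. It opens a URL, extracts every URL on that page, and then visits those in turn. My problem is that I only need Chinese web pages, not pages in other languages. I tried using str.isalpha() to decide whether a title is English, but it does not work well: encoding problems show up, and some titles fetched from gb2312 pages come out like this: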

['ä¸\xadå\x9b½å\x86\x9bç½\x91-ä¸\xadå\x9b½äºº°\x91è§£æ\x94¾å\x86\x9bå®\x98æ\x96¹å\x86\x9bäº\x8bæ\x96°é\x97»é\x97¨æ\x88·']

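There are too many sites for me to work out each one's encoding by hand. Is there a good way to tell whether a web page is Chinese? Please help.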
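The garbled title above looks like UTF-8 bytes that were decoded as ISO-8859-1 (Latin-1), which is the encoding requests falls back to for text responses whose Content-Type header names no charset. Here is a minimal sketch of that round trip. The sample string 中国军网 and the helper name repair_mojibake are assumptions made for illustration, not part of the original code.

```python
def repair_mojibake(text, wrong="latin-1", right="utf-8"):
    """Undo a wrong decode: re-encode with the wrong codec, then decode with the right one."""
    try:
        return text.encode(wrong).decode(right)
    except (UnicodeEncodeError, UnicodeDecodeError):
        # The text was not produced by this particular mix-up; leave it unchanged.
        return text


original = "中国军网"                                   # assumed sample title
garbled = original.encode("utf-8").decode("latin-1")    # reproduces the mojibake
repaired = repair_mojibake(garbled)                     # recovers the original text
```

If this reading is right, the page is probably served as UTF-8 and the problem lies in how the bytes are decoded, not in the page itself.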

I would like to judge whether a page is Chinese from its title, but the encoding problem gets in the way:

import re

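import requests

# Placeholder request headers; the original snippet used `headers` without defining it.
headers = {'User-Agent': 'Mozilla/5.0'}


def get_html(url):
    req = requests.get(url, headers=headers, timeout=10)
    if req.status_code in [200, 210, 304]:
        # req.text is decoded with the charset named in the HTTP headers; for text
        # responses that name none, requests assumes ISO-8859-1, which garbles Chinese.
        _text = req.text
        res = '<title>(.*?)</title>'
        title = re.findall(res, _text)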
        print(title)
    else:
        print(f"{req.status_code}")


if __name__ == '__main__':
    get_html('http://www.81.cn/')

# This is the code I found for checking whether text is Chinese, but it does not deal with encoding.

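def __call__(self, value):
    # Reject the value unless every character lies in the range U+4E00..U+9FA5.
    # ValidationError, self.message and self.code come from the validator class this was taken from.
    if not all('\u4e00' <= ch <= '\u9fa5' for ch in value):
        raise ValidationError(self.message, code=self.code)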
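The range check above only makes sense once the title has been decoded correctly, so one possible approach is to work from the raw bytes instead of req.text. This is a rough sketch, not a definitive method. The function names, the 4096-byte search window, the fallback order (declared charset, then UTF-8, then GB18030) and the 0.3 threshold are all assumptions chosen for illustration.

```python
import re

# Charset declared in a <meta> tag, searched for in the raw bytes.
META_CHARSET = re.compile(rb'<meta[^>]+charset=["\']?\s*([\w-]+)', re.I)
TITLE = re.compile(r"<title[^>]*>(.*?)</title>", re.I | re.S)


def decode_html(raw):
    """Decode page bytes: declared charset first, then UTF-8, then GB18030."""
    candidates = []
    m = META_CHARSET.search(raw[:4096])
    if m:
        candidates.append(m.group(1).decode("ascii"))
    candidates += ["utf-8", "gb18030"]
    for enc in candidates:
        try:
            return raw.decode(enc)
        except (UnicodeDecodeError, LookupError):
            continue  # wrong or unknown codec, try the next candidate
    return raw.decode("utf-8", errors="replace")


def chinese_ratio(text):
    """Fraction of non-space characters in the CJK Unified Ideographs block."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum("\u4e00" <= c <= "\u9fff" for c in chars) / len(chars)


def is_chinese_page(raw, threshold=0.3):
    """Guess whether a page is Chinese from the share of CJK characters in its title."""
    m = TITLE.search(decode_html(raw))
    title = m.group(1).strip() if m else ""
    return chinese_ratio(title) >= threshold
```

With requests, the bytes would come from req.content, for example is_chinese_page(req.content). Japanese kanji fall in the same Unicode block, so a title check like this cannot by itself separate Chinese from Japanese pages.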

How can Python determine whether a web page is in Chinese?

0 answers: