Question

I'm working on a small project, and I want to use this to open a URL:

import re
import urllib.request
url = 'http://www.ygdy8.net/html/gndy/dyzz/index.html'
content = urllib.request.urlopen(url).read()

pat = re.compile('<div class="title_all"><h1><front color=#008800>.*?</a>>   </front></h1></div>'+ '(.*?)<td height="25" align="center" bgcolor="#F4FAE2"> ',re.S)
txt = ''.join(pat.findall(content))

But this gives me an error:

TypeError: can't use a string pattern on a bytes-like object

Then I tried:

txt = ''.join(pat.findall(content.decode()))

But that fails too:

    txt = ''.join(pat.findall(content.decode()))
    UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb5 in position 251: invalid start byte

I have searched for an answer, but I can't figure out how to fix it.

Answer 1

The page's header declares its charset as gb2312, so content.decode('gb2312', errors='ignore') should work.

>>> content.find(b'charset')
226
>>> content[226:226 + 20]
b'charset=gb2312">\r\n<t'
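The same lookup can be automated. The following is only a minimal sketch: the helper name decode_html, the 1024-byte search window, and the UTF-8 fallback are assumptions made for illustration, not part of the original answer.

```python
import re


def decode_html(raw: bytes, fallback: str = "utf-8") -> str:
    """Decode HTML bytes using the charset the page declares, if any.

    Hypothetical helper: the name, the 1024-byte window and the fallback
    encoding are illustrative choices.
    """
    # raw is bytes, so the pattern must be a bytes pattern (rb'...') as well.
    match = re.search(rb'charset=["\']?([\w-]+)', raw[:1024], re.I)
    encoding = match.group(1).decode("ascii") if match else fallback
    try:
        return raw.decode(encoding, errors="ignore")
    except LookupError:
        # The page declared a codec name that Python does not know.
        return raw.decode(fallback, errors="ignore")
```

With the content from the question, decode_html(content) would find charset=gb2312 and decode accordingly. errors="ignore" silently drops undecodable bytes, which is convenient for scraping but can hide a wrong encoding guess.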

However, your regular expression definitely won't work: you have front where the tag is font. Perhaps you want the following:

>>> pat = re.compile(r'<div class="title_all"><h1><font color=#008800>.*?</a>></font></h1></div>'+ r'(.*?)<td height="25" align="center" bgcolor="#F4FAE2"> ',re.S)

As far as I can tell, this captures the table that sits between the two fragments matched by the pattern.

>>> txt = ''.join(pat.findall(content.decode('gb2312',errors='ignore')))
>>> print(txt[:500])

<div class="co_content8">
<ul>

<td height="220" valign="top"> <table width="100%" border="0" cellspacing="0" cellpadding="0" class="tbspan" style="margin-top:6px">
<tr> 
<td height="1" colspan="2" background="/templets/img/dot_hor.gif"></td>
</tr>
<tr> 
<td width="5%" height="26" align="center"><img src="/templets/img/item.gif" width="18" height="17"></td>
<td height="26">
    <b>

        <a href="/html/gndy/dyzz/20160920/52002.html" class="ulink">2016年井柏然杨颖《微微一笑很倾城》HD国语中字</a>
    </b>
<
>>> pat.pattern
'<div class="title_all"><h1><font color=#008800>.*?</a>></font></h1></div>(.*?)<td height="25" align="center" bgcolor="#F4FAE2"> '
>>>
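To reproduce both errors and the fix without touching the network, here is a small sketch. The sample bytes are invented for illustration and only mimic the structure of the real page; the pattern is the corrected one from above.

```python
import re

# Invented stand-in for the downloaded page, encoded as GB2312 like the site.
sample = (
    '<div class="title_all"><h1><font color=#008800>'
    '<a href="/index.html">电影</a>></font></h1></div>'
    '<ul><a href="/html/1.html" class="ulink">电影一</a></ul>'
    '<td height="25" align="center" bgcolor="#F4FAE2"> '
).encode("gb2312")

pat = re.compile(
    r'<div class="title_all"><h1><font color=#008800>.*?</a>></font></h1></div>'
    r'(.*?)<td height="25" align="center" bgcolor="#F4FAE2"> ',
    re.S,
)

# 1) A str pattern cannot be applied to bytes.
try:
    pat.findall(sample)
except TypeError as exc:
    type_error = str(exc)

# 2) bytes.decode() defaults to UTF-8, which rejects the GB2312 bytes.
try:
    sample.decode()
    utf8_failed = False
except UnicodeDecodeError:
    utf8_failed = True

# 3) Decoding with the declared charset gives a str the pattern can search.
text = sample.decode("gb2312", errors="ignore")
txt = "".join(pat.findall(text))
print(txt)
```

The capturing group returns only the markup between the heading block and the closing td marker, which is the part the question is after.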
