Question

I want to extract all the links inside td elements whose class is u-ctitle.

import os
import urllib
import lxml.html
down='http://v.163.com/special/opencourse/bianchengdaolun.html'
file=urllib.urlopen(down).read()
root=lxml.html.document_fromstring(file)
namelist=root.xpath('//td[@class="u-ctitle"]/a')
len(namelist)

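The output is [], yet the page contains many td elements whose class is "u-ctitle", as you can see with Firebug. Why are they not extracted?

My Python version is 2.7.9.

Renaming the variable file to something else does not help.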

Answer 1

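Your XPath is correct. The problem lies elsewhere.

If you examine the HTML, you will see a meta tag that declares the charset as GBK.

In this code:

file=urllib.urlopen(down).read()
root=lxml.html.document_fromstring(file)

file is actually a byte sequence, so the decoding from GBK-encoded bytes to a Unicode string happens inside the document_fromstring method.

The problem is that the HTML is not actually encoded in GBK, and lxml decodes it incorrectly, which leads to data loss. Decoding the bytes as GBK by hand shows the failure:

>>> file.decode('gbk')
Traceback (most recent call last):
  File "down.py", line 9, in <module>
    file.decode('gbk')
UnicodeDecodeError: 'gbk' codec can't decode bytes in position 7247-7248: illegal multibyte sequence

After some trial and error, we find that the actual encoding is GB18030. To make the script work, you need to decode the bytes manually:

root=lxml.html.document_fromstring(file.decode('GB18030'))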
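The trial and error over encodings can be sketched with the standard library alone. This is only an illustration, not the exact procedure used for the answer: decode_with_fallback is a hypothetical helper, the sample bytes are made up rather than taken from the real page, and the candidate list is an assumption.

```python
# -*- coding: utf-8 -*-

def decode_with_fallback(raw, encodings=("gbk", "gb18030")):
    """Return (text, encoding) for the first candidate that decodes raw strictly.

    Hypothetical helper; the candidate list is an assumption for this sketch.
    """
    for name in encodings:
        try:
            return raw.decode(name), name
        except UnicodeDecodeError:
            # This candidate cannot represent the bytes; try the next one.
            continue
    raise ValueError("none of the candidate encodings could decode the data")


# Made-up sample: U+20000 is a rare ideograph that GB18030 stores as four
# bytes, a form that GBK does not have, so a strict GBK decode fails on it.
sample = u'<td class="u-ctitle"><a href="#">\U00020000</a></td>'.encode("gb18030")

text, used = decode_with_fallback(sample)
print(used)
```

The text returned this way is already a Unicode string, so it can be passed to lxml.html.document_fromstring in the same way as file.decode('GB18030') above.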
