Question

hxs = lxml.html.document_fromstring(requests.get("http://www.imdb.com/title/" + id).content)

movie = {}

try:
    movie['title'] = hxs.xpath('//*[@id="overview-top"]/h1/span[1]/text()')[0].strip()
except IndexError:
    movie['title'] = ''

I don't understand the expression hxs.xpath('//*[@id="overview-top"]/h1/span[1]/text()')[0].strip()

Answer 1

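The function lxml.html.document_fromstring(string) parses a document from the given string. It always creates a correct HTML document, which means the parent node is <html>, and there is a body and possibly a head.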

hxs = lxml.html.document_fromstring(requests.get("http://www.imdb.com/title/" + id).content)

You can view the parsed HTML with this code:

print(lxml.html.tostring(hxs))
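To see the "always a correct HTML document" behaviour without making a network request, here is a minimal sketch that parses a bare fragment. The sample string and the variable names are made up for illustration, and the code assumes Python 3 with lxml installed.

```python
import lxml.html

# A bare fragment with no <html> or <body> wrapper (made-up sample input).
fragment = "<p>Hello</p>"

doc = lxml.html.document_fromstring(fragment)

# The returned root is the <html> element, and a <body> was created
# to hold the paragraph even though the input had neither.
root_tag = doc.tag
body_tag = doc.body.tag

# encoding="unicode" asks tostring() for a str instead of bytes.
serialized = lxml.html.tostring(doc, encoding="unicode")

print(root_tag)
print(body_tag)
print(serialized)
```

Because the root is always the html element, an XPath that starts with // searches the whole document no matter how small the input was.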

Take the IMDb movie with ID tt1905041 and look at the HTML source of its page:

<td id="overview-top">
    <h1 class="header"> <span class="itemprop" itemprop="name">Fast &amp; Furious 6</span>
        <span class="nobr">(<a href="/year/2013/?ref_=tt_ov_inf" >2013</a>)</span>
    </h1>
</td>

We need the title, so we walk the HTML from the outermost element inward:

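//*[@id="overview-top"] selects the element with the required id, wherever it appears in the document.

/h1 selects the h1 element directly inside that element.

/span[1] selects the first span inside the h1, since there are several span elements. /text() returns the text nodes of that span as a list, so [0] takes the first one and strip() trims the surrounding whitespace. Putting the steps together in the same order as the HTML nesting, we get the following code: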

movie['title'] = hxs.xpath('//*[@id="overview-top"]/h1/span[1]/text()')[0].strip()
print(movie['title'])

Output: Fast & Furious 6
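The same expression can be tried offline, one step at a time. The sketch below is only an illustration: it stores the HTML fragment quoted above in a local string instead of fetching it from IMDb, wraps it in table and tr tags (an assumption made here so the td sits in valid markup), and uses made-up variable names. It assumes Python 3 with lxml installed.

```python
import lxml.html

# The fragment from the answer, kept locally so no request is needed.
# The <table><tr> wrapper is an assumption added for this sketch.
html = '''
<table><tr>
<td id="overview-top">
    <h1 class="header"> <span class="itemprop" itemprop="name">Fast &amp; Furious 6</span>
        <span class="nobr">(<a href="/year/2013/?ref_=tt_ov_inf" >2013</a>)</span>
    </h1>
</td>
</tr></table>
'''

hxs = lxml.html.document_fromstring(html)

# Step 1: any element whose id attribute is "overview-top".
container = hxs.xpath('//*[@id="overview-top"]')

# Step 2: the span elements that are direct children of its h1.
spans = hxs.xpath('//*[@id="overview-top"]/h1/span')

# Step 3: the text nodes of the first span only; xpath() returns a list.
texts = hxs.xpath('//*[@id="overview-top"]/h1/span[1]/text()')

# [0] takes the first text node, strip() removes surrounding whitespace.
title = texts[0].strip()
print(title)

# When nothing matches, xpath() returns an empty list, so indexing it
# with [0] raises IndexError. That is the case the try/except handles.
missing = hxs.xpath('//*[@id="no-such-id"]/h1/span[1]/text()')
print(missing)
```

The empty list in the last step explains the except IndexError branch in the question: if IMDb changes its markup and the path no longer matches, [0] fails instead of returning a title.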

See the XPath documentation for more information.
