Clarification on HTML scraping with Python

Time: 2018-03-06 02:33:00

Tags: python html python-3.x lxml

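I am teaching myself how to get information from websites, and I am confused about how to actually use lxml. Say I want to print the heading of the contents box on this Wikipedia page. I would start with: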

from lxml import html
import requests
site = requests.get('https://en.wikipedia.org/wiki/Hamiltonian_mechanics')
tree = html.fromstring(site.content)

But now I don't know what the correct XPath is. I naively highlighted the contents block on the page and just used

contents=tree.xpath('//*[@id="toc"]/div/h2')

This of course does not give me what I want (I get an empty list). What should I do?

2 answers:

Answer 0: (score: 0)

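from lxml import html
import requests

# Download the page and parse it into an element tree.
site = requests.get('https://en.wikipedia.org/wiki/Hamiltonian_mechanics')
tree = html.fromstring(site.content)
# text() selects the heading's text nodes; [0] takes the first match.
contents = tree.xpath('//*[@id="toc"]/div/h2/text()')[0]
print(contents)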

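You can test an XPath in Chrome. Open 'https://en.wikipedia.org/wiki/Hamiltonian_mechanics' in Chrome and press F12. In the console, enter $x('//*[@id="toc"]/div/h2'), and it will output the h2 elements. If you want the content of the h2, the XPath should be $x('//*[@id="toc"]/div/h2/text()'), and the result is an array of the text content.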
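The difference between selecting an element and selecting its text can also be checked offline in Python. This is only a minimal sketch: the `SAMPLE_HTML` string is made up for illustration and merely imitates the structure this XPath expects, so it is not the real Wikipedia markup.

```python
from lxml import html

# Hypothetical markup, written only to match the XPath used in this answer.
SAMPLE_HTML = """
<html><body>
  <div id="toc">
    <div><h2>Contents</h2></div>
  </div>
</body></html>
"""

tree = html.fromstring(SAMPLE_HTML)

# Without text(): the result is a list of element objects.
elements = tree.xpath('//*[@id="toc"]/div/h2')
# With text(): the result is a list of strings.
texts = tree.xpath('//*[@id="toc"]/div/h2/text()')

print(elements[0].tag)
print(texts)
```

If the first list comes back empty on a real page, the path does not match that page's markup, and indexing it with [0] would raise an IndexError.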

Answer 1: (score: 0)

If I understand correctly, you want the parent titles. If you analyze the page structure:

//div[@id="toc"]/ul/li/a/span[@class="toctext"] reaches all the top-level titles, so to retrieve them all the code would be:

from lxml import html 
import requests
site=requests.get('https://en.wikipedia.org/wiki/Hamiltonian_mechanics')
tree=html.fromstring(site.content)
contents=tree.xpath('//div[@id="toc"]/ul/li/a/span[@class="toctext"]/text()')
print(contents)

The output is:

['Overview', "Deriving Hamilton's equations", 'As a reformulation of Lagrangian mechanics', 'Geometry of Hamiltonian systems', 'Generalization to quantum mechanics through Poisson bracket', 'Mathematical formalism', 'Riemannian manifolds', 'Sub-Riemannian manifolds', 'Poisson algebras', 'Charged particle in an electromagnetic field', 'Relativistic charged particle in an electromagnetic field', 'See also', 'References', 'External links']

But if you also want the child titles, you can get all the li elements and iterate over them:

import requests
import json
from lxml import html
site = requests.get('https://en.wikipedia.org/wiki/Hamiltonian_mechanics')
tree = html.fromstring(site.content)
# Each top-level <li> is one parent entry of the table of contents.
contents = tree.xpath('//div[@id="toc"]/ul/li')
title_dic = {}
for content in contents:
    # Titles nested directly under this parent (empty list if there are none).
    subcontents = content.xpath('ul/li/a/span[@class="toctext"]/text()')
    title_dic[content.xpath('a/span[@class="toctext"]/text()')[0]] = subcontents
print(json.dumps(title_dic, indent=4))

The output is:

{
    "Overview": [
        "Basic physical interpretation",
        "Calculating a Hamiltonian from a Lagrangian"
    ],
    "Deriving Hamilton's equations": [],
    "As a reformulation of Lagrangian mechanics": [],
    "Geometry of Hamiltonian systems": [],
    "Generalization to quantum mechanics through Poisson bracket": [],
    "Mathematical formalism": [],
    "Riemannian manifolds": [],
    "Sub-Riemannian manifolds": [],
    "Poisson algebras": [],
    "Charged particle in an electromagnetic field": [],
    "Relativistic charged particle in an electromagnetic field": [],
    "See also": [],
    "References": [
        "Footnotes",
        "Sources"
    ],
    "External links": []
}

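You get the parent titles as the keys of the dictionary, and each value is the list of that parent's child titles, which is empty if there are none.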
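The same grouping can be tried without a network request. The following is a sketch under stated assumptions: `SAMPLE_TOC` is a hand-written snippet that only imitates the list structure the XPath relies on, and `toc_to_dict` is a hypothetical helper name, not part of lxml. The live page's markup may differ.

```python
from lxml import html

# Hypothetical markup that mimics a nested table of contents.
SAMPLE_TOC = """
<html><body>
<div id="toc">
  <ul>
    <li><a href="#Overview"><span class="toctext">Overview</span></a>
      <ul>
        <li><a href="#Basic"><span class="toctext">Basic physical interpretation</span></a></li>
      </ul>
    </li>
    <li><a href="#See_also"><span class="toctext">See also</span></a></li>
  </ul>
</div>
</body></html>
"""

def toc_to_dict(tree):
    """Map each top-level title to the list of its direct child titles."""
    result = {}
    for item in tree.xpath('//div[@id="toc"]/ul/li'):
        parent = item.xpath('a/span[@class="toctext"]/text()')
        if not parent:
            # Skip entries without a title instead of raising IndexError.
            continue
        result[parent[0]] = item.xpath('ul/li/a/span[@class="toctext"]/text()')
    return result

titles = toc_to_dict(html.fromstring(SAMPLE_TOC))
print(titles)
```

The guard on an empty `parent` list is an addition for robustness; the code in the answer indexes with [0] directly and would fail on a list item that has no title span.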