Question

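I want to scrape data from this news website: http://www.inquirer.net/

I want to grab the news headlines shown on the tiles.

Here is a screenshot of the inspected markup.

As you can see, one of the tile headlines I want to grab is there. When I copy the XPath from the browser, it returns //*[@id="tgs3_info"]/h2

I tried running my Python code: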

import lxml.html
import lxml.etree
import requests

link = 'http://www.inquirer.net/'
res = requests.get(link)
r = res.content
html_content = lxml.html.fromstring(r)
root = html_content.xpath('//*[@id="tgs3_info"]/h2')
print(root)

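But it returns an empty list.

I have searched Stack Overflow and the rest of the internet for an answer, but I still don't understand. When you view the page source of the site, the data I want is not inside a JavaScript function. It is in a div, so I don't understand why I can't get the data. I hope I can find an answer here.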

Answer 1

Using the input from Xurasky's solution to avoid the 403 error:

import lxml.html
import lxml.etree
from urllib.request import Request, urlopen

req = Request('http://www.inquirer.net/', headers={'User-Agent': 'Mozilla/5.0'})
r = urlopen(req).read()
html_content = lxml.html.fromstring(r)
root = html_content.xpath('//*[@id="tgs3_info"]/h2')
for a in root:
    print(a.text_content())

Output:

Duterte, Roque meeting set in Malacañang
2 senators welcome Ventura's revelations in Atio hazing case
Paolo Duterte vows to retire from politics in 2019
NBA: DeMarcus Cousins regrets being loyal to Sacramento Kings
PH bet Elizabeth Durado Clenci wins 2nd runner-up at Miss Grand International 2017
DOJ wants Divina, 50 others in `Atio' hazing case added on BI watchlist
Georgina Wilson Shares Messages From Fans on Baby Blues
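
To check the XPath without hitting the site on every run, the fetch and the parse can be split into two functions. This is only a minimal sketch: the names fetch_html, extract_headlines, HEADLINE_XPATH and SAMPLE_HTML are made up for illustration, and the sample markup is a guess at the tile structure, not a copy of the real page.

```python
import lxml.html
from urllib.request import Request, urlopen

# XPath copied from the browser in the question.
HEADLINE_XPATH = '//*[@id="tgs3_info"]/h2'


def fetch_html(url, user_agent='Mozilla/5.0'):
    """Download a page, sending a browser-like User-Agent header."""
    req = Request(url, headers={'User-Agent': user_agent})
    with urlopen(req) as resp:
        return resp.read()


def extract_headlines(html, xpath=HEADLINE_XPATH):
    """Return the stripped text of every element matched by the XPath."""
    tree = lxml.html.fromstring(html)
    return [el.text_content().strip() for el in tree.xpath(xpath)]


# Hypothetical stand-in for the real page, used only to test the XPath offline.
SAMPLE_HTML = b"""
<html><body>
  <div id="tgs3_info"><h2><a href="#">First headline</a></h2></div>
  <div id="tgs3_info"><h2><a href="#">Second headline</a></h2></div>
</body></html>
"""

print(extract_headlines(SAMPLE_HTML))
```

Against the live site you would call extract_headlines(fetch_html('http://www.inquirer.net/')) instead. If that returns an empty list while the sample works, the response body is probably not the page you see in the browser.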

Answer 2

I believe you are getting urllib.error.HTTPError: HTTP Error 403: Forbidden.

You can solve this with:

import lxml.html
import lxml.etree
from urllib.request import Request, urlopen

req = Request('http://www.inquirer.net/', headers={'User-Agent': 'Mozilla/5.0'})
res = urlopen(req).read()
html_content = lxml.html.fromstring(res)
root = html_content.xpath('//*[@id="tgs3_info"]/h2')
print(root)
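
If you would rather keep requests, as in the question, the same idea can be sketched like this. It assumes the server was rejecting the default User-Agent, which I have not verified. Note that requests.get does not raise on a 403 by itself, so the original code may simply have parsed an error page. The helper name fetch_or_raise and the ERROR_PAGE body are invented for the example.

```python
import lxml.html
import requests


def fetch_or_raise(url, user_agent='Mozilla/5.0', timeout=10):
    """Fetch a page with a browser-like User-Agent; raise on 4xx/5xx."""
    res = requests.get(url, headers={'User-Agent': user_agent}, timeout=timeout)
    # Without this call a 403 response is returned like any other response.
    res.raise_for_status()
    return res.content


# Hypothetical error body: parsing it "works", but the XPath matches nothing,
# which looks exactly like the empty list from the question.
ERROR_PAGE = b"<html><body><h1>403 Forbidden</h1></body></html>"
matches = lxml.html.fromstring(ERROR_PAGE).xpath('//*[@id="tgs3_info"]/h2')
print(matches)
```

Printing res.status_code before parsing is a quick way to tell a blocked request apart from a wrong XPath.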

How do I get data with XPath from a JavaScript website?
