Question

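I'm brand new to web scraping. I've been working on a project that requires me to get the word of the day from Merriam-Webster's Word of the Day page. I have managed to grab the word itself and now just need the definition, but when I try to get it, this is the result: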

Avuncular (the correct word of the day)

Definition:

[]

Here is my code:

from lxml import html
import requests

page = requests.get('https://www.merriam-webster.com/word-of-the-day')
tree = html.fromstring(page.content)

word = tree.xpath('/html/body/div[1]/div/div[4]/main/article/div[1]/div[2]/div[1]/div/h1/text()')

WOTD = str(word)
WOTD = WOTD[2:]
WOTD = WOTD[:-2]

print(WOTD.capitalize())


print("Definition:")

wordDef = tree.xpath('/html/body/div[1]/div/div[4]/main/article/div[2]/div[1]/div/div[1]/p[1]/text()')

print(wordDef)

The [] should be the first definition, but for some reason it comes back empty.

Any help would be greatly appreciated.

Answer 1

Your XPath is slightly off. Here is the correct one:

wordDef = tree.xpath('/html/body/div[1]/div/div[4]/main/article/div[3]/div[1]/div/div[1]/p[1]/text()')

Note the div[3] after main/article instead of div[2]. With that change, wordDef should contain the definition text rather than an empty list.
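An absolute XPath that is off by one index does not raise an error; it simply matches nothing and returns an empty list, which is why the original attempt printed []. The following is a minimal sketch of that behaviour. The HTML in SAMPLE is made up for illustration and is much simpler than the real Merriam-Webster page.

```python
from lxml import html

# Hypothetical markup (an assumption, not the real page): the definition
# sits in the third <div> under <article>, not the second.
SAMPLE = """
<html><body><main><article>
  <div><h1>avuncular</h1></div>
  <div><span>adjective</span></div>
  <div><p>suggestive of an uncle</p></div>
</article></main></body></html>
"""

tree = html.fromstring(SAMPLE)

# div[2] exists but has no <p> child, so the path matches nothing.
wrong = tree.xpath('/html/body/main/article/div[2]/p[1]/text()')
# div[3] holds the paragraph we actually want.
right = tree.xpath('/html/body/main/article/div[3]/p[1]/text()')

print(wrong)  # []
print(right)  # ['suggestive of an uncle']
```

When a path returns [], shortening it one step at a time and checking which prefix still matches is usually a quick way to find the wrong index.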

Answer 2

If you want to avoid hard-coding indices in the XPath, here is an alternative to your current attempt:

import requests
from lxml.html import fromstring

page = requests.get('https://www.merriam-webster.com/word-of-the-day')
tree = fromstring(page.text)
word = tree.xpath("//*[@class='word-header']//h1")[0].text
wordDef = tree.xpath("//h2[contains(.,'Definition')]/following-sibling::p/strong")[0].tail.strip()
print(f'{word}\n{wordDef}')

If wordDef does not capture the whole definition, try replacing that line with the following:

wordDef = tree.xpath("//h2[contains(.,'Definition')]/following-sibling::p")[0].text_content()
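The difference between the two variants comes down to .tail versus text_content(). Here is a small sketch on made-up markup: SAMPLE is an assumption about the general shape of the paragraph, not a copy of the real page.

```python
from lxml.html import fromstring

# Hypothetical markup: a heading followed by a paragraph that starts
# with a <strong> colon and contains an inline <em>.
SAMPLE = """
<div>
  <h2>Definition</h2>
  <p><strong>:</strong> suggestive of an <em>uncle</em> in kindliness</p>
</div>
"""

tree = fromstring(SAMPLE)
para = tree.xpath("//h2[contains(.,'Definition')]/following-sibling::p")[0]

# .tail is only the text between </strong> and the next tag,
# so it stops at the inline <em>.
tail_only = para.xpath("strong")[0].tail.strip()

# text_content() concatenates the text of the element and all descendants.
full_text = para.text_content().strip()

print(tail_only)  # suggestive of an
print(full_text)  # : suggestive of an uncle in kindliness
```

So .tail gives a cleaner string when the definition is plain text, while text_content() is the safer choice when the paragraph contains nested inline tags.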

Output:

avuncular
suggestive of an uncle especially in kindliness or geniality
