How do I loop through all the divs inside a website's div using lxml and Python?

Date: 2017-04-09 22:44:58

Tags: python html web-scraping lxml

For fun, I'm trying to write a script in Python that loops through all the posts on the front page of a given subreddit. I have the following code:

from lxml import html
import requests

subredditURL = "https://www.reddit.com/r/" + "pics/"
subredditPage = requests.get(subredditURL)
subredditTree = html.fromstring(subredditPage.content)
subreddit_rows_xpath = subredditTree.xpath('//*[@id="siteTable"]')

for div in subreddit_rows_xpath:
    print(div)

Now, I thought the for loop would print out as many divs as there are posts on the page I'm looking at. For the front page of a typical subreddit, that would be 25 posts. The reason I expected this to work is that when I manually inspect the siteTable div, it appears to contain a series of 25 divs with XPaths of the following format:

//*[@id="thing_t3_63fuuy"]

where the id appears to be a random string. Each post on the front page has one of these divs, and they contain the relevant information about the post that I could explore.

Instead of printing out 25 divs, the code above returns:

<Element div at 0x110669f70>

suggesting only one div, rather than the 25 I expected. Where am I going wrong?

Here is a link to the URL I'm exploring, in case it helps: https://www.reddit.com/r/pics/

1 Answer:

Answer 0 (score: 1)

The expression subredditTree.xpath('//*[@id="siteTable"]') returns a list with only 1 element. So iterating over it using:

for div in subreddit_rows_xpath:
    print(div)

only outputs 1 element, because that's all that exists. If you want to iterate over all of the div elements under subreddit_rows_xpath, you can use:

subreddit_table_divs = subredditTree.xpath('//*[@id="siteTable"]//div')
for div in subreddit_table_divs:
    print(div)
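The list-of-matches behavior is easy to verify offline. The sketch below mimics the siteTable structure with a made-up HTML fragment (the ids are invented stand-ins for reddit's):

```python
from lxml import html

# A minimal stand-in for the subreddit page: one container div
# ("siteTable") holding several post divs. The ids are invented.
snippet = """
<div id="siteTable">
  <div id="thing_t3_aaa">post 1</div>
  <div id="thing_t3_bbb">post 2</div>
  <div id="thing_t3_ccc">post 3</div>
</div>
"""
tree = html.fromstring(snippet)

# xpath() always returns a list of matches. This expression matches
# only the container itself, so the list has exactly one element.
container = tree.xpath('//*[@id="siteTable"]')
print(len(container))  # 1

# Matching the divs *inside* the container yields one element per post.
posts = tree.xpath('//*[@id="siteTable"]//div')
print(len(posts))  # 3
```

The same distinction applies to the real page: the original expression matched the one container div, not its 25 children.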

However, I am guessing you want more than just a bunch of lines that look like <Element div at 0x99999999999>. You probably want either the title or the link to the posts.

To get the titles, you need to drill down two levels to the links:

subreddit_titles = subredditTree.xpath(
    '//*[@id="siteTable"]//div[@class="entry unvoted"]'
    '/p/a[@data-event-action="title"]/text()'
)

To get the links to the images, it is the same path, just grab the href attribute.

subreddit_links = subredditTree.xpath(
    '//*[@id="siteTable"]//div[@class="entry unvoted"]'
    '/p/a[@data-event-action="title"]/@href'
)