For fun, I am trying to write a Python script that iterates over all the posts on the front page of a given subreddit. I have the following code:
from lxml import html
import requests
subredditURL = "https://www.reddit.com/r/" + "pics/"
subredditPage = requests.get(subredditURL)
subredditTree = html.fromstring(subredditPage.content)
subreddit_rows_xpath = subredditTree.xpath('//*[@id="siteTable"]')
for div in subreddit_rows_xpath:
    print(div)
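I expected the for loop to print as many divs as there are posts on the page I am looking at, which for the front page of a typical subreddit should be 25. I thought this would work because, when I inspect the siteTable div manually, it appears to contain a series of 25 divs whose XPaths look like this: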
//*[@id="thing_t3_63fuuy"]
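where the id appears to be a random string. There is one such div for each post on the front page, and each one contains the post's information that I could explore.
Instead of printing 25 divs, the code above returns: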
<Element div at 0x110669f70>
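which indicates only one div rather than the 25 I expected. Where am I going wrong?
Here is a link to the URL I am exploring, in case it helps: https://www.reddit.com/r/pics/
Answer 0 (score: 1)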
The expression subredditTree.xpath('//*[@id="siteTable"]') returns a list with only one element: the siteTable container itself, not its children. So iterating over it using:
for div in subreddit_rows_xpath:
    print(div)
only outputs one element, because that is all the list contains. If you want to iterate over all of the div elements inside the siteTable element (every descendant div, nested ones included), you can use:
subreddit_table_divs = subredditTree.xpath('//*[@id="siteTable"]//div')
for div in subreddit_table_divs:
    print(div)
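To make the difference between these selections concrete, here is a minimal sketch run against a small inline HTML sample rather than the live page. The SAMPLE markup is an assumption, modeled only on the ids and classes mentioned in this thread, and reddit's real markup may differ. The starts-with filter on "thing_" is likewise an illustrative guess at how to keep only the per-post divs.

```python
from lxml import html

# Hypothetical markup, loosely modeled on the structure described above.
SAMPLE = """
<html><body>
<div id="siteTable">
  <div id="thing_t3_aaa" class="thing">
    <div class="entry unvoted">
      <p class="title"><a data-event-action="title" href="/a">First</a></p>
    </div>
  </div>
  <div id="thing_t3_bbb" class="thing">
    <div class="entry unvoted">
      <p class="title"><a data-event-action="title" href="/b">Second</a></p>
    </div>
  </div>
  <div class="clearleft"></div>
</div>
</body></html>
"""

tree = html.fromstring(SAMPLE)

# The container itself: one element, because only one node has this id.
container = tree.xpath('//*[@id="siteTable"]')

# Direct child divs only (single slash).
children = tree.xpath('//*[@id="siteTable"]/div')

# Every descendant div at any depth (double slash), nested ones included.
descendants = tree.xpath('//*[@id="siteTable"]//div')

# Only the direct children whose id starts with "thing_" (assumed post divs).
posts = tree.xpath('//*[@id="siteTable"]/div[starts-with(@id, "thing_")]')

print(len(container), len(children), len(descendants), len(posts))
for post in posts:
    print(post.get("id"))
```

In this sample the double-slash form also picks up the nested entry divs and the spacer div, which is why a narrower expression is usually more useful for counting posts.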
However, I am guessing you want more than just a bunch of lines that look like <Element div at 0x99999999999>. You probably want either the title or the link of each post.
To get the titles, you need to drill down two levels to the links:
subreddit_titles = subredditTree.xpath(
    '//*[@id="siteTable"]//div[@class="entry unvoted"]'
    '/p/a[@data-event-action="title"]/text()'
)
To get the links to the images, use the same path but select the href attribute instead:
subreddit_links = subredditTree.xpath(
    '//*[@id="siteTable"]//div[@class="entry unvoted"]'
    '/p/a[@data-event-action="title"]/@href'
)
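If you want each title paired with its link, one option is to select the anchor elements once and read both values from each. The sketch below is only an illustration under some assumptions: extract_posts and SAMPLE are hypothetical names and markup, not part of the original code. It also matches the class token "entry" rather than the exact string "entry unvoted", because @class="entry unvoted" is an exact string comparison and would miss a div whose class attribute holds different tokens, and it resolves relative hrefs against the page URL with urljoin.

```python
from urllib.parse import urljoin

from lxml import html

BASE_URL = "https://www.reddit.com/r/pics/"

# Hypothetical markup for illustration; the live page may be structured differently.
SAMPLE = """
<div id="siteTable">
  <div class="thing">
    <div class="entry unvoted">
      <p class="title">
        <a data-event-action="title" href="https://i.example.com/cat.jpg">A cat</a>
      </p>
    </div>
  </div>
  <div class="thing">
    <div class="entry likes">
      <p class="title">
        <a data-event-action="title" href="/r/pics/comments/abc/a_dog/">A dog</a>
      </p>
    </div>
  </div>
</div>
"""


def extract_posts(page_source, base_url=BASE_URL):
    """Return a list of (title, absolute_url) tuples, one per title link."""
    tree = html.fromstring(page_source)
    # Match any div that has "entry" among its space-separated class tokens.
    anchors = tree.xpath(
        '//*[@id="siteTable"]'
        '//div[contains(concat(" ", normalize-space(@class), " "), " entry ")]'
        '/p/a[@data-event-action="title"]'
    )
    return [
        (a.text_content().strip(), urljoin(base_url, a.get("href")))
        for a in anchors
    ]


posts = extract_posts(SAMPLE)
for title, link in posts:
    print(title, link)
```

Reading both values from the same anchor keeps titles and links aligned, which two separate xpath() calls do not guarantee if one of the attributes is missing on some post.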