Python web scraping - access the HTML after a few seconds?

Date: 2018-01-03 12:44:52

Tags: python html python-3.x python-requests

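I am using Python to access this website and scrape its HTML: http://forum.toribash.com/tori_spy.php

As you can see if you visit the page, its content changes after a few seconds. The page shows the latest posts on the forum, and I am building a Discord bot that can display the most recent post.

Right now, the bot displays the first post in that list as it appears before any of the animations/changes happen.

I would like to know whether there is a way to skip the animation, or to make the program wait a few seconds after accessing the page before it grabs all the HTML.

Current code: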

    if message.content.startswith("-post"):
        await client.send_message(message.channel, ":arrows_counterclockwise: **Accessing forums...**")
        await client.send_typing(message.channel)
        time.sleep(5)
        #access site
        session_requests = requests.session()
        url = "http://forum.toribash.com/tori_spy.php"
        result = session_requests.get(url,headers = dict(referer = url))
        #access html
        tree = html.fromstring(result.content)

        list_stuff=[]
        for atag in tree.xpath("//strong/a"): #search for <strong><a>
            list_stuff.append(atag.text_content())
        await client.send_message(message.channel, ":white_check_mark: Last post was in the thread **"+list_stuff[0]+"**")

Thanks a lot!

1 answer:

Answer 0: (score: 0)

The page uses AJAX/XHR to load new posts. It requests a URL like this:

forum.toribash.com/vaispy.php?do=xml&last=9297850&r=0....

last is the ID of the last message. You can find it in the page's HTML, inside a <script> tag, as highestid = 9297850;. r doesn't seem to matter - at least the code works for me without r.
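As a rough illustration of pulling that number out of the script text, here is a minimal sketch using a regular expression. The helper name extract_highestid and the sample script text are made up for this example; only the highestid = 9297850; assignment comes from the page as described above.

```python
import re

# Sample script text. Only the "highestid = 9297850;" line is taken from the
# answer; the surrounding lines are an assumption for illustration.
SAMPLE_SCRIPT = """
<!--
var something = 1;
highestid = 9297850;
//-->
"""


def extract_highestid(script_text):
    """Return the highestid value as a string, or None if it is absent.

    Hypothetical helper, not part of the forum's code or of any library.
    """
    match = re.search(r"highestid\s*=\s*(\d+)", script_text)
    return match.group(1) if match else None


print(extract_highestid(SAMPLE_SCRIPT))
```

A pattern like this does not depend on which line of the script holds the assignment, so it may survive small layout changes better than fixed line and character offsets.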

Once you have highestid, you can use it to request the latest messages as XML.
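A small sketch of how that request URL could be assembled from highestid. The helper name build_spy_url is hypothetical, and the http:// scheme is assumed from the URL in the question; the do and last parameters are the ones shown in the URL above.

```python
from urllib.parse import urlencode

# Endpoint path taken from the URL shown in the answer; the http:// scheme
# is assumed from the URL in the question.
BASE_URL = "http://forum.toribash.com/vaispy.php"


def build_spy_url(last_id):
    """Return the XML feed URL for posts newer than last_id.

    Hypothetical helper for illustration only.
    """
    query = urlencode({"do": "xml", "last": last_id})
    return BASE_URL + "?" + query


print(build_spy_url(9297850))
```

With requests, the same effect can be had by passing the dictionary through the params argument of get() instead of building the string by hand.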

In the XML, each message's ID appears as <postid>, so you can use it as last in the next request.

    import requests
    from lxml import html

    url = "http://forum.toribash.com/tori_spy.php"

    s = requests.session()

    # Load the page once to find the ID of the newest post it already shows
    result = s.get(url)
    tree = html.fromstring(result.content)

    for script in tree.xpath("//script"):
        if script.text and 'highestid' in script.text:
            # Line 4 of the script holds the highestid assignment; the slice
            # strips the prefix and the trailing ';' (tied to the page's layout)
            highestid = script.text.split('\n')[3]
            highestid = highestid[13:-1]
            print('highestid:', highestid)

            # Request the posts newer than highestid as XML
            result = s.get('http://forum.toribash.com/vaispy.php?do=xml&last='+highestid, headers=dict(referer=url))
            #print(result.text)
            data = html.fromstring(result.content)

            # Each <event> element is one new post
            for item in data.xpath('.//event'):
                print('--- event ---')
                print('id:', item.xpath('.//id')[0].text)
                print('postid:', item.xpath('.//postid')[0].text)
                print(item.xpath('.//preview')[0].text)

Current result (yours may differ):

    highestid: 9297873
    --- event ---
    id: 9297883
    postid: 9297883
    me vende esse full valkyrie por 18k
    --- event ---
    id: 9297881
    postid: 9297881
    Congratz Goat! Welcome to the team! :)
    --- event ---
    id: 9297879
    postid: 9297879
    Try to reset your email password, then attempt to do what I suggested.
    --- event ---
    id: 9297877
    postid: 9297877
    Hello Nope. Most of these bugs are known to currently cause issues and they are being worked on. People pinging and rejoining are bots that are being dealt with (it's just an extensive process to...
    --- event ---
    id: 9297874
    postid: 9297874
    Bon courage :)