Query: scraping multiple pages using BeautifulSoup

Time: 2018-05-24 02:25:15

Tags: python-3.x web-scraping beautifulsoup screen-scraping

I'm trying to scrape pages from this site using BeautifulSoup - https://concreteplayground.com/auckland/events. I can extract everything from page 1, but when I want to move to the next page I can't find any link/reference to parse for it. When I inspect the page while moving to page 2, I find the following:

<a rel="nofollow" class="page-numbers" href="">2</a>

I'm not sure how to handle this type of web page. It would be great if someone could help me out with this. The next page's content is being fetched and displayed at the same URL, and I'm not sure what's happening in the background either. Thanks & Regards

1 answer:

Answer 0: (score: 0)

Sorry for my earlier rubbish answer - I was all about Selenium's click function, haha. Anyway, the page you want is Ajax-heavy and needs a different approach from traditional HTML scraping. See the following link for details on the kind of URLs you'll have to deal with: Handling Ajax. Essentially, a script runs that paginates without ever changing the main URL.
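To make that concrete: the only thing that changes between pages is the `paged` query parameter of the site's Ajax endpoint, which you can see in the network tab. A minimal sketch of building that per-page URL (the endpoint string here is the one used in the full script further down):

```python
# The site's Ajax endpoint, copied from the browser's network tab
BASE = ('https://concreteplayground.com/ajax.php?post_type=tribe_events'
        '&place_type=event&region=auckland&sort=all')
TAIL = '&action=directory_search&user_lat=&user_lon='

def page_url(page):
    """Build the Ajax URL for a given page; the visible page URL never changes."""
    return BASE + '&paged=' + str(page) + TAIL

print(page_url(2))
```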

Below is my attempt at the output you asked for. If anyone finds a way to simplify or improve it, I'd be grateful.

#Import essentials
import requests
from bs4 import BeautifulSoup


#Not necessary, but always useful just in case
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.181 Safari/537.36'}


#Read url, parse using BeautifulSoup, and dynamically find no of pages
temp_page = requests.get('https://concreteplayground.com/auckland/events', headers=headers)
soup = BeautifulSoup(temp_page.content, 'html.parser')
PgNos = len(soup.findAll('li', {'class':'page'}))


#Now for the interesting part!

#Form the url to which requests are to be sent. This url is used to GET every
#json response, which I've later parsed and printed. The url is visible in the
#network tab of your browser's developer tools (like Firebug)
for i in range(1, PgNos + 1):    #pages are 1-indexed; starting at 0 would fetch page 1 twice
    u = 'https://concreteplayground.com/ajax.php?post_type=tribe_events&place_type=event&region=auckland&sort=all&paged='
    r = str(i)
    l = '&action=directory_search&user_lat=&user_lon='
    url = u+r+l
    response = requests.get(url, headers=headers)
    data = response.json()

    #Now, iterate through the main body of the json to get what you want
    for each in data['results']:

        event_name = each['post_title']
        event_excerpt = each['post_excerpt']

        #There's a li'l HTML bit here, so you ought'a use BS to parse that. 
        rdata = each['info']
        raw = BeautifulSoup(rdata, 'lxml')
        date = raw.p.text
        rawvenue = raw.findAll('span', {'itemprop':'name'})
        venuename = rawvenue[0].text
        venueaddress = rawvenue[0].meta['content']

        #Obviously, you can also write to a file in lieu of the below. 
        print ('Event : ' + event_name + '\n' + 'Excerpt : ' + event_excerpt + '\n' +'Date : ' +  date + '\n' + 'Venue : ' + venuename + '\n' + 'Address : ' + venueaddress + '\n\n')
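As the last comment says, you can write to a file instead of printing. A minimal sketch using the csv module - the field names and sample row here are illustrative, mirroring the fields parsed above:

```python
import csv

# Sample rows in the same shape as the event fields parsed above (illustrative data)
events = [
    {'Event': 'Sample Gig', 'Excerpt': 'A show.', 'Date': 'Fri 25 May',
     'Venue': 'Town Hall', 'Address': '301 Queen St, Auckland'},
]

# newline='' stops the csv module inserting blank rows on Windows
with open('events.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.DictWriter(f, fieldnames=['Event', 'Excerpt', 'Date', 'Venue', 'Address'])
    writer.writeheader()
    writer.writerows(events)
```

Inside the loop above you'd call `writer.writerow(...)` once per event instead of `print`.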

These sources were also useful while refactoring my answer: GET and POST explanation, JSON iteration.