How do I print and display all the results from a web scraper?

Time: 2020-02-17 02:46:24

Tags: python web-scraping beautifulsoup

import requests
from bs4 import BeautifulSoup

URL = ""
page = requests.get(URL)

soup = BeautifulSoup(page.content, 'html.parser')
results = soup.find(id='simple-view')

events_elems = results.find_all('ul', class_='searchResults')

for event_elem in events_elems:

    date_elem = event_elem.find('li', class_='date-indicator')
    location_elem = event_elem.find('div', class_='text--labelSecondary')
    e_elem = event_elem.find('a', class_='event')
    if None in (date_elem,location_elem, e_elem):
        continue
    print(date_elem.text)
    print(location_elem.text)
    print(e_elem.text)

I have just started with web scraping in Python and tried the code above on meetup.com, but it only prints one set of results. Did I get the iteration wrong?

1 answer:

Answer 0: (score: 1)

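The .find_all call you are using

events_elems = results.find_all('ul', class_='searchResults')

does not capture every row on the site. It apparently matches the outer list rather than the individual event entries, so the loop body runs once and each .find() inside it returns only the first match. In other words, you need to refine your search criteria.

Your event_elem.find('li', class_='date-indicator') is not sufficient either, because it does not capture the date of each event.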
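
To illustrate the difference between matching a container and matching the repeated entries, here is a minimal sketch. The HTML string is made up for illustration and is not the real meetup.com markup; only the class names are borrowed from the code in this thread.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, invented for this example; the real page may differ.
HTML = """
<ul class="searchResults">
  <li class="event-listing"><a class="event">Event A</a></li>
  <li class="event-listing"><a class="event">Event B</a></li>
  <li class="event-listing"><a class="event">Event C</a></li>
</ul>
"""

soup = BeautifulSoup(HTML, "html.parser")

# Only one <ul class="searchResults"> exists, so a loop over this list runs once.
outer = soup.find_all("ul", class_="searchResults")

# .find() returns just the first matching descendant of that single container.
first_only = [ul.find("a", class_="event").text for ul in outer]

# .find_all() on the repeated element returns one entry per event.
all_events = [a.text for a in soup.find_all("a", class_="event")]

print(first_only)
print(all_events)
```

Looping over the container yields a single result, while looping over the repeated child element yields one result per event.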


See the working code below, which captures the result set through the container of the event list:

import requests
from bs4 import BeautifulSoup

URL = "https://www.meetup.com/find/events/"
page = requests.get(URL)

soup = BeautifulSoup(page.content, 'html.parser')
results = soup.find(id='simple-view')

# Take the first <ul> that holds the event list, then collect every event entry in it
event_container = results.find_all('ul', class_='event-listing-container')[0]
events_elems = event_container.find_all(class_='event-listing')

for event_elem in events_elems:

    location_elem = event_elem.find('div', class_='text--labelSecondary')
    e_elem = event_elem.find('a', class_='event')
    # Build the date from the entry's data-* attributes plus the text of its <time> tag
    date = "{}-{}-{} {}".format(
        event_elem.attrs['data-year'],
        event_elem.attrs['data-month'],
        event_elem.attrs['data-day'],
        event_elem.find('time').text.replace('\n', ''),
    )

    print(date)
    print(location_elem.text)
    print(e_elem.text)
    print('-----')

Sample output:

2020-2-17 9:00AM


Architecting for Innovation



Australasian Enterprise Architecture Summer School 2020

-----
2020-2-17 5:00PM


Sydney Indoor Rock Climbers



Monday and Thursday Night Climbing @ St Peters (Beginners Welcome)

-----
2020-2-17 5:30PM

......
......
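
The blank lines in the sample output come from whitespace inside each element's text. As a possible refinement, the sketch below strips that whitespace with get_text(strip=True) and returns None for a missing element instead of raising an error. The helper name text_or_none and the HTML fragment are hypothetical; only the class and attribute names are taken from the answer's code.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment imitating one event entry; the layout is an assumption.
HTML = """
<li class="event-listing" data-year="2020" data-month="2" data-day="17">
  <time>
9:00AM
</time>
  <div class="text--labelSecondary">

    Architecting for Innovation

  </div>
  <a class="event">
    Australasian Enterprise Architecture Summer School 2020
  </a>
</li>
"""


def text_or_none(parent, name, **kwargs):
    """Return the stripped text of the first matching child, or None if absent."""
    elem = parent.find(name, **kwargs)
    return elem.get_text(strip=True) if elem is not None else None


soup = BeautifulSoup(HTML, "html.parser")
event = soup.find(class_="event-listing")

# Same date format as in the answer, built from the data-* attributes.
date = "{}-{}-{} {}".format(
    event["data-year"],
    event["data-month"],
    event["data-day"],
    text_or_none(event, "time"),
)
secondary = text_or_none(event, "div", class_="text--labelSecondary")
title = text_or_none(event, "a", class_="event")
# An element that is not present yields None rather than an AttributeError.
missing = text_or_none(event, "span", class_="does-not-exist")

print(date)
print(secondary)
print(title)
```

With this approach each field prints on a single line, and the caller can decide whether to skip an event when a field is None.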