Beautiful Soup scraping

Time: 2020-09-29 17:13:52

Tags: python-3.x python-2.7 web-scraping beautifulsoup

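I'm having a problem where old code that used to work no longer runs correctly.

My Python code uses Beautiful Soup to scrape a website and extract event data (date, event, link).

My code extracts all the events inside the tbody. Each event is stored in a <tr class="Box">. The problem is that my scraper seems to stop working after this <tr style="box-shadow: none;">. Once it reaches that section (which holds ads for 3 events on the site that I don't want to scrape), the code stops extracting event data from the <tr class="Box"> rows. Is there a way to skip this tr style and ignore it from then on?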


import pandas as pd
from bs4 import BeautifulSoup
import urllib.request
import warnings
warnings.filterwarnings("ignore", category=UserWarning, module='bs4')

source = urllib.request.urlopen('https://10times.com/losangeles-us/technology/conferences').read()
soup = BeautifulSoup(source, 'html.parser')

# ---Get Event Data---
test1 = []
table = soup.find('tbody')
table_rows = table.find_all('tr')  # find table rows (tr)
for tr in table_rows:
    data = tr.find_all('td')  # find table data (td)
    row = [td.text for td in data]
    if len(row) > 2:  # Excludes rows with only event name/link, but no data.
        test1.append(row)
test1

1 answer:

Answer 0 (score: 2)

The data is loaded dynamically via JavaScript, which is why you don't see more results. You can use this example to load more pages:

import requests
from bs4 import BeautifulSoup


url = "https://10times.com/ajax?for=scroll&path=/losangeles-us/technology/conferences"
params = {"page": 1, "ajax": 1}
headers = {"X-Requested-With": "XMLHttpRequest"}

for params["page"] in range(1, 4):  # <-- increase number of pages here
    print("Page {}..".format(params["page"]))
    soup = BeautifulSoup(
        requests.get(url, headers=headers, params=params).content,
        "html.parser",
    )
    for tr in soup.select('tr[class="box"]'):
        tds = [td.get_text(strip=True, separator=" ") for td in tr.select("td")]
        print(tds)

Prints:

Page 1..
['Tue, 29 Sep - Thu, 01 Oct 2020', 'Lens Los Angeles', 'Intercontinental Los Angeles Downtown, Los Angeles', 'LENS brings together the entire Degreed community - our clients, invited prospective clients, thought leaders, partners, employees, executives, and industry experts for two days of discussion, workshops,...', 'Business Services IT & Technology', 'Interested']
['Wed, 30 Sep - Sat, 03 Oct 2020', 'FinCon', 'Long Beach Convention & Entertainment Center, Long Beach 20.1 Miles from Los Angeles', 'FinCon will be helping financial influencers and brands create better content, reach their audience, and make more money. Collaborate with other influencers who share your passion for making personal finance...', 'Banking & Finance IT & Technology', 'Interested 7 following']
['Mon, 05  - Wed, 07 Oct 2020', 'NetDiligence Cyber Risk Summit', 'Loews Santa Monica Beach Hotel, Santa Monica 14.6 Miles from Los Angeles', 'NetDiligence Cyber Risk Summit will conference are attended by hundreds of cyber risk insurance, legal/regulatory and security/privacy technology leaders from all over the world. Connect with leaders in...', 'IT & Technology', 'Interested']

... etc.
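
On the original question of skipping the ad rows: selecting rows by class, rather than taking every tr in the tbody, leaves out any row that lacks that class. The following is only a minimal sketch on made-up markup. SAMPLE_HTML and extract_events are hypothetical names, the HTML is a simplified stand-in for the real page, and the class name "box" is taken from the selector used above, so adjust it if the live markup differs.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup modelled on the structure described in the
# question; the real page is larger and may differ.
SAMPLE_HTML = """
<table><tbody>
<tr class="box"><td>Tue, 29 Sep 2020</td><td><a href="/lens">Lens Los Angeles</a></td><td>Los Angeles</td></tr>
<tr style="box-shadow: none;"><td>Sponsored</td><td>Ad event</td><td>Elsewhere</td></tr>
<tr class="box"><td>Wed, 30 Sep 2020</td><td><a href="/fincon">FinCon</a></td><td>Long Beach</td></tr>
</tbody></table>
"""


def extract_events(html):
    """Return date, event name and link for each row that has the event class."""
    soup = BeautifulSoup(html, "html.parser")
    events = []
    # "tr.box" matches only rows carrying the class, so the ad row
    # (which has only an inline style) is never visited.
    for tr in soup.select("tr.box"):
        cells = [td.get_text(strip=True, separator=" ") for td in tr.find_all("td")]
        if len(cells) <= 2:  # skip rows without full event data
            continue
        link = tr.find("a", href=True)
        events.append(
            {
                "date": cells[0],
                "event": cells[1],
                "link": link["href"] if link else None,
            }
        )
    return events


events = extract_events(SAMPLE_HTML)
for event in events:
    print(event)
```

The same function could be given the content of each AJAX response from the loop above instead of the sample string, assuming those responses use the same row structure.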