Web scraping: CSS selector returns nothing

Time: 2021-04-30 16:14:41

Tags: python-3.x selenium

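I'm trying to pull ticket prices/info for some baseball games, but every attempt to get the data fails... Any idea what could be causing these problems with the prices, location and details? I also tried XPATH with no success.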

games = ['https://seatgeek.com/dodgers-at-cubs-tickets/5-3-2021-chicago-illinois-wrigley-field/mlb/5316872', \
        'https://seatgeek.com/dodgers-at-cubs-tickets/5-5-2021-chicago-illinois-wrigley-field/mlb/5316885']

#gather ticket data
urls = []
location = []
prices = []
details = []

for g in games:
    try:
        driver.get(g)
        price = [i.text for i in WebDriverWait(driver, 100).until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, '.Button__ButtonContents')))]
        print(price)
        loc = [i.text for i in WebDriverWait(driver, 100).until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, '.ListingTicket__Section')))]
        print(loc)
        detail = [i.text for i in WebDriverWait(driver, 100).until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, '.ListingTicket__Availability')))]
        print(detail)
        url = [str(g)] * len(price)
        urls.extend(url)
        prices.extend(price)
        location.extend(loc)
        details.extend(detail)
        print(str(g) + ": " + len(price) + " ")
    except:
        print('Failed: ' + str(g))
        pass

Updated attempt (fetching the listings from the API):

import requests
import pandas as pd

driver.get('https://seatgeek.com/chicago-cubs-tickets')
gameIds = [i.get_attribute('href') for i in WebDriverWait(driver, 10).until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, '.EventItem__ItemLink-sc-14845pu-6')))]
gameIds = [x[-7:] for x in gameIds]

url = 'https://seatgeek.com/rescraper/v2/listings'
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36'}
writer = pd.ExcelWriter(final, engine='xlsxwriter')

tables = []
for gameId in gameIds:
    payload = {
    '_include_seats': '1',
    'client_id': 'MTY2MnwxMzgzMzIwMTU4',
    'id': '%s' %gameId,
    'sixpack_client_id': '93d1ab10-07dc-4482-bb89-b87c2b144e33'}
    
    jsonData = requests.get(url, headers=headers, params=payload).json()
    df = pd.json_normalize(jsonData['listings'])
    df.to_excel(writer, sheet_name=gameId)
    tables.append(df)
    print(gameId)

table = pd.concat(tables)

writer = pd.ExcelWriter(final, engine='xlsxwriter')
table.to_excel(writer, sheet_name='Tickets')
writer.save()
print('Done')

New error:

HTTPSConnectionPool(host='seatgeek.com', port=443): Max retries exceeded with url: /rescraper/v2/listings?
_include_seats=1&client_id=MTY2MnwxMzgzMzIwMTU4&id=5316872&sixpack_client_id=93d1ab10-07dc-4482-bb89-b87c2b144e33 
(Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1129)')))
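
This error generally means the server's certificate chain could not be validated against the CA bundle Python is using, which often happens behind a TLS-intercepting proxy or with an outdated `certifi` package. One possible approach, sketched below, is to point `requests` at an explicit CA bundle through its `verify` setting. The helper name `make_session` and any custom bundle path are assumptions for illustration, not part of the original code.

```python
import certifi
import requests


def make_session(ca_bundle=None):
    """Return a requests.Session that validates TLS against a known CA bundle.

    ca_bundle is a hypothetical path to a PEM file (for example a corporate
    root certificate); when omitted, the bundle shipped with certifi is used.
    """
    session = requests.Session()
    session.verify = ca_bundle or certifi.where()
    return session


session = make_session()
print("Validating against:", session.verify)
# session.get(url, headers=headers, params=payload) would then be checked
# against that bundle instead of whatever default failed above.
```

Upgrading `certifi` (`pip install --upgrade certifi`) is worth trying first. Disabling verification with `verify=False` hides the error but also removes the protection TLS provides.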

2 answers:

Answer 0 (score: 1)

You can use these locators for those elements:

price = [i.text for i in WebDriverWait(driver, 10).until(EC.presence_of_all_elements_located((By.XPATH, "//div[@data-test='event-listing']//a/span")))]
price = [x.replace('\n', '') for x in price] #added to get rid of newline character in each list element
print(price)
loc = [i.text for i in WebDriverWait(driver, 10).until(EC.presence_of_all_elements_located((By.XPATH, "//div[@data-test='event-listing']//div[@data-test='section']")))]
print(loc)
detail = [i.text for i in WebDriverWait(driver, 10).until(EC.presence_of_all_elements_located((By.XPATH, "//div[@data-test='event-listing']//span[@data-test='quantity']")))]
print(detail)


['$26/ea', '$112/ea', '$27/ea', '$122/ea', '$101/ea', '$88/ea', '$35/ea', '$38/ea']
['424 Right · Row 6', 'Section 113 · Row 1', '420 Right · Row 9', 'Section 114 · Row 1', 'Section 109 · Row 3', 'Section 110 · Row 13', '421 Right · Row 7', '421 Right · Row 6']
['2 tickets', '4 tickets', '2 tickets', '4 tickets', '4 tickets', '4 tickets', '2 tickets', '2 tickets']
...
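
The three lists printed above come from separate waits, so nothing guarantees they line up one to one. A minimal sketch of combining them into per-listing records, with a length check first (`to_rows` is a hypothetical helper; the sample values are the first three entries of each list above):

```python
def to_rows(price, loc, detail):
    """Zip the three scraped lists into one dict per listing."""
    if not (len(price) == len(loc) == len(detail)):
        # Mismatched lengths mean the locators matched different elements.
        raise ValueError(
            f"list lengths differ: {len(price)}, {len(loc)}, {len(detail)}"
        )
    return [
        {"price": p, "location": l, "details": d}
        for p, l, d in zip(price, loc, detail)
    ]


price = ['$26/ea', '$112/ea', '$27/ea']
loc = ['424 Right · Row 6', 'Section 113 · Row 1', '420 Right · Row 9']
detail = ['2 tickets', '4 tickets', '2 tickets']

rows = to_rows(price, loc, detail)
for row in rows:
    print(row)
```

If the lengths ever differ, a more robust option is to locate each `event-listing` container once and read the price, section and quantity from inside it.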

I added another list comprehension for price to strip the newline character that appears in each string.
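
If numeric prices are needed rather than strings, the newline cleanup can be folded into a small parser. This is only a sketch: `parse_price` is a hypothetical helper, and the raw strings below (including the one with a thousands separator) are assumed examples of what `.text` might return.

```python
import re


def parse_price(text: str) -> float:
    """Strip newlines and extract the dollar amount from text like '$26/ea'."""
    cleaned = text.replace("\n", "")
    match = re.search(r"\$([\d,]+(?:\.\d+)?)", cleaned)
    if match is None:
        raise ValueError(f"no price found in {text!r}")
    # Remove thousands separators before converting to a number.
    return float(match.group(1).replace(",", ""))


raw = ["$26\n/ea", "$112\n/ea", "$1,250/ea"]  # assumed raw element text
values = [parse_price(t) for t in raw]
print(values)
```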

You also need one more fix:

Change this:

print(str(g) + ": " + len(price) + " ")

to this:

print(str(g) + ": " + str(len(price)) + " ")
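
The original line raises a TypeError because Python will not concatenate a str with an int. Since that line sits inside the try block with a bare except, the error is reported as 'Failed: ...' even when the scrape itself worked. A minimal sketch of the failure and an f-string alternative (`report` is a hypothetical helper name; the sample prices are taken from the output above):

```python
def report(url: str, price: list) -> str:
    """Build the progress message without manual str() conversions."""
    return f"{url}: {len(price)} "


game = "https://seatgeek.com/dodgers-at-cubs-tickets/5-3-2021-chicago-illinois-wrigley-field/mlb/5316872"
price = ["$26/ea", "$112/ea", "$27/ea"]

try:
    # The original expression: str + int is not allowed.
    broken = str(game) + ": " + len(price) + " "
except TypeError as exc:
    broken = None
    print("TypeError:", exc)

message = report(game, price)
print(message)
```

Catching a specific exception (for example Selenium's `TimeoutException`) instead of a bare `except:` would have made this bug visible immediately.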

Answer 1 (score: 1)

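Just get the data from the API, as long as you have the event ID number. You may need to decipher what the columns mean, but that looks easy enough. You may also want to add the game date; otherwise all the data is there: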

import requests
import pandas as pd

gameIds = [5316872, 5316885] 

url = 'https://seatgeek.com/rescraper/v2/listings'
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36'}

tables = []
for gameId in gameIds:
    payload = {
    '_include_seats': '1',
    'client_id': 'MTY2MnwxMzgzMzIwMTU4',
    'id': '%s' %gameId,
    'sixpack_client_id': '93d1ab10-07dc-4482-bb89-b87c2b144e33'}
    
    jsonData = requests.get(url, headers=headers, params=payload).json()
    df = pd.json_normalize(jsonData['listings'])
    tables.append(df)

Output:

Here is the first table (only the first 10 rows shown); the full table has 265 rows. The other has 455.

print(tables[0].head(10).to_string())
           dm    ep  et       f                                  gk      gr           id         ihd  dl  h  lv  vp                              mk         m  pu       p      pf  q  rp   r      rf  rr          ss     sdq  sgp     sgf    sif                        s                       sf   sr  sh    sco   sp spt      st  wc  sro  dq.b  dq.dq dq.ddq   dq.ev                                                                                                       d                                    fi   sg   sd
0  electronic  True   1    2.00          budweiser bleachers 515_19   85202   y5EMUx5j6Y               0  0   0   0  s:budweiser-bleachers-515 r:19  exchange   0   82.57   84.57  2   4  19  Row 19  19        None      []   64   20.57  False  budweiser bleachers 515  Budweiser Bleachers 515  515   0  False  [2]         pdf   0    0     1  74.62    7.8  146.55                                                                                                     NaN                                   NaN  NaN  NaN
1  electronic  True   1  209.00                       121_5_111:112  895002  kYetLw0ZN64  2021-05-02   0  0   0   0                       s:121 r:5  exchange   0  686.00  895.00  2   4   5   Row 5   5  [111, 112]  [5, 5]  686  209.00  False                      121              Section 121  121   0  False  [2]      mobile   0    0     5  15.73    2.1  433.27  TMX XFER MOBILE ENTRY. Scan your tickets from your mobile phone for this event. MOBILE ENTRY NO SPLITS  9645a1de-66df-49b5-b637-5fa5c4736c41  NaN  NaN
2  electronic  True   1  156.45  budweiser bleachers 502_11_111:112  663002  lxVsqxleK85  2021-05-02   0  0   0   0  s:budweiser-bleachers-502 r:11  exchange   0  506.00  662.45  2   4  11  Row 11  11  [111, 112]  [6, 6]  506  156.45  False  budweiser bleachers 502  Budweiser Bleachers 502  502   0  False  [2]      mobile   0    0     6   2.84    0.5  117.89  TMX XFER MOBILE ENTRY. Scan your tickets from your mobile phone for this event. MOBILE ENTRY NO SPLITS                                   NaN  NaN  NaN
3  electronic  True   1  148.75                      129_13_111:112  631002  kYetLw0ZN2A  2021-05-02   0  0   0   0                      s:129 r:13  exchange   0  482.00  630.75  2   4  13  Row 13  13  [111, 112]  [6, 6]  482  148.75  False                      129              Section 129  129   0  False  [2]      mobile   0    0     6   4.63    0.7  166.99  TMX XFER MOBILE ENTRY. Scan your tickets from your mobile phone for this event. MOBILE ENTRY NO SPLITS  f2d511b1-7b7f-4d84-b628-966fee6e8109  NaN  NaN
4  electronic  True   1  164.16                      218_10_111:112  695002  w3JsqE3VkKz  2021-05-02   0  0   0   0                      s:218 r:10  exchange   0  530.00  694.16  2   4  10  Row 10  10  [111, 112]  [6, 6]  530  164.16  False                      218              Section 218  218   0  False  [2]      mobile   0    0     6   3.56    0.6  166.48  TMX XFER MOBILE ENTRY. Scan your tickets from your mobile phone for this event. MOBILE ENTRY NO SPLITS  a4904c72-fcc2-4342-b214-3283268cbbab  NaN  NaN
5  electronic  True   1  156.45                      218_15_111:112  663002  NrqUJbEl0YM  2021-05-02   0  0   0   0                      s:218 r:15  exchange   0  506.00  662.45  2   4  15  Row 15  15  [111, 112]  [6, 6]  506  156.45  False                      218              Section 218  218   0  False  [2]      mobile   0    0     6   3.70    0.6  155.66  TMX XFER MOBILE ENTRY. Scan your tickets from your mobile phone for this event. MOBILE ENTRY NO SPLITS  a4904c72-fcc2-4342-b214-3283268cbbab  NaN  NaN
6  electronic  True   1  147.17                       229_9_111:112  621002  qVjH7eqn6jB  2021-05-02   0  0   0   0                       s:229 r:9  exchange   0  473.00  620.17  2   4   9   Row 9   9  [111, 112]  [6, 6]  473  147.17  False                      229              Section 229  229   0  False  [2]      mobile   0    0     6   2.73    0.4   77.54  TMX XFER MOBILE ENTRY. Scan your tickets from your mobile phone for this event. MOBILE ENTRY NO SPLITS  4481eab0-396d-4696-bf67-950e33b45c5d  NaN  NaN
7  electronic  True   1  139.45                      229_13_111:112  589002  rVOH8wD9EP2  2021-05-02   0  0   0   0                      s:229 r:13  exchange   0  449.00  588.45  2   4  13  Row 13  13  [111, 112]  [6, 6]  449  139.45  False                      229              Section 229  229   0  False  [2]      mobile   0    0     6   3.01    0.5   74.55  TMX XFER MOBILE ENTRY. Scan your tickets from your mobile phone for this event. MOBILE ENTRY NO SPLITS  4481eab0-396d-4696-bf67-950e33b45c5d  NaN  NaN
8  electronic  True   1  132.75                      229_17_111:112  557002  jDvsErZMO59  2021-05-02   0  0   0   0                      s:229 r:17  exchange   0  424.00  556.75  2   4  17  Row 17  17  [111, 112]  [6, 6]  424  132.75  False                      229              Section 229  229   0  False  [2]      mobile   0    0     6   3.33    0.5   71.82  TMX XFER MOBILE ENTRY. Scan your tickets from your mobile phone for this event. MOBILE ENTRY NO SPLITS  4481eab0-396d-4696-bf67-950e33b45c5d  NaN  NaN
9  electronic  True   1  148.75                      218_20_111:112  631002  3q7fvGgbAwB  2021-05-02   0  0   0   0                      s:218 r:20  exchange   0  482.00  630.75  2   4  20  Row 20  20  [111, 112]  [6, 6]  482  148.75  False                      218              Section 218  218   0  False  [2]      mobile   0    0     6   3.90    0.6  145.81  TMX XFER MOBILE ENTRY. Scan your tickets from your mobile phone for this event. MOBILE ENTRY NO SPLITS  a4904c72-fcc2-4342-b214-3283268cbbab  NaN  NaN
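
Some of the short column names can be partly decoded from the rows above: in every sample row `pf` equals `p + f`, which suggests `p` is the base price, `f` the fees and `pf` the price including fees, while `q`, `sf` and `rf` look like quantity, section and row. These meanings are guesses from the data, not documented field names. A sketch of renaming a listing and tagging it with its event ID (the listings carry no game date, so the ID is used as a stand-in; `GUESSED_NAMES` and `readable` are hypothetical):

```python
# Guessed meanings of a few short keys; inferred from sample rows only.
GUESSED_NAMES = {
    "p": "price",
    "f": "fees",
    "pf": "price_with_fees",
    "q": "quantity",
    "sf": "section",
    "rf": "row",
}


def readable(listing: dict, event_id: int) -> dict:
    """Keep the guessed fields under readable names and tag the event."""
    row = {new: listing.get(old) for old, new in GUESSED_NAMES.items()}
    row["event_id"] = event_id
    return row


# Values copied from row 0 of the table above.
sample = {"p": 82.57, "f": 2.00, "pf": 84.57, "q": 2,
          "sf": "Budweiser Bleachers 515", "rf": "Row 19"}

row = readable(sample, 5316872)
# Check the price + fees relationship, allowing for float rounding.
consistent = abs(row["price"] + row["fees"] - row["price_with_fees"]) < 0.005
print(row, consistent)
```

The same mapping could be applied to each item of `jsonData['listings']` before calling `pd.json_normalize`, so every table records which game its rows belong to.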