Web scraping - content does not appear in the page source

Time: 2020-08-04 14:32:36

Tags: python selenium web-scraping

I'm trying to scrape information from this site: https://foreclosures.cabarruscounty.us/. All of the data appears to be generated in repeating cards, but when I view the page source I can't find that information. I've tried using a web driver (e.g. Selenium), but I still can't see the content I want to scrape. I'd like to be able to extract all of the repeated data for every entry.

import bs4 as bs
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager

chrome_options = webdriver.ChromeOptions()
driver = webdriver.Chrome(ChromeDriverManager().install(), options=chrome_options)

url = 'https://foreclosures.cabarruscounty.us/'
driver.get(url)

page_source = driver.page_source
soup = bs.BeautifulSoup(page_source, 'html.parser')
print(soup)

How can I access or view the content inside the repeating cards themselves?

2 answers:

Answer 0 (score: 2)

The data you see is loaded from an external URL, so you can fetch it with just the requests module:

import json
import requests


url = 'https://foreclosures.cabarruscounty.us/dataForeclosures.json'
data = requests.get(url).json()

# uncomment this to see all data:
# print(json.dumps(data, indent=4))

# print some data to screen:
for d in data:
    for k, v in d.items():
        print('{:<5}: {}'.format(k, v))
    print('-' * 80)

Prints:

ID   : 2062
TM   : 04-086 -0010.00
S    : COMPLAINT/JUDGMENT
C    : 20-CVD-1754
R    : 56235032510000
T    : 14,850
O    : W O L INC A NC CORPORATION
M    : 3,703
SD   : PENDING
ST   : PENDING
D    : S/S DALE EARNHARDT BLVD
A    : ZACCHAEUS LEGAL SVCS
CO   : www.zls-nc.com
SL   : 77 UNION ST S CONCORD NC 28025
SP   : COURTHOUSE STEPS
U    : https://foreclosures.cabarruscounty.us/PropertyPhotos/2062.jpg
OR   : 3
--------------------------------------------------------------------------------
ID   : 2061
TM   : 04-007 -0006.00
S    : COMPLAINT/JUDGMENT
C    : 20-CVD-1070
R    : 56036654730000
T    : 135,190
O    : PITTS H M PITTS H M ESTATE
M    : 9,475
SD   : PENDING
ST   : PENDING
D    : SOUTH SIDE MOORESVILLE RD
A    : ZACCHAEUS LEGAL SVCS
CO   : www.zls-nc.com
SL   : 77 UNION ST S CONCORD NC 28025
SP   : COURTHOUSE STEPS
U    : https://foreclosures.cabarruscounty.us/PropertyPhotos/2061.jpg
OR   : 3
--------------------------------------------------------------------------------

...and so on.
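The short JSON keys aren't documented anywhere, but several of them line up with the labels the site shows on each card (e.g. C ↔ Case Number, T ↔ Tax Value, M ↔ Min Bid). A minimal sketch of relabeling one record; the key-to-label mapping here is my own guess from comparing the JSON output with the card labels, not something the API defines:

```python
# Hypothetical mapping from the JSON's short keys to the labels the site
# shows on each card; inferred by eye from the two outputs, not documented.
FIELD_LABELS = {
    'TM': 'Real ID',
    'S': 'Status',
    'C': 'Case Number',
    'T': 'Tax Value',
    'M': 'Min Bid',
    'SD': 'Sale Date',
    'ST': 'Sale Time',
    'O': 'Owner',
    'A': 'Attorney',
    'U': 'Photo URL',
}

def relabel(record):
    """Return a new dict with readable labels, keeping unmapped keys as-is."""
    return {FIELD_LABELS.get(k, k): v for k, v in record.items()}

# One record copied from the output above:
sample = {'ID': 2062, 'TM': '04-086 -0010.00', 'C': '20-CVD-1754',
          'T': '14,850', 'O': 'W O L INC A NC CORPORATION', 'M': '3,703'}

for label, value in relabel(sample).items():
    print(f'{label}: {value}')
```

Unmapped keys (like ID) pass through unchanged, so nothing is silently dropped if the feed adds fields.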

Answer 1 (score: 1)

Try this:

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://foreclosures.cabarruscounty.us/")

all_cards = driver.find_elements(By.XPATH, "//div[@class='card-body']/div[1]")
for card in all_cards:
    print(card.text)  # do as you will

The xpath grabs the cards that carry the text content. My devtools say there are 174 of them.

A simple approach is to fetch them all and then loop over them.

I've just printed them, but you can do whatever you like with each card.

Here is the output I got (just the first few, since there are a lot):

Real ID: 11-045 -0010.40
Status: UPSET BID PERIOD
Case Number: 18-CVD-2687
Tax Value: $71,500
Min Bid: $9,394
Sale Date: 12/05/2019
Sale Time: 12:00 PM
Owner: PACAJERO REALTY LLC
Attorney: ZACCHAEUS LEGAL SVCS
Real ID: 01-021 -0014.70
Status: UPSET BID PERIOD
Case Number: 16-CVD-3713
Tax Value: $21,360
Min Bid: $5,965
Sale Date: 02/20/2020
Sale Time: 12:00 PM
Owner: HOOKS JOHNNY DALE JR...
Attorney: ZACCHAEUS LEGAL SVCS
Real ID: 11-045 -0017.00
Status: UPSET BID PERIOD
Case Number: 18-CVD-2687
Tax Value: $370,670
Min Bid: $39,187
Sale Date: 12/05/2019
Sale Time: 12:00 PM
Owner: PACAJERO REALTY LLC
Attorney: ZACCHAEUS LEGAL SVCS
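Since `card.text` comes back as a single newline-separated string of `Label: value` lines, it can be split into a dict before use. A small sketch using one card's text copied from the output above; this parsing step is my own addition, not part of the original answer:

```python
def parse_card(text):
    """Split a card's visible text into a {label: value} dict.

    Splits on the first ': ' only, so values containing colons
    (like '12:00 PM') stay intact.
    """
    fields = {}
    for line in text.splitlines():
        if ': ' in line:
            label, value = line.split(': ', 1)
            fields[label] = value
    return fields

# One card's text, as printed above:
card_text = (
    "Real ID: 11-045 -0010.40\n"
    "Status: UPSET BID PERIOD\n"
    "Case Number: 18-CVD-2687\n"
    "Tax Value: $71,500\n"
    "Min Bid: $9,394"
)

fields = parse_card(card_text)
print(fields['Case Number'])  # 18-CVD-2687
```

With the cards in dict form it's straightforward to filter by status or write them out with `csv.DictWriter`.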