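This is my first introduction to Python and BeautifulSoup. I am trying to scrape the current bid from a specific property listed on a popular auction site (RealInsight), but I cannot get BeautifulSoup to extract the actual integer I am looking for, only the HTML code. I am looking for the value of the tag with the class "s-b-n", which is $3.25 million before the auction actually starts.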
https://marketplace.realinsight.com/sales/details/XXX
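I think this is because the value is updated dynamically and generated outside the HTML code, but I am not sure how to verify that hypothesis, or how to get the value if it turns out to be correct. I also think I may not be referencing the table that contains the value correctly, but again, I am not very proficient with Python or bs4.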
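One way to test that hypothesis is to parse the raw server response and see whether the cell already contains text. The sketch below is only an illustration: `SAMPLE_HTML` is a hypothetical stand-in for the page (the real markup may differ), and `static_cell_text` is a helper name made up for this example.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the raw HTML returned by the server.
# The real page's markup is an assumption here.
SAMPLE_HTML = """
<table>
  <tr><td>Current bid</td><td class="s-b-n"></td></tr>
</table>
"""


def static_cell_text(page_html, selector="td.s-b-n"):
    """Return the stripped text of the first match in the raw HTML, or None if absent."""
    cell = BeautifulSoup(page_html, "html.parser").select_one(selector)
    return None if cell is None else cell.get_text(strip=True)


text = static_cell_text(SAMPLE_HTML)
if text:
    print("Value is present in the static HTML:", text)
else:
    print("Cell is missing or empty, so the value is probably filled in by JavaScript.")
```

If the cell comes back empty from the raw HTML but shows a number in the browser, that supports the idea that the value is written in after the page loads.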
[Updated below with the final code using ewwink's method - scrapes every 5 seconds] - updated to run until the auction ends -
import csv
import datetime
import time
from urllib.request import urlopen as uReq

from bs4 import BeautifulSoup as soup

my_url = 'https://marketplace.realinsight.com/sales/details/XXX'
endmsg = "Auction End"


def fetch_soup(url):
    # Download and parse the page; called on every pass so the bid is fresh.
    uclient = uReq(url)
    page_html = uclient.read()
    uclient.close()
    return soup(page_html, "html.parser")


# The first fetch is only needed to name the CSV file after the property.
propname = fetch_soup(my_url).title.text

with open(propname + '_bids.csv', 'w', newline='') as f:
    thewriter = csv.writer(f)
    thewriter.writerow(['i', 'prop_name', 'date_time', 'bid_increment', 'bid_amt'])
    for i in range(0, 5, 1):
        page_soup = fetch_soup(my_url)
        # The bid values are stored as data-* attributes on div.body-content.
        bids = page_soup.select_one(".body-content")
        currentbid = bids['data-nb']
        bidincrement = bids['data-bi']
        currentDT = datetime.datetime.now()
        # As in the original logic, the presence of div.sale-end-text
        # is treated as the end of the auction.
        sale = page_soup.select_one("div.sale-end-text")
        if sale is not None:
            thewriter.writerow([i, endmsg, currentDT, bidincrement, currentbid])
            print(endmsg, currentbid)
            break
        thewriter.writerow([i, propname, currentDT, bidincrement, currentbid])
        print(i, propname, currentDT, bidincrement, currentbid)
        time.sleep(1)
Update using chitown88's method
import csv
import datetime
import time

import bs4
from selenium import webdriver
from selenium.webdriver.chrome.service import Service

# Selenium 4 takes the driver path through a Service object;
# older releases accepted executable_path= directly.
service = Service('C:\\Users\\XXXX\\Downloads\\chromedriver_win32\\chromedriver.exe')
driver = webdriver.Chrome(service=service)
driver.get('https://marketplace.realinsight.com/sales/details/XXX')
propname = bs4.BeautifulSoup(driver.page_source, "html.parser").title.text

with open(propname + '_bids.csv', 'w', newline='') as f:
    thewriter = csv.writer(f)
    thewriter.writerow(['i', 'prop_name', 'date_time', 'bid_amt'])
    for i in range(0, 5, 1):
        # Re-parse the rendered page on every pass so the bid is current.
        page_soup = bs4.BeautifulSoup(driver.page_source, "html.parser")
        bids = page_soup.select("td.s-b-n")
        currentbid = bids[0].text
        currentDT = datetime.datetime.now()
        thewriter.writerow([i, propname, currentDT, currentbid])
        print(i, propname, currentDT, currentbid)
        # Reload, then give the page a moment to render before the next read.
        driver.refresh()
        time.sleep(1)

driver.quit()
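I can see the number I want ($3,250,000) in the HTML code, but it flashes and updates every few seconds, which is why I think it is generated somewhere else.
Any guidance would be greatly appreciated.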
Answer 0 (score: 1)
You can use BeautifulSoup: div.body-content has a data-sb attribute that stores the bid value.
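from urllib.request import urlopen
from bs4 import BeautifulSoup as soup

page_html = urlopen('https://marketplace.realinsight.com/sales/details/XXX').read()
page_soup = soup(page_html, "html.parser")
# The bid is stored as a data attribute on div.body-content.
bids = page_soup.select_one(".body-content")
print(bids['data-sb'])
# Format the number with a dollar sign and thousands separators.
print('${:,d}'.format(int(float(bids['data-sb']))))
# List every attribute on the element to see what else is available.
print(bids.attrs)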
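If the attribute might be missing or hold something non-numeric (for example once the auction closes), the conversion can be guarded. This is a minimal sketch, assuming the attribute holds a decimal string such as "3250000.0"; `format_bid` is a hypothetical helper, not part of bs4, and it accepts any dict-like object such as `bids.attrs`.

```python
from typing import Mapping, Optional


def format_bid(attrs: Mapping, key: str = "data-sb") -> Optional[str]:
    """Return the bid as a string like '$3,250,000', or None if it cannot be read."""
    raw = attrs.get(key)
    if raw is None:
        return None
    try:
        amount = int(float(raw))
    except (TypeError, ValueError, OverflowError):
        # Covers non-numeric text, list-valued attributes, NaN and infinity.
        return None
    return "${:,d}".format(amount)


# "3250000.0" is an assumed example value, not taken from the live page.
print(format_bid({"data-sb": "3250000.0"}))  # $3,250,000
print(format_bid({}))                        # None
```

Returning None instead of raising lets a polling loop skip a bad reading and try again on the next pass.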
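Answer 1 (score: 0)
I could not get BeautifulSoup to give me the data, but I managed it with Selenium. You need chromedriver installed as well as Selenium. Selenium itself can be installed by typing the following in a console: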
pip install selenium
Here is the script:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

pageLink = 'https://marketplace.realinsight.com/sales/details/367'

# Set up our Chrome preferences.
chromeOptions = webdriver.ChromeOptions()

# Change this variable to the path of the chromedriver you downloaded.
# A raw string keeps the backslashes from being read as escape sequences.
chromedriver = r"D:\Downloads\chromedriver_win32\chromedriver.exe"

# Selenium 4 style; older releases took executable_path= and chrome_options=.
driver = webdriver.Chrome(service=Service(chromedriver), options=chromeOptions)
driver.get(pageLink)

# Absolute XPath to the table cell that holds the current bid.
extractData = driver.find_element(By.XPATH, "/html/body/div[3]/section[2]/div/div[1]/div[2]/div[1]/div[2]/div/div[1]/div/table/tbody/tr[2]/td[2]")
print(extractData.text)
driver.quit()
Answer 2 (score: 0)
The page has to be loaded and rendered before you parse it. Selenium is a good fit for that.
import bs4
from selenium import webdriver
driver = webdriver.Chrome()
driver.get('https://marketplace.realinsight.com/sales/details/367')
html = driver.page_source
page_soup = bs4.BeautifulSoup(html,"html.parser")
bids = page_soup.select("td.s-b-n")
bid = bids[0].text
print(bid)
driver.close()
And the output:
In [91]: print(bid)
$3,250,000
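The cell text comes back as a display string, so it may help to convert it to a number before logging or comparing bids. A minimal sketch, assuming the text always looks like the output above (a dollar sign and comma separators); `parse_bid` is a hypothetical helper name.

```python
from decimal import Decimal


def parse_bid(text: str) -> int:
    """Convert a display string such as '$3,250,000' to a whole-dollar integer."""
    cleaned = text.strip().replace("$", "").replace(",", "")
    # Decimal raises decimal.InvalidOperation if the text is not numeric.
    return int(Decimal(cleaned))


print(parse_bid("$3,250,000"))  # 3250000
```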