How to find the right "div" for web scraping with Python

Date: 2019-05-28 09:08:37

Tags: python html

I can't seem to "inspect" the right element to get Beautiful Soup to work. I'm trying to follow these guides, but I can't get past this point.

https://www.youtube.com/watch?v=XQgXKtPSzUI&t=119s Web scraping with Python

I'm trying to scrape a website to compare the safety features, maintenance costs, and prices of four vehicles. I'm using Spyder (Python 3.6).

import bs4
from urllib import urlopen as uReq
from bs4 import BeautifulSoup as soup

my_url = 'https://www.edmunds.com/car-comparisons/?veh1=401768437&veh2=401753723&veh3=401780798&veh4=401768504'

# opening up connection, grabbing the page
uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()

# html parsing
page_soup = soup(page_html, "html.parser")

#grabs each product
containers = page_soup.findAll("div", {"class":"feature-field-value"})

filename = "car_comparison.csv"
f = open(filename, "W")

Headers = "Title, product_name, shipping\n"
f.write(headers)

for container in containers:
    brand = container.div.div.a.img["title"]

    title_container = container.findAll("a", {"class":"vehicle-title f16-nm f14 lh22-nm lh18 pv2 sserif-2 tr-text minh90"})
    product_name = title_container[0].text

    shipping_container = container.findAll("li", {"class":"price-ship"})
    shipping_container[0].text.strip()

    print("Title: " + title)
    print("product_name: " + product_name)
    print("shipping: " + shipping)

    f.write(brand + "," + product_name.replace(",", "|") + "," + shipping + "\n")

f.close()

#Criteria 1 
#safety = Warranty, Basic?
#Maintence Cost = Maintence
#Price = Base MSRP
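As an aside, the CSV-writing portion of a script like the one above can be sketched with the stdlib csv module, which quotes commas inside fields automatically, so no manual `.replace(",", "|")` workaround is needed (the row values here are hypothetical, not scraped data):

```python
import csv

# Sketch: write header plus one row of (hypothetical) scraped values.
# csv.writer quotes any field that contains a comma, e.g. "$20,650".
rows = [("Honda", "Civic LX", "$20,650")]

with open("car_comparison.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["brand", "product_name", "shipping"])
    writer.writerows(rows)
```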

I know I'll need to make a lot of changes, but right now I just want it to run without errors.


runfile('C:/Users/st.s.mahathirath.ctr/.spyder-py3/temp.py', wdir='C:/Users/st.s.mahathirath.ctr/.spyder-py3')
Traceback (most recent call last):

  File "", line 1, in <module>
    runfile('C:/Users/st.s.mahathirath.ctr/.spyder-py3/temp.py', wdir='C:/Users/st.s.mahathirath.ctr/.spyder-py3')

  File "C:\ProgramData\Anaconda3\lib\site-packages\spyder\utils\site\sitecustomize.py", line 705, in runfile
    execfile(filename, namespace)

  File "C:\ProgramData\Anaconda3\lib\site-packages\spyder\utils\site\sitecustomize.py", line 102, in execfile
    exec(compile(f.read(), filename, 'exec'), namespace)

  File "C:/Users/st.s.mahathirath.ctr/.spyder-py3/temp.py", line 2, in <module>
    from urllib import urlopen as uReq

ImportError: cannot import name 'urlopen'

2 Answers:

Answer 0 (score: 0)


This looks like a typo to me? Perhaps you meant:

from urllib.request import urlopen as uReq
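For context, `urlopen` lived at `urllib.urlopen` in Python 2 but moved to `urllib.request.urlopen` in Python 3, which is what produces the ImportError above. A try/except import covers both versions (a sketch, not required if you only target Python 3):

```python
# urlopen moved from urllib (Python 2) to urllib.request (Python 3);
# try the Python 3 location first and fall back for Python 2.
try:
    from urllib.request import urlopen as uReq  # Python 3
except ImportError:
    from urllib import urlopen as uReq          # Python 2

print(callable(uReq))  # True
```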

Answer 1 (score: 0)

Try the following code:

import bs4
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup

my_url = 'https://www.caranddriver.com/car-comparison-tool?chromeIDs=404121,402727,403989,403148'

# opening up connection, grabbing the page
uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()

# html parsing
page_soup = soup(page_html, "html.parser")
containers = page_soup.findAll("div", {
    "class": "w50p "
})

I see you're trying to build a car comparison; IMHO this may not work, because:

  • it won't work from the command line, since the site will respond with an "unsupported browser" page;
  • the cars on the comparison site (there are far too many of them) are only filled in after the page loads;
  • IMHO, div is not the element to look for. In the browser's debugger console, try document.getElementsByTagName('cd-view-car-card') to see 4 items (the last one is the "add car" item). Inside each cd-view-car-card there is a div with 2 children, and the second child (a div) contains all the relevant information (per the current site design).
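The element-finding idea above can be illustrated without any third-party library; this sketch counts occurrences of a custom tag in a made-up HTML snippet using the stdlib html.parser (the cd-view-car-card tag name comes from the inspection step described above and may change with the site's design):

```python
from html.parser import HTMLParser

# Minimal sketch: count start tags matching a given (custom) tag name.
class TagCounter(HTMLParser):
    def __init__(self, tag):
        super().__init__()
        self.tag = tag
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, including custom elements.
        if tag == self.tag:
            self.count += 1

# Hypothetical snippet standing in for the real page markup.
sample = ("<body>"
          "<cd-view-car-card><div></div></cd-view-car-card>"
          "<cd-view-car-card><div></div></cd-view-car-card>"
          "</body>")

counter = TagCounter("cd-view-car-card")
counter.feed(sample)
print(counter.count)  # 2
```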

Hope this helps.