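I'm new to Python and I'm building a web scraper. I want every instance of the second "span" across the whole web page. My goal is to get all the car brand names (e.g. Nissan) and car model names (e.g. Pathfinder).
But I don't know how to get hold of all the car models. I have tried indexing, but I couldn't build a loop that returns all the model names.
Below is the HTML of the page I want to get the names from.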
<h3 class="brandModelTitle">
<span class="txtGrey3">NISSAN</span>
<span class="txtGrey3">PATHFINDER</span>
<span class="version txtGrey7C noBold">(2)
2.5 DCI 190 LE 7PL EURO5</span>
</h3>
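Below is the code I use to find all the brand names:
Names = []
Prices_Cars = []
for var1 in soup.find_all('h3', class_ = 'brandModelTitle'):
    # var1.span is only the first <span> inside the <h3>, i.e. the brand
    brand_Names = var1.span.text
    Names.append(brand_Names)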
Answer 0 (score: 0)
soup.find_all('h3', class_ = 'brandModelTitle')
returns only the h3 elements; you need to iterate over each h3 and find all the spans inside it.
Try this:
from bs4 import BeautifulSoup

# Named html rather than str so the built-in str type is not shadowed
html = """
<h3 class="brandModelTitle">
<span class="txtGrey3">NISSAN</span>
<span class="txtGrey3">PATHFINDER</span>
<span class="version txtGrey7C noBold">(2)
2.5 DCI 190 LE 7PL EURO5</span>
</h3>
"""
# 'html5lib' requires the html5lib package to be installed
soup = BeautifulSoup(html, 'html5lib')
result = []
for var1 in soup.find_all('h3', class_ = 'brandModelTitle'):
    dic = {}
    # The brand and model spans share the class txtGrey3
    spans = var1.find_all('span', class_ = 'txtGrey3')
    dic["Brands"] = spans[0].get_text()
    dic["model"] = spans[1].get_text()
    result.append(dic)
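One caveat with indexing spans[1] directly is that it raises an IndexError when a heading has fewer than two matching spans. Below is a minimal sketch of a guarded variant. It assumes bs4 is installed and uses the built-in 'html.parser' instead of html5lib. The function name extract_cars and the second h3 (TOYOTA with no model) are hypothetical, added only to show the guard.

```python
from bs4 import BeautifulSoup

# Sample markup: the first <h3> is from the question, the second one
# (brand only, no model) is a made-up case to exercise the guard.
html = """
<h3 class="brandModelTitle">
<span class="txtGrey3">NISSAN</span>
<span class="txtGrey3">PATHFINDER</span>
<span class="version txtGrey7C noBold">(2)
2.5 DCI 190 LE 7PL EURO5</span>
</h3>
<h3 class="brandModelTitle">
<span class="txtGrey3">TOYOTA</span>
</h3>
"""


def extract_cars(markup):
    """Return a list of {'brand', 'model'} dicts, one per complete heading."""
    soup = BeautifulSoup(markup, "html.parser")
    cars = []
    for heading in soup.find_all("h3", class_="brandModelTitle"):
        spans = heading.find_all("span", class_="txtGrey3")
        if len(spans) < 2:
            # Skip headings that lack a brand or a model span
            continue
        cars.append({
            "brand": spans[0].get_text(strip=True),
            "model": spans[1].get_text(strip=True),
        })
    return cars


cars = extract_cars(html)
print(cars)
```

Whether to skip incomplete headings or record them with a missing model depends on what the rest of the scraper expects; skipping is just the simplest choice for this sketch.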
Answer 1 (score: 0)
You can use Scrapy; I am only including the parse function:
def parse(self, response):
    #Remove XML namespaces
    response.selector.remove_namespaces()
    #Extract article information
    brands = response.xpath('//h3/span[1]/text()').extract()
    models = response.xpath('//h3/span[2]/text()').extract()
    details = response.xpath('//h3/span[3]/text()').extract()
    for item in zip(brands,models,details):
        scraped_info = {
            'brand' : item[0],
            'model' : item[1],
            'details' : item[2]
        }
        yield scraped_info
Scrapy info: https://www.analyticsvidhya.com/blog/2017/07/web-scraping-in-python-using-scrapy/
XPath examples: https://www.w3schools.com/xml/xpath_examples.asp
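The positional XPath predicates used in the parse function (span[1], span[2], span[3]) can be tried out without a Scrapy project. The sketch below swaps in the standard library's xml.etree.ElementTree, which supports a limited XPath subset, in place of Scrapy's selectors. It only works here because the sample snippet happens to be well-formed XML; real pages usually are not. The wrapper div and the helper name nth_span_texts are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# The question's snippet, wrapped in a <div> so there is a single root element.
snippet = """
<div>
<h3 class="brandModelTitle">
<span class="txtGrey3">NISSAN</span>
<span class="txtGrey3">PATHFINDER</span>
<span class="version txtGrey7C noBold">(2)
2.5 DCI 190 LE 7PL EURO5</span>
</h3>
</div>
"""

root = ET.fromstring(snippet.strip())


def nth_span_texts(tree_root, position):
    """Text of the position-th <span> child of every <h3> (1-based, as in XPath)."""
    elements = tree_root.findall(f".//h3/span[{position}]")
    # Collapse internal whitespace and newlines to single spaces
    return [" ".join((el.text or "").split()) for el in elements]


brands = nth_span_texts(root, 1)
models = nth_span_texts(root, 2)
details = nth_span_texts(root, 3)

rows = [
    {"brand": b, "model": m, "details": d}
    for b, m, d in zip(brands, models, details)
]
print(rows)
```

Note that zipping three independently collected lists pairs values by position only, so if any h3 lacks one of the spans the rows can drift out of alignment. Iterating per h3 and reading its spans inside the loop avoids that.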