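I am experimenting with web scraping and would like to ask whether the ValueError in the following situation can be overcome. As an example, I want to scrape the following 5 data fields: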
1) Car Model: Honda Fit Auto 1.3
2) Price: S$19,000
3) Date post: 3 weeks ago by back_packer
4) Depreciation: S$8,362.75
5) Registration Date: 15 Jan 2010
In the site's HTML, the data for 2) to 5) sit under the same tag:
<p class="cU-b cU-d">3 weeks ago by <a href="/back_packer" rel="nofollow "
target="_blank">back_packer</a></p>
<p class="cU-b cU-d">S$19,000</p>
<p class="cU-b cU-d">S$8,362.75</p>
<p class="cU-b cU-d">15 Jan 2010</p>
from bs4 import BeautifulSoup as bs
from requests import get

def getHTML(link, counter):
    return bs(get(link.format(counter)).content, "html.parser")
PAGE_URL = 'https://sg.carousell.com/categories/cars-32/cars-for-sale-1173/'
CAR_URL = 'https://sg.carousell.com/p/{}'
car = dict()
content = getHTML(CAR_URL, car_id).find('div', {'class': 'aG-c aG-b'})
car['Model'] = content.find('p', {'class': 'cU-b cU-e'}).text
car['Post'], car['Price'], car['Deprec'], car['Regstr_Date'] = {info.text for
info in content.find_all('p', {'class': 'cU-b cU-d'})}
====================================
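When I run this, I get "ValueError: not enough values to unpack (expected 3, got 2)". I suspect the error is caused by at least one car record that is missing the post date, price, depreciation, or registration date field.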
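A minimal reproduction of this kind of failure, independent of the site, might look like the sketch below. The strings are made-up sample values, not scraped data. It also illustrates a second pitfall: a set comprehension drops duplicate strings and does not preserve page order, so it can yield fewer values than there are paragraphs even when no field is missing.

```python
# Sample values standing in for a listing with two fields missing (assumption).
texts = ["3 weeks ago by back_packer", "S$19,000"]

try:
    post, price, deprec, regstr_date = texts
except ValueError as exc:
    # Unpacking four names from two values raises ValueError.
    error_message = str(exc)

# A set collapses identical strings into one element and is unordered,
# so the values can no longer be matched to fields by position.
unique_texts = {t for t in ["S$19,000", "S$19,000", "15 Jan 2010"]}
```

Collecting the texts with a list comprehension instead of a set keeps both duplicates and page order.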
Thanks.
Answer 0 (score: 0)
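The page is not very scraping-friendly, so you may need trial and error until you get the right result. Here is my attempt (I replaced missing values with "-"; at least it no longer raises ValueError, but you need to check that it scrapes the correct information):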
from bs4 import BeautifulSoup as bs
from requests import get
import re
from pprint import pprint
def getHTML(link, counter):
    return bs(get(link.format(counter)).content, "html.parser")
PAGE_URL = 'https://sg.carousell.com/categories/cars-32/cars-for-sale-1173/'
CAR_URL = 'https://sg.carousell.com/p/{}'
# car_id = 'mazda-3-sedan-auto-1-5-182030279'
car_id = 'nissan-nv200-1-5-manual-182141686'
# car_id = 'toyota-corolla-axio-1-5-auto-x-177344405'
car = {}
content = getHTML(CAR_URL, car_id).find('div', {'class': 'aG-c aG-b'})
car['Model'] = content.find('p', {'class': 'cU-b cU-e'}).text
data = []
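# Take at most four candidate paragraphs, stopping at the "N Likes" counter
for p in content.select('section.bi-c.bi-h p.cU-b.cU-d')[:4]:
    if re.match(r'\d+\s+Likes', p.text):
        break
    data.append(p.text)
# Pad with '-' so the unpacking succeeds even when fields are missing
car['Post'], car['Price'], car['Deprec'], car['Regstr_Date'], *_ = data + ['-'] * 4
# A third value without '$' is probably the registration date, so swap the two
if car['Deprec'] != '-' and '$' not in car['Deprec']:
    car['Regstr_Date'], car['Deprec'] = car['Deprec'], car['Regstr_Date']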
pprint(car)
For this car, it prints:
{'Deprec': '-',
 'Model': 'Nissan NV200 1.5 Manual',
 'Post': 'an hour ago by rubberr',
 'Price': 'S$25,800',
 'Regstr_Date': '-'}
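Since positional unpacking depends on which fields happen to be present, one alternative is to assign each string to a field by its content. The following is only a sketch under assumptions drawn from the sample values above: post lines contain " ago by ", money values start with "S$" with the price listed before the depreciation, and dates look like "15 Jan 2010". The names `classify`, `FIELDS`, and `DATE_RE` are hypothetical helpers, not part of any library.

```python
import re

FIELDS = ("Post", "Price", "Deprec", "Regstr_Date")

# Assumed date shape, e.g. "15 Jan 2010".
DATE_RE = re.compile(r"^\d{1,2} [A-Za-z]{3} \d{4}$")

def classify(texts, missing="-"):
    """Map each string to a field by its content; absent fields get `missing`."""
    car = dict.fromkeys(FIELDS, missing)
    for text in texts:
        text = text.strip()
        if " ago by " in text:
            car["Post"] = text
        elif text.startswith("S$"):
            # Assumes the price appears before the depreciation on the page.
            if car["Price"] == missing:
                car["Price"] = text
            elif car["Deprec"] == missing:
                car["Deprec"] = text
        elif DATE_RE.match(text):
            car["Regstr_Date"] = text
    return car

# Sample strings taken from the listings shown in this thread.
full = classify(["3 weeks ago by back_packer", "S$19,000", "S$8,362.75", "15 Jan 2010"])
partial = classify(["an hour ago by rubberr", "S$25,800"])
no_deprec = classify(["3 weeks ago by back_packer", "S$19,000", "15 Jan 2010"])
```

With the scraping code above, this could be called as `car.update(classify(p.text for p in content.select('section.bi-c.bi-h p.cU-b.cU-d')))`, which would make the swap step unnecessary. The results still need checking against the live page, since the patterns are guesses from a few examples.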