I am following a web-scraping tutorial to learn web scraping. I followed the instructions step by step, but I got an error:
TypeError: the JSON object must be str, bytes or bytearray, not 'dict'
I searched for this error, and most of the solutions involve encoding or decoding, but they don't work for my code. When I use encode or decode, I get a different error: "'dict' object has no attribute 'decode'".
I am using Python 2.7.10 with PyCharm.
Here is the traceback:
Traceback (most recent call last):
  File "/Users/junjielin/PycharmProjects/meituan/index.py", line 46, in <module>
    main(url)
  File "/Users/junjielin/PycharmProjects/meituan/index.py", line 42, in main
    shop_id_list = get_detail_id(category_url, headers=headers)
  File "/Users/junjielin/PycharmProjects/meituan/index.py", line 35, in get_detail_id
    return json.loads(content_id.get('data')).get('poiidList')
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/json/__init__.py", line 348, in loads
    'not {!r}'.format(s.__class__.__name__))
TypeError: the JSON object must be str, bytes or bytearray, not 'dict'
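The traceback says the value passed to json.loads is already a dict: json.loads only accepts a JSON-encoded string (or bytes), not an object that has already been parsed. A minimal sketch that reproduces the same TypeError, with a made-up dict standing in for content_id.get('data'):

```python
import json

# Hypothetical stand-in for the already-parsed 'data' value
data = {'poiidList': [123, 456]}

try:
    json.loads(data)  # json.loads expects str/bytes, not a dict
except TypeError as e:
    print(e)  # reports that the JSON object must be str/bytes, not dict

# Round-tripping through a string works, because loads gets a str:
print(json.loads(json.dumps(data)))
```
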
Here is my code:
# -*- coding: utf-8 -*-
import requests
from bs4 import BeautifulSoup  # parse the HTML and extract tag content
import json
import lxml
url = 'http://taishan.meituan.com/'
headers = {
'Accept':'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
'Accept-Language':'en-US,en;q=0.8,zh-CN;q=0.6,zh;q=0.4',
'Cache-Control':'max-age=0',
'Connection':'keep-alive',
'Host':'taishan.meituan.com',
'Referer':'http://taishan.meituan.com/',
'Upgrade-Insecure-Requests':'1',
'User-Agent':'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36',
'Content-Type':'text/html; charset=UTF-8',
}
def get_start_links(url):
    html = requests.get(url).text
    # print(html)
    soup = BeautifulSoup(html, 'lxml')
    links = [link.find('div').find('div').find('dl').find('dt').find('a')['href'] for link in soup.find_all('div', class_='J-nav-item')]
    # print(links)
    return links
def get_detail_id(url, headers=None):
    html = requests.get(url, headers=headers).text
    soup = BeautifulSoup(html, 'lxml')
    # json.loads converts the JSON string to a dict
    content_id = json.loads(soup.find('div', class_='J-scrollloader cf J-hub')['data-async-params'])
    # a = content_id.get('data')
    return json.loads(content_id.get('data')).get('poiidList')
def main(url):
    start_url_list = get_start_links(url)
    for j in start_url_list:
        for i in range(1, 11):
            category_url = j + '/all/page{}'.format(i)
            shop_id_list = get_detail_id(category_url, headers=headers)
            print(shop_id_list)

if __name__ == '__main__':
    main(url)
Thanks for the quick reply. Following your advice, I removed the call to json.loads from the return statement, which became:
return content_id.get('data').get('poiidList')
However, it gives me a new error:
Traceback (most recent call last):
  File "/Users/junjielin/PycharmProjects/meituan/index.py", line 46, in <module>
    main(url)
  File "/Users/junjielin/PycharmProjects/meituan/index.py", line 42, in main
    shop_id_list = get_detail_id(category_url, headers=headers)
  File "/Users/junjielin/PycharmProjects/meituan/index.py", line 35, in get_detail_id
    return content_id.get('data').get('poiidList')
AttributeError: 'str' object has no attribute 'get'
I don't understand what this means.
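Taken together, the two tracebacks suggest the value under 'data' is already a dict on some pages (so json.loads raises TypeError) but a JSON-encoded string on others (so .get raises AttributeError). A hedged sketch of a helper that handles both cases, assuming Python 3 as in the traceback (the function name and sample values are hypothetical):

```python
import json

def extract_poi_ids(content_id):
    """Return poiidList whether 'data' arrives as a dict or as a JSON string."""
    data = content_id.get('data')
    if isinstance(data, str):   # still serialized -> decode it first
        data = json.loads(data)
    return data.get('poiidList')

# Works when 'data' is already a dict...
print(extract_poi_ids({'data': {'poiidList': [1, 2, 3]}}))
# ...and when 'data' is a JSON-encoded string.
print(extract_poi_ids({'data': '{"poiidList": [1, 2, 3]}'}))
```

The isinstance check is the key: it only calls json.loads when there is actually a string left to decode, which avoids both errors at once.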