I'm building a Python script that collects data from Instagram, based on a list of users stored in a database. However, I'm having trouble handling unexpected JSON responses.

For some context: the program fetches a username from my database table (running 24/7, cycling through a few hundred accounts, hence the while True: loop), requests that user's URL, and expects a JSON response (specifically, it looks for ['entry_data']['ProfilePage'][0] in the response).

However, if the username doesn't exist on Instagram, the JSON is different and the expected part (['entry_data']['ProfilePage'][0]) is missing, so my script crashes.

Here is my current code:
def get_username_from_db():
    try:
        with connection.cursor() as cursor:
            cursor.execute("SELECT * FROM ig_users_raw WHERE `username` IS NOT NULL ORDER BY `ig_users_raw`.`last_checked` ASC LIMIT 1")
            row = cursor.fetchall()
            username = row[0]['username']
    except pymysql.IntegrityError:
        print('ERROR: ID already exists in PRIMARY KEY column')
    return username

def request_url(url):
    try:
        response = requests.get(url)
    except requests.HTTPError:
        raise requests.HTTPError(f'Received non 200 status code from {url}')
    except requests.RequestException:
        raise requests.RequestException
    else:
        return response.text

def extract_json_data(url):
    try:
        r = requests.get(url, headers=headers)
    except requests.HTTPError:
        raise requests.HTTPError('Received non-200 status code.')
    except requests.RequestException:
        raise requests.RequestException
    else:
        print(url)
        soup = BeautifulSoup(r.content, "html.parser")
        scripts = soup.find_all('script', type="text/javascript", text=re.compile('window._sharedData'))
        stringified_json = scripts[0].get_text().replace('window._sharedData = ', '')[:-1]
        j = json.loads(stringified_json)['entry_data']['ProfilePage'][0]
        return j

if __name__ == '__main__':
    while True:
        sleep(randint(5, 15))
        username = get_username_from_db()
        url = f'https://www.instagram.com/{username}/'
        j = extract_json_data(url)
        json_string = json.dumps(j)
        user_id = j['graphql']['user']['id']
        username = j['graphql']['user']['username']
        #print(user_id)
        try:
            with connection.cursor() as cursor:
                db_data = (json_string, datetime.datetime.now(), user_id)
                sql = "UPDATE `ig_users_raw` SET json=%s, last_checked=%s WHERE `user_id`= %s "
                cursor.execute(sql, db_data)
                connection.commit()
                print(f'{datetime.datetime.now()} - data inserted for user: {user_id} - {username}')
        except pymysql.Error:
            print('ERROR: ', pymysql.Error)
I'm getting the following error/traceback:
https://www.instagram.com/geloria.itunes/
Traceback (most recent call last):
File "D:\Python\Ministry\ig_raw.py", line 63, in <module>
j = extract_json_data(url)
File "D:\Python\Ministry\ig_raw.py", line 55, in extract_json_data
j = json.loads(stringified_json)['entry_data']['ProfilePage'][0]
File "C:\Users\thoma\AppData\Local\Programs\Python\Python36-32\lib\json\__init__.py", line 354, in loads
return _default_decoder.decode(s)
File "C:\Users\thoma\AppData\Local\Programs\Python\Python36-32\lib\json\decoder.py", line 339, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "C:\Users\thoma\AppData\Local\Programs\Python\Python36-32\lib\json\decoder.py", line 357, in raw_decode
raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 2 column 1 (char 1)
Ideally, I'd like the script to simply skip that account (geloria.itunes in this case) and move on to the next one in the database. I might want to delete the account, or at least remove the username from its row.

To solve this myself, I tried an if/else branch, but when that case keeps occurring, I just keep looping over the same account.

Do you have any suggestions on how to solve this particular problem?

Thanks!
Answer 0 (score: 1)
First, you need to figure out why the exception occurs.

You're getting this error because you're telling json to parse an invalid (non-JSON) string.

Just run this example with the URL from your traceback:
import re
import requests
from bs4 import BeautifulSoup
r = requests.get("https://www.instagram.com/geloria.itunes/")
print(r.status_code) # outputs 404(!)
soup = BeautifulSoup(r.content, "html.parser")
scripts = soup.find_all('script', type="text/javascript", text=re.compile('window._sharedData'))
stringified_json = scripts[0].get_text().replace('window._sharedData = ', '')[:-1]
print(stringified_json)
# j = json.loads(stringified_json) # will raise an exception
Output:
\n(function(){\n function normalizeError(err) {\n... ... stringify(normalizedError));\n })\n }\n })\n}());
As you can see, stringified_json is not a valid JSON string.

As you suspected, it's invalid because this Instagram page is hidden or doesn't exist (the HTTP status code is 404 Not Found), and you end up passing that wrong response to json.loads() because your script never checks the response status code.
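To make the failure concrete, here is a minimal stand-alone reproduction (the blob below is a shortened stand-in for the JavaScript output shown above):

```python
import json

# The 404 page's inline <script> contains JavaScript, not JSON,
# so json.loads() rejects it with JSONDecodeError ("Expecting value").
blob = "\n(function(){ /* error-reporting code */ }());"

try:
    json.loads(blob)
    decoded = True
except json.JSONDecodeError:
    decoded = False

print(decoded)  # False
```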
The following except clauses don't catch this "404 case", because you still received a valid HTTP response, so there was nothing to raise an exception about:
except requests.HTTPError:
raise requests.HTTPError('Received non-200 status code.')
except requests.RequestException:
raise requests.RequestException
So basically you have two options to fix this:

- check the status code manually: if r.status_code != 200: ...
- use the raise_for_status() method, which raises an exception if 400 <= r.status_code < 600
"I might want to delete the account, or at least remove the username from its row."

Well, this part of your question sounds a bit vague, but here's one idea.

For example, if you hit a 404 page, you can raise a custom exception while processing the response, catch it later in __main__, delete the record from the database, and continue with the other pages:
class NotFoundError(Exception):
    """ my custom exception for not found pages """
    pass

... # other functions

def extract_json_data(url):
    r = requests.get(url, headers=headers)
    if r.status_code == 404:
        raise NotFoundError()  # page not found
    # if any other error occurs (network unavailable for example) - an exception will be raised
    soup = BeautifulSoup(r.content, "html.parser")
    scripts = soup.find_all('script', type="text/javascript", text=re.compile('window._sharedData'))
    stringified_json = scripts[0].get_text().replace('window._sharedData = ', '')[:-1]
    return json.loads(stringified_json)['entry_data']['ProfilePage'][0]

if __name__ == '__main__':
    while True:
        sleep(randint(5, 15))
        username = get_username_from_db()
        url = f'https://www.instagram.com/{username}/'
        try:
            j = extract_json_data(url)
        except NotFoundError:
            delete_user_from_db(username)  # implement: DELETE FROM t WHERE username = ...
            continue  # proceed for next user page
        # rest of your code:
        # json_string = json.dumps(j)
        # user_id = j['graphql']['user']['id']
        # ...
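A minimal sketch of the delete_user_from_db helper the snippet above assumes (table and column names are taken from the question; the connection is passed in explicitly here, whereas the question's code uses a module-level pymysql connection):

```python
def delete_user_from_db(connection, username):
    # Remove a username that no longer exists on Instagram,
    # so the main loop stops revisiting the same dead account.
    with connection.cursor() as cursor:
        cursor.execute(
            "DELETE FROM `ig_users_raw` WHERE `username` = %s",
            (username,),  # parameterized to avoid SQL injection
        )
    connection.commit()
```

Note that with pymysql, autocommit is off by default, hence the explicit connection.commit().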