I'm building a Python script that collects data from Instagram, based on a list of users stored in a database. However, I'm having trouble handling unexpected JSON responses.

For some context: the program fetches a username from my database table (running 24/7, cycling through a few hundred accounts, hence the while True: loop), requests that user's URL, and expects a JSON response (specifically, it looks for ['entry_data']['ProfilePage'][0] in the response).

However, if the username doesn't exist on Instagram, the JSON is different and the expected part (['entry_data']['ProfilePage'][0]) is missing, so my script crashes.

Here is my current code:
def get_username_from_db():
    try:
        with connection.cursor() as cursor:
            cursor.execute("SELECT * FROM ig_users_raw WHERE `username` IS NOT NULL ORDER BY `ig_users_raw`.`last_checked` ASC LIMIT 1")
            row = cursor.fetchall()
            username = row[0]['username']
    except pymysql.IntegrityError:
        print('ERROR: ID already exists in PRIMARY KEY column')
    return username

def request_url(url):
    try:
        response = requests.get(url)
    except requests.HTTPError:
        raise requests.HTTPError(f'Received non 200 status code from {url}')
    except requests.RequestException:
        raise requests.RequestException
    else:
        return response.text

def extract_json_data(url):
    try:
        r = requests.get(url, headers=headers)
    except requests.HTTPError:
        raise requests.HTTPError('Received non-200 status code.')
    except requests.RequestException:
        raise requests.RequestException
    else:
        print(url)
        soup = BeautifulSoup(r.content, "html.parser")
        scripts = soup.find_all('script', type="text/javascript", text=re.compile('window._sharedData'))
        stringified_json = scripts[0].get_text().replace('window._sharedData = ', '')[:-1]
        j = json.loads(stringified_json)['entry_data']['ProfilePage'][0]
        return j

if __name__ == '__main__':
    while True:
        sleep(randint(5, 15))
        username = get_username_from_db()
        url = f'https://www.instagram.com/{username}/'
        j = extract_json_data(url)
        json_string = json.dumps(j)
        user_id = j['graphql']['user']['id']
        username = j['graphql']['user']['username']
        #print(user_id)
        try:
            with connection.cursor() as cursor:
                db_data = (json_string, datetime.datetime.now(), user_id)
                sql = "UPDATE `ig_users_raw` SET json=%s, last_checked=%s WHERE `user_id`= %s "
                cursor.execute(sql, db_data)
                connection.commit()
                print(f'{datetime.datetime.now()} - data inserted for user: {user_id} - {username}')
        except pymysql.Error:
            print('ERROR: ', pymysql.Error)
I'm getting the following error/traceback:
https://www.instagram.com/geloria.itunes/
Traceback (most recent call last):
File "D:\Python\Ministry\ig_raw.py", line 63, in <module>
j = extract_json_data(url)
File "D:\Python\Ministry\ig_raw.py", line 55, in extract_json_data
j = json.loads(stringified_json)['entry_data']['ProfilePage'][0]
File "C:\Users\thoma\AppData\Local\Programs\Python\Python36-32\lib\json\__init__.py", line 354, in loads
return _default_decoder.decode(s)
File "C:\Users\thoma\AppData\Local\Programs\Python\Python36-32\lib\json\decoder.py", line 339, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "C:\Users\thoma\AppData\Local\Programs\Python\Python36-32\lib\json\decoder.py", line 357, in raw_decode
raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 2 column 1 (char 1)
Ideally, I'd like the script to simply skip that account (geloria.itunes in this case) and move on to the next one in the database. I might want to delete the account, or at least remove the username from its row.

To solve this myself, I tried an if/else branch, but when that case keeps occurring, I just keep looping over the same account.

Do you have any suggestions on how to solve this particular problem?

Thanks!
Answer 0 (score: 1)
First, you need to figure out why the exception occurs.

You're getting this error because you're telling json to parse an invalid (non-JSON) string.

Just run this example with the URL from your traceback:
import re
import requests
from bs4 import BeautifulSoup
r = requests.get("https://www.instagram.com/geloria.itunes/")
print(r.status_code) # outputs 404(!)
soup = BeautifulSoup(r.content, "html.parser")
scripts = soup.find_all('script', type="text/javascript", text=re.compile('window._sharedData'))
stringified_json = scripts[0].get_text().replace('window._sharedData = ', '')[:-1]
print(stringified_json)
# j = json.loads(stringified_json) # will raise an exception
Output:
\n(function(){\n function normalizeError(err) {\n... ... stringify(normalizedError));\n })\n }\n })\n}());
As you can see, stringified_json is not a valid JSON string.

As you suspected, it's invalid because this Instagram page is hidden or doesn't exist (the HTTP status code is 404 Not Found), and you end up passing that wrong response to json.loads() because your script never checks the response status code.
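To make the failure concrete, here is a minimal stand-alone reproduction (the blob below is a shortened stand-in for the JavaScript output shown above):

```python
import json

# The 404 page's inline <script> contains JavaScript, not JSON,
# so json.loads() rejects it with JSONDecodeError ("Expecting value").
blob = "\n(function(){ /* error-reporting code */ }());"

try:
    json.loads(blob)
    decoded = True
except json.JSONDecodeError:
    decoded = False

print(decoded)  # False
```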
The following except clauses don't catch this "404 case", because you still received a valid HTTP response, so there was nothing to raise an exception about:
except requests.HTTPError:
raise requests.HTTPError('Received non-200 status code.')
except requests.RequestException:
raise requests.RequestException
So basically you have two options to fix this:

- check the status code manually: if r.status_code != 200: ...
- use the raise_for_status() method, which raises an exception if 400 <= r.status_code < 600
"I might want to delete the account, or at least remove the username from its row."

Well, this part of your question sounds a bit vague, but here's one idea.

For example, if you hit a 404 page, you can raise a custom exception while processing the response, catch it later in __main__, delete the record from the database, and continue with the other pages:
class NotFoundError(Exception):
    """ my custom exception for not found pages """
    pass

... # other functions

def extract_json_data(url):
    r = requests.get(url, headers=headers)
    if r.status_code == 404:
        raise NotFoundError()  # page not found
    # if any other error occurs (network unavailable for example) - an exception will be raised
    soup = BeautifulSoup(r.content, "html.parser")
    scripts = soup.find_all('script', type="text/javascript", text=re.compile('window._sharedData'))
    stringified_json = scripts[0].get_text().replace('window._sharedData = ', '')[:-1]
    return json.loads(stringified_json)['entry_data']['ProfilePage'][0]

if __name__ == '__main__':
    while True:
        sleep(randint(5, 15))
        username = get_username_from_db()
        url = f'https://www.instagram.com/{username}/'
        try:
            j = extract_json_data(url)
        except NotFoundError:
            delete_user_from_db(username)  # implement: DELETE FROM t WHERE username = ...
            continue  # proceed for next user page
        # rest of your code:
        # json_string = json.dumps(j)
        # user_id = j['graphql']['user']['id']
        # ...
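A minimal sketch of the delete_user_from_db helper the snippet above assumes (table and column names are taken from the question; the connection is passed in explicitly here, whereas the question's code uses a module-level pymysql connection):

```python
def delete_user_from_db(connection, username):
    # Remove a username that no longer exists on Instagram,
    # so the main loop stops revisiting the same dead account.
    with connection.cursor() as cursor:
        cursor.execute(
            "DELETE FROM `ig_users_raw` WHERE `username` = %s",
            (username,),  # parameterized to avoid SQL injection
        )
    connection.commit()
```

Note that with pymysql, autocommit is off by default, hence the explicit connection.commit().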