Question

我正在尝试使用python抓取我的messenger.com（facebook messenger）聊天，我使用谷歌chromes开发人员工具查看聊天记录的POST请求，我已将整个标题和正文复制为请求的格式可以使用。

我得到HTTP代码200暗示请求至少得到了某些东西但我可以print res.encoding获得它返回的编码，其中声称是utf-8。但我无法解码它！

这是功能：

def download_thread(self, limit, offset, message_timestamp):
    """Download the specified number of messages from the
    provided thread, with an optional offset
    """
    data = request_data(self.thread, offset=offset,
                        limit=limit, group=self.group,
                        timestamp=message_timestamp)

    res = self.ses.post(url_thread, data=data, headers=headers)

    print(res.content)

    thread_contents = json.loads(res.content)
    print(thread_contents)
    return thread_contents

产量

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x87 in position 0: invalid start byte

尝试json.load（或loads）数据时

但res.encoding确实返回utf-8。

我尝试使用gzip解压缩但是说它不是gzip压缩的内容。

如果我只是尝试print(res.content)我得到

Traceback (most recent call last):
  File "FBChatScraper.py", line 200, in <module>
    main()
  File "FBChatScraper.py", line 134, in main
    fbms.run()
0f\x82\x048\xbb\xb9=\x87\xebK0.\xff\x90\xdd\xeb\xfa\x16\xc6\xbbz\x8b\x82)\xe8\xaaV\x01^\xda\x8b\xbd\x15d-\xb1\x10@\x17\\\xd43\xa8\x92w\xe8\xc0\xcdU\xc4\xff\xc7\xfa\x90\xb2\xb3\xf5\x84\x11u\x0b\t\x8f\x83r\xf3}\xe5!y$\xe6\xf6c0\xf0\xb4\x98\xcat_\x0c\x08\xb5\xdd\x8ctx\x91\xa9\x95\rB%\xe2\x93\xa52\x85_\xa6\x10\xc2\xc9\xa3\xee4SDb\xa5\x18QJ\x83X\x19)\xaa$\xf4\xb4\xb7\x0b\x84\x15&\x88\x08L\xc9iP\xa2\xb9\xf2\xaf\x96\x96N\xd8\xcf=\x05\xc1\x18\x8d\xa0\xf2Y\x8e\n\xcf\xc8\x0fE4\xd6)\xa1\xd4\xb7D\xd6{i\xc8P\x96R\x11HC\xac\xbcKyT#~}\x93\xf7@K\xc7r/\x82\xb0\xe4\xefX\xf9j\x08\xa6Hp\xfcn\x06\xfdo\x9a\xd0wJ\xb4fJ(\x89+\x1c\xf6\x0eOI\x90\xac\x9eDD\xfd,\xa5\xe9\x89\x1blh\x86Z\x98\x05\xdd9\xc7\xf4\x80\xfcY\x8e\xad\xee\x99!\x15\x13+\x9b\x07\xe8Fdj\xfc\x11\xfc\xfe7\x06h\x02\x00@>]W\x92\xc9\x02\xb1c3\x82\xcd\xa4\xefN9\x90\xe6\x81y\x9c\x84er\xd4\xc3\x06\x1c\x06\x14\xcf\xc7\x07hj\xbfH\xdc\xf5~\xf7z\x18Ce\xaf^\x8c\xab \xdfV\xce\xb8\x11\xf8\x06\x03'

Traceback (most recent call last):
  File "FBChatScraper.py", line 200, in <module>
    main()
  File "FBChatScraper.py", line 134, in main
    fbms.run()
  File "FBChatScraper.py", line 43, in run
    thread_contents = self.download_thread(limit, offset, message_timestamp)
  File "FBChatScraper.py", line 74, in download_thread
    thread_contents = json.loads(res.content)
  File "/Users/silman/anaconda/lib/python3.6/json/__init__.py", line 349, in loads
    s = s.decode(detect_encoding(s), 'surrogatepass')
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x87 in position 0: invalid start byte

奇怪的是，在追溯过程中打印内容让我觉得有一些看不见的字符将其推倒。

我无法将响应加载到json格式中，因为无论我如何处理响应内容，它都没有正确格式化以供json库解释。

此外，如果我只是做print(res.text)我会得到垃圾：

Traceback (most recent call last):
  File "FBChatScraper.py", line 200, in <module>
    main()
  File "FBChatScraper.py", line 134, in main
    fbms.run()
}sP���c���f�u0���\� QZed�C��� M$x�Ҹ�H�����eǘ�]���5���^�*�ӄaM�Y��b���/ڶ�JW/���>H6z�\��l4����t=i��%Ҳu�x��%�x�
       F    <���{1i�#%;�rɲ=Rχm��1B�Z(+�(S-���#��\v�{b��
                                                           �    f/V�i̴��_��83�  �_����*��O��
                                                                                            ������Z��i-�TVeaG54�!v�a?ǯ|gu-g��.���"J$�L`&�tΊ#s)�H����s���q���^׷0��[)���j�ॽ�T���U���J�ЁwW���!eg�#j ��r��$y���3�4��4.��M�@Kb�AX�SDb�QJ�X)�,���a�   "Sp�h�����sOA0Vé|�������:%�rKdKC���@ M��.�^
�       �g���SWQHӳ.��BӄG�,����@E��������
                                        nras��L�/��ch@>]W���c3�ͤ�N9��y��er����hj�H��~�zCe�^�� �Vθ�

Traceback (most recent call last):
  File "FBChatScraper.py", line 200, in <module>
    main()
  File "FBChatScraper.py", line 134, in main
    fbms.run()
  File "FBChatScraper.py", line 43, in run
    thread_contents = self.download_thread(limit, offset, message_timestamp)
  File "FBChatScraper.py", line 74, in download_thread
    thread_contents = json.loads(res.content)
  File "/Users/silman/anaconda/lib/python3.6/json/__init__.py", line 349, in loads
    s = s.decode(detect_encoding(s), 'surrogatepass')
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x87 in position 0: invalid start byte

编辑：

MWE尽我所能，不确定我的帖子请求中的哪些数据是私有的，所以我留下了一些

使用此数据

url_thread = "https://www.messenger.com/api/graphqlbatch/"


request_data = {
  "batch_name": "MessengerGraphQLThreadFetcher",
  "__user": "<user_id>",
  "__a": "1",
  "__dyn": "<dyn>",
  "__req": "9",
  '__be'      : '-1',
  '__pc'      : 'PHASED:messengerdotcom_pkg',
  "fb_dtsg": "AQFni7TU2nes:AQGSC8FSDqyw",
  "ttstamp": "265817254666710077746711957586581715370521181008510710777",
  "__rev": "3791607",
  "jazoest": "<jazoest>",
  "queries": '<queries>'
  }

headers = {
  "authority": "www.messenger.com",
  "method": "POST",
  "path": "/api/graphqlbatch/",
  "scheme": "https",
  "accept": "*/*",
  "accept-encoding": "gzip, deflate, br",
  "accept-language": "en-US,en;q=0.9",
  "cache-control": "no-cache",
  "content-length": "754",
  "content-type" : "application/x-www-form-urlencoded",
  "cookie": "<cookies>",
  "origin": "https://www.messenger.com",
  "pragma": "no-cache",
  "referer": "https://www.messenger.com/t/<chatID>",
  "user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36"
}

您可以使用Chrome开发人员工具获取所有<items>，并在网络标签上查看对Request URL: https://www.messenger.com/api/graphqlbatch/的POST请求。

如果您在Chrome开发工具正在录制时向上滚动以重新加载旧邮件，则很容易找到。

然后用python

组合一个简单的请求

import requests as rq
import time

ses = rq.Session()
thread = <ID of thread found in URL of messenger.com>

conversation_type = <'thread_fbids' if group chat else 'user_ids'>

data = request_data
data['messages[{}][{}][offset]'.format(conversation_type, thread)] = 0
data['messages[{}][{}][timestamp]'.format(conversation_type, thread)] = int(time.time())
data['messages[{}][{}][limit]'.format(conversation_type, thread)] = 2000

res = ses.post(url_thread, data=data, headers=headers)

print(res.content)
thread_contents = json.loads(res.content)
print(thread_contents)

正如我的开发工具回来的那样，你可以看到json here

的开头

Answer 1

问题是请求标题中的这一行：

"accept-encoding": "gzip, deflate, br",

br请求Brotli compression，一种新的压缩标准（请参阅RFC 7932），谷歌正在推动取代网络上的gzip。 Chrome正在要求Brotli，因为最新版本的Chrome本身就能理解它。您要求Brotli，因为您从Chrome复制了标题。但requests本身并不了解Brotli。

您可以pip install brotli注册解压缩程序，或者只需在res.content上手动调用它。但更简单的解决方案是删除br：

"accept-encoding": "gzip, deflate",

...然后你应该得到gzip，你和requests已经知道如何处理。

Python请求以utf-8编码的响应但无法解码

1 个答案: