Decoding JSON with invalid characters

Time: 2014-05-30 11:25:27

Tags: python json encoding

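I have a service that receives data from an external service (via a redis list used as a queue). The data is just a flat JSON-encoded dictionary; an example might look like this: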

{
  "type": "visit",
  "referer": "http://www.google.com/",
  "session_referer": "http://www.google.com/\x0e",
  "uuid": "48e8ea41-420d-021c-be16-7ac5b7c6fb97",
  "user_ip": "1.2.3.4",
  "user_agent": "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.114 Safari/537.36",
  "user_locale": "en_US",
}

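The problem is that, as you can see in the example above, the referer or session_referer sometimes contains invalid data (it cannot be decoded with any encoding I would expect, such as UTF-8 or ISO-8859-1).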

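My problem is that I then cannot access any of the other data. I can live with the referer being garbled, but I still need everything else. Is there a way to do a "raw" decode without converting the data to any particular encoding, and then let me handle it from there?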

2 answers:

Answer 0: (score: 2)

Given a text file containing a JSON-like "string" with

  1. a hex 0E byte in the "session_referer" value, and
  2. a spurious comma after the last key/value pair,

    the following Python code removes the troublesome values...

    # -*- coding: iso-8859-1 -*-
    import json
    import re
    
    # retrieve the JSON data into a string
    f = open(r'C:\Users\Gord\Desktop\jsonData.txt', 'r')
    s = f.read()
    f.close()
    print '~> raw JSON string'
    print s
    print
    
    # remove "characters" below \x20 except \n
    s = re.sub(r'[\000-\011\013-\037]', '', s)
    # remove (extraneous) last comma
    s = re.sub(',\n}$', '\n}', s)
    print '~> tweaked JSON string'
    print s
    print
    
    # decode tweaked JSON string
    j = json.loads(s)
    
    # see what we got
    print '~> decoded result "pretty printed"'
    print json.dumps(j, sort_keys=True, indent=4, separators=(',', ': '))
    print
    
    # extract just one element
    print '~> print just j["user_ip"]'
    print j["user_ip"]
    

    ...and produces the following output in the Python IDLE shell:

    Python 2.7.5 (default, May 15 2013, 22:43:36) [MSC v.1500 32 bit (Intel)] on win32
    Type "copyright", "credits" or "license()" for more information.
    >>> ================================ RESTART ================================
    >>> 
    ~> raw JSON string
    {
      "type": "visit",
      "referer": "http://www.google.com/",
      "session_referer": "http://www.google.com/♫",
      "uuid": "48e8ea41-420d-021c-be16-7ac5b7c6fb97",
      "user_ip": "1.2.3.4",
      "user_agent": "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.114 Safari/537.36",
      "user_locale": "en_US",
    }
    
    ~> tweaked JSON string
    {
      "type": "visit",
      "referer": "http://www.google.com/",
      "session_referer": "http://www.google.com/",
      "uuid": "48e8ea41-420d-021c-be16-7ac5b7c6fb97",
      "user_ip": "1.2.3.4",
      "user_agent": "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.114 Safari/537.36",
      "user_locale": "en_US"
    }
    
    ~> decoded result "pretty printed"
    {
        "referer": "http://www.google.com/",
        "session_referer": "http://www.google.com/",
        "type": "visit",
        "user_agent": "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.114 Safari/537.36",
        "user_ip": "1.2.3.4",
        "user_locale": "en_US",
        "uuid": "48e8ea41-420d-021c-be16-7ac5b7c6fb97"
    }
    
    ~> print just j["user_ip"]
    1.2.3.4
    >>> 
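
The same cleanup can also be applied to a string already in memory (for example, a message popped from the queue) instead of a file. What follows is only a minimal sketch, assuming Python 3 and a flat object like the one in the question. `loads_lenient` and the sample `message` are hypothetical names made up for illustration, and the trailing-comma pattern is a slightly more general form of the one used in the code above:

```python
import json
import re

# Control characters below 0x20, except \n (same range as the octal pattern above).
CONTROL_CHARS = re.compile(r'[\x00-\x09\x0b-\x1f]')
# A comma directly before the final closing brace, allowing whitespace in between.
TRAILING_COMMA = re.compile(r',\s*}\s*$')


def loads_lenient(text):
    """Strip control characters and a trailing comma, then decode (hypothetical helper)."""
    text = CONTROL_CHARS.sub('', text)
    text = TRAILING_COMMA.sub('}', text)
    return json.loads(text)


# Made-up sample: a raw 0x0E character in one value and a trailing comma at the end.
message = (
    '{\n'
    '  "type": "visit",\n'
    '  "session_referer": "http://www.google.com/\x0e",\n'
    '  "user_ip": "1.2.3.4",\n'
    '}'
)

record = loads_lenient(message)
print(record["user_ip"])
print(record["session_referer"])
```

Like the original pattern, this also drops tab characters, and it only handles a comma before the very last closing brace, so it is suited to flat objects rather than nested ones.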
    

Answer 1: (score: 1)

You can try setting strict=False, which allows control characters inside strings.

https://docs.python.org/2/library/json.html
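
As a rough illustration of that suggestion, the sketch below passes `strict=False` to `json.loads`. The `raw` payload is a made-up, shortened version of the sample in the question, with the trailing comma left out on purpose: `strict=False` only relaxes the rule about control characters inside strings, so a trailing comma would still have to be removed separately.

```python
import json

# Made-up payload: a raw 0x0E control character sits inside a string value.
raw = (
    '{"referer": "http://www.google.com/", '
    '"session_referer": "http://www.google.com/\x0e", '
    '"user_ip": "1.2.3.4"}'
)

# With the default strict=True, unescaped control characters in strings are rejected.
try:
    json.loads(raw)
    strict_ok = True
except ValueError:
    strict_ok = False

# With strict=False the document decodes and the control character is kept as-is.
data = json.loads(raw, strict=False)

print(strict_ok)
print(data["user_ip"])
print(repr(data["session_referer"]))
```

The control character stays in the decoded value, so the garbled field can be inspected or cleaned afterwards while the other keys remain usable.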