Python: parsing a file with single quotes, double quotes, and contractions

Time: 2018-08-01 13:56:53

Tags: python json parsing text-mining

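I'm trying to parse a file in which some lines may contain a combination of single quotes, double quotes, and contractions. Each observation contains a string like that, and I run into problems when I try to parse the reviews. For example: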

\'text\' : \'This is the first time I've tried really "fancy food" at a...\' 

\'text\' : \'I' be happy to go back "next hollidy"\' 

1 answer:

Answer 0: (score: 0)

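Preprocess the string with a simple double replace: first escape all the double quotes, then replace all the escaped apostrophes with double quotes. This simply inverts the escaping, e.g.: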

# we'll define it as an object to keep the validity
src = "{\\'text\\' : \\'This is the first time I've tried really \"fancy food\" at a...\\'}"
# The double escapes are just so we can type it properly in Python.
# It's still the same underneath:
# {\'text\' : \'This is the first time I've tried really "fancy food" at a...\'}

preprocessed = src.replace("\"", "\\\"").replace("\\'", "\"")
# Now it looks like:
# {"text" : "This is the first time I've tried really \"fancy food\" at a..."}

This is now valid JSON (and, incidentally, a valid Python dict literal), so you can go ahead and parse it:

import json

parsed = json.loads(preprocessed)
# {'text': 'This is the first time I\'ve tried really "fancy food" at a...'}

Or:

import ast

parsed = ast.literal_eval(preprocessed)
# {'text': 'This is the first time I\'ve tried really "fancy food" at a...'}
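The same double replace can be wrapped in a small helper and applied to every line of the file. This is only a minimal sketch: it assumes each line holds a single key/value pair in the format shown in the question and contains no backslashes other than the ones in front of the delimiting apostrophes. `parse_line` is a hypothetical name, not part of any library.

```python
import json


def parse_line(line):
    # Hypothetical helper: wrap the key/value pair in braces so it forms an object.
    src = "{" + line.strip() + "}"
    # Escape the literal double quotes first, then turn the escaped
    # apostrophes (backslash + ') into the JSON string delimiters.
    preprocessed = src.replace('"', '\\"').replace("\\'", '"')
    return json.loads(preprocessed)


# The two sample lines from the question, written as raw strings so the
# backslashes are kept exactly as they appear in the file.
lines = [
    r"""\'text\' : \'This is the first time I've tried really "fancy food" at a...\' """,
    r"""\'text\' : \'I' be happy to go back "next hollidy"\' """,
]

parsed = [parse_line(line) for line in lines]
print(parsed[1]["text"])
```

Unescaped apostrophes inside the text (as in "I've") are left alone, because only the backslash-apostrophe pairs are treated as delimiters.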

Update

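Based on the lines you posted, what you actually have is a (valid) representation of a 7-element tuple that contains the string representation of a dictionary as its third element, so you don't need to preprocess the string at all. You need to evaluate the tuple first and then do another level of evaluation on the inner dict, i.e.: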

import ast

# let's first read the data from an 'input.txt' file so we don't have to manually escape it
with open("input.txt", "r") as f:
    data = f.read()

data = ast.literal_eval(data)  # first evaluate the main structure
data = data[:2] + (ast.literal_eval(data[2]), ) + data[3:]  # .. and then the inner dict

# this gives you `data` containing your 'serialized' tuple, i.e.:
print(data[4])  # 31.328237,-85.811893
# and you can access the children of the inner dict as well, i.e.:
print(data[2]["types"])  # ['restaurant', 'food', 'point_of_interest', 'establishment']
print(data[2]["opening_hours"]["weekday_text"][3])  # Thursday: 7:00 AM – 9:00 PM
# etc.
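For readers without the original input file, the two-level evaluation can be tried on made-up data. The record below is a hypothetical stand-in (the values are invented for illustration and are not the asker's data); it only mirrors the shape described above, a 7-element tuple whose third element is a dict stored as a string.

```python
import ast

# Hypothetical stand-in for the file contents: a 7-element tuple whose
# third element is the string representation of a dict.
inner = {"name": "Example Diner", "types": ["restaurant", "food"]}
record = ("id-1", "Example Diner", repr(inner), "n/a", "31.0,-85.0", 4.5, 12)
line = repr(record)

outer = ast.literal_eval(line)      # first level: the tuple itself
assert isinstance(outer[2], str)    # the dict is still just a string here
outer = outer[:2] + (ast.literal_eval(outer[2]),) + outer[3:]  # second level

print(outer[2]["types"])  # the inner dict is now a real dict
```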

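That said, I'd recommend tracking down whoever is generating data like this and convincing them to use some proper form of serialization; even the most basic JSON would be better than this.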
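As a rough illustration of that suggestion, here is what writing and reading one record per line as JSON could look like. The record and its field names are hypothetical; the point is only that the standard `json` module handles the quoting on both sides, so no preprocessing is needed.

```python
import json

# Hypothetical record; the field names are invented for illustration.
record = {"id": "id-1", "text": 'I\'ve tried really "fancy food" at a...'}

line = json.dumps(record)    # the producer writes one JSON object per line
restored = json.loads(line)  # the consumer reads it back with no preprocessing

print(restored == record)
```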