I have a text file with the following format:
str = '{"controversiality":0,"body":"A look at Vietnam and Mexico liberalisation.","subreddit_id":"t5_6"}{"subreddit_id":"t5_6", "controversiality":0,"body":"I need a cup of water."}';
I want to extract the body fields from str in this format:
A look at Vietnam and Mexico liberalisation.
I need a cup of water.
Answer 0 (score: 1)
For data in this shape, a JSON file is a better fit than a .txt file.
For example:
File structure:
.
├── data.json
└── parser.py
data.json
{
    "data": [
        {
            "controversiality": 0,
            "body": "A look at Vietnam and Mexico liberalisation.",
            "subreddit_id": "t5_6"
        },
        {
            "subreddit_id": "t5_6",
            "controversiality": 0,
            "body": "I need a cup of water."
        }
    ]
}
Reading the JSON data
parser.py
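import json

# Load the whole file into a Python dict.
with open('data.json') as f:
    data = json.load(f)

json_data = data['data']  # list of post dicts
for j in json_data:
    # Skip posts with a missing or empty body.
    if j.get('body'):
        print(j['body'])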
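Answer 1 (score: 0)
Each part of the string enclosed in {} is in JSON format. The json library makes the data easy to access by converting it to a dictionary. For example: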
import json
post_str = '{"controversiality":0,"body":"A look at Vietnam and Mexico liberalisation.","subreddit_id":"t5_6"}'
post_json = json.loads(post_str)
body = post_json['body']
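Your example is a bit tricky because the posts are not cleanly separated; if each post were on its own line, for example, processing the file line by line would be trivial. Ideally you would save the data as a JSON file, as Giorgi suggests, but if you are already past the data-collection stage, the following works for your current data, though it is a bit ugly: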
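import json

# Named data_str rather than str to avoid shadowing the built-in type.
data_str = '{"controversiality":0,"body":"A look at Vietnam and Mexico liberalisation.","subreddit_id":"t5_6"}{"subreddit_id":"t5_6", "controversiality":0,"body":"I need a cup of water."}'
post_bodies = []
# Splitting on '}' only works while no value contains '}' and no object is nested.
for post_str in data_str.split('}')[:-1]:
    post_str += '}'  # restore the brace removed by split
    post = json.loads(post_str)
    post_bodies.append(post["body"])
# One body per line, matching the output requested in the question.
desired_output = "\n".join(post_bodies)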
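If a body might itself contain a '}' character, splitting on braces will break. One possible alternative, sketched here and not part of either answer above, is to let the standard library's json.JSONDecoder.raw_decode find where each object ends. The helper name iter_json_objects and the variable names are made up for illustration.

```python
import json


def iter_json_objects(text):
    """Yield each JSON value found back to back in text (hypothetical helper)."""
    decoder = json.JSONDecoder()
    pos = 0
    while pos < len(text):
        # raw_decode does not skip leading whitespace, so do it by hand.
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if pos >= len(text):
            break
        # Returns the parsed value and the index just past its end.
        obj, pos = decoder.raw_decode(text, pos)
        yield obj


raw = '{"controversiality":0,"body":"A look at Vietnam and Mexico liberalisation.","subreddit_id":"t5_6"}{"subreddit_id":"t5_6", "controversiality":0,"body":"I need a cup of water."}'

# Collect the body of every post that has one.
bodies = [post["body"] for post in iter_json_objects(raw) if "body" in post]
print("\n".join(bodies))
```

Because the decoder tracks string and brace context itself, this also copes with nested objects and with whitespace or newlines between posts.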