Extracting text from a text file based on a pattern in Python

Time: 2021-04-20 10:49:56

Tags: python python-3.x

I have a text file in the following format:

str = '{"controversiality":0,"body":"A look at Vietnam and Mexico liberalisation.","subreddit_id":"t5_6"}{"subreddit_id":"t5_6", "controversiality":0,"body":"I need a cup of water."}';

I want to extract the body field from str, in this format:

A look at Vietnam and Mexico liberalisation.
I need a cup of water.

2 answers:

Answer 0 (score: 1)

For data in this format, it is better to use a JSON file than a .txt file. For example:
File structure:

.
├── data.json
└── parser.py

data.json

{
    "data": [
        {
            "controversiality": 0,
            "body": "A look at Vietnam and Mexico liberalisation.",
            "subreddit_id": "t5_6"
        },
        {
            "subreddit_id": "t5_6",
            "controversiality": 0,
            "body": "I need a cup of water."
        }
    ]
}

Reading the JSON data
parser.py

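import json

# Load the whole file into a Python dict
with open('data.json') as f:
    data = json.load(f)

# The posts live in the list under the top-level "data" key
json_data = data['data']
for j in json_data:
    # Skip posts that have no body (or an empty one)
    if j.get('body'):
        print(j['body'])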
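
If the posts are already available as Python dictionaries, a file with the layout above could be produced with json.dump. The following is only a minimal sketch under that assumption: the helper names dump_posts and load_bodies are hypothetical, and io.StringIO stands in for a real file so the example leaves nothing on disk.

```python
import io
import json


def dump_posts(posts, fp):
    """Write a list of post dicts to the open file `fp` under a "data" key."""
    json.dump({"data": posts}, fp, indent=4)


def load_bodies(fp):
    """Return the non-empty "body" values from a file written by dump_posts."""
    data = json.load(fp)
    return [post["body"] for post in data["data"] if post.get("body")]


posts = [
    {"controversiality": 0, "body": "A look at Vietnam and Mexico liberalisation.", "subreddit_id": "t5_6"},
    {"subreddit_id": "t5_6", "controversiality": 0, "body": "I need a cup of water."},
]

# io.StringIO stands in for a real file such as open("data.json", "w")
buffer = io.StringIO()
dump_posts(posts, buffer)
buffer.seek(0)  # rewind so the same buffer can be read back
print("\n".join(load_bodies(buffer)))
```

With a real file, the same two helpers would be called inside with open('data.json', 'w') and with open('data.json') blocks respectively.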

Answer 1 (score: 0)

Each part of the string enclosed in {} is valid JSON. The json library makes the data easy to access by converting it to a dictionary. For example:

import json
post_str = '{"controversiality":0,"body":"A look at Vietnam and Mexico liberalisation.","subreddit_id":"t5_6"}'
post_json = json.loads(post_str)
body = post_json['body']

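Your example is a bit tricky because the posts are not cleanly separated. If each post were on its own line, for example, processing the file line by line would be trivial. Ideally you would save the data as a JSON file, as Giorgi suggested, but if you are already past the data-collection stage, the following works on your current data, although it is a bit ugly: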

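import json

# Named `raw` rather than `str` to avoid shadowing the built-in type
raw = '{"controversiality":0,"body":"A look at Vietnam and Mexico liberalisation.","subreddit_id":"t5_6"}{"subreddit_id":"t5_6", "controversiality":0,"body":"I need a cup of water."}'
post_bodies = []
# Naive split on '}': breaks if a body contains '}' or a post has nested objects
for post_str in raw.split('}')[:-1]:
    post_str += '}'  # restore the brace that split() removed
    post = json.loads(post_str)
    post_bodies.append(post["body"])
# One body per line, matching the output asked for in the question
desired_output = "\n".join(post_bodies)
print(desired_output)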
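
Splitting on '}' fails as soon as a body contains a closing brace or a post holds a nested object. One alternative, shown here only as a sketch and not as part of the original answer, is to let the standard library's json.JSONDecoder.raw_decode consume one object at a time and report where it stopped. The helper name iter_json_objects is hypothetical.

```python
import json


def iter_json_objects(text):
    """Yield each JSON value from a string of back-to-back JSON documents."""
    decoder = json.JSONDecoder()
    pos = 0
    end = len(text)
    while pos < end:
        # raw_decode does not skip leading whitespace, so skip it here
        while pos < end and text[pos].isspace():
            pos += 1
        if pos >= end:
            break
        # Returns the decoded value and the index just past it
        obj, pos = decoder.raw_decode(text, pos)
        yield obj


raw = '{"controversiality":0,"body":"A look at Vietnam and Mexico liberalisation.","subreddit_id":"t5_6"}{"subreddit_id":"t5_6", "controversiality":0,"body":"I need a cup of water."}'
bodies = [post["body"] for post in iter_json_objects(raw)]
print("\n".join(bodies))
```

Because the decoder tracks string and brace nesting itself, a body such as "a } b" is handled without special cases. For a file, the text would first be read with something like open(path).read(), which assumes the file fits in memory.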