Parse a file for all occurrences of a string and generate key-values in JSON

Date: 2017-07-05 08:19:21

Tags: python json key-value

  1. I have a file (https://pastebin.com/STgtBRS8) in which I need to search for all occurrences of the word "silencedetect".

  2. I then have to generate a JSON file with key-values for "silence_start", "silence_end", and "silence_duration".

  3. The JSON file should look like this:

    [
    {
    "id": 1,
    "silence_start": -0.012381,
    "silence_end": 2.2059,
    "silence_duration": 2.21828
    },
    {
    "id": 2,
    "silence_start": 5.79261,
    "silence_end": 6.91955,
    "silence_duration": 1.12694
    }
    ]
    

    This is what I tried:

    with open('volume_data.csv', 'r') as myfile:
        data = myfile.read().replace('\n', '')
    
    for line in data:
        if "silencedetect" in data:
            # read silence_start, silence_end, and silence_duration and put in json
            pass
    

    I am not able to associate the 3 key-value pairs with each "silencedetect". How do I parse the key-values and get them in JSON format?

4 answers:

Answer 0 (score: 2)

You can regex it. This works for me:
import re

with open('volume_data.csv', 'r') as myfile:
    data = myfile.read()

d = re.findall(r'silence_start: (-?\d+\.\d+)\n.*?\n?\[silencedetect @ \w{14}\] silence_end: (-?\d+\.\d+) \| silence_duration: (-?\d+\.\d+)', data)
print(d)

You can put them in JSON with:
out = [{'id': i, 'start': a[0], 'end': a[1], 'duration': a[2]} for i, a in enumerate(d)]
import json
print(json.dumps(out))  # or write to file or... whatever

Output:

'[{"duration": "2.21828", "start": "-0.012381", "end": "2.2059", "id": 0}, {"duration": "1.12694", "start": "5.79261", "end": "6.91955", "id": 1}, {"duration": "0.59288", "start": "8.53256", "end": "9.12544", "id": 2}, {"duration": "1.0805", "start": "9.64712", "end": "10.7276", "id": 3}, {"duration": "1.03406", "start": "12.6657", "end": "13.6998", "id": 4}, {"duration": "0.871519", "start": "19.2602", "end": "20.1317", "id": 5}'

Edit: fixed a bug that missed some matches, because frame=.. lines can fall between the start and end of a match.
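Note that re.findall returns the numbers as strings, which is why the dumped values above are quoted; if numeric JSON values are wanted (as in the question's sample), converting with float in the comprehension fixes that. A minimal sketch with inline sample data (the log lines and hex address here are illustrative, not from the actual pastebin):

```python
import json
import re

# Two illustrative log lines in the ffmpeg silencedetect output format
sample = (
    "[silencedetect @ 0x7f8a2c000000] silence_start: -0.012381\n"
    "frame=   52 fps=0.0 q=-0.0 size=N/A\n"
    "[silencedetect @ 0x7f8a2c000000] silence_end: 2.2059 | silence_duration: 2.21828\n"
)

pairs = re.findall(
    r'silence_start: (-?\d+\.\d+)\n.*?\n?\[silencedetect @ \w+\] '
    r'silence_end: (-?\d+\.\d+) \| silence_duration: (-?\d+\.\d+)',
    sample)

# Convert the captured strings to floats so json.dumps emits numbers
out = [{'id': i + 1,
        'silence_start': float(s),
        'silence_end': float(e),
        'silence_duration': float(d)}
       for i, (s, e, d) in enumerate(pairs)]
print(json.dumps(out))
```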

Answer 1 (score: 1)

A solution using the re.findall and enumerate functions:

import re, json

with open('volume_data.txt', 'r') as f:
    result = []
    pat = re.compile(r'(silence_start: -?\d+\.\d+).+?(silence_end: -?\d+\.\d+).+?(silence_duration: -?\d+\.\d+)')
    silence_items = re.findall(pat, f.read().replace('\n', ''))
    for i,v in enumerate(silence_items):
        d = {'id': i+1}
        d.update({pair[:pair.find(':')]: float(pair[pair.find(':')+2:]) for pair in v})
        result.append(d)

    print(json.dumps(result, indent=4))

Output:

[
    {
        "id": 1,
        "silence_end": 2.2059,
        "silence_duration": 2.21828,
        "silence_start": -0.012381
    },
    {
        "id": 2,
        "silence_end": 6.91955,
        "silence_duration": 1.12694,
        "silence_start": 5.79261
    },
    {
        "id": 3,
        "silence_end": 9.12544,
        "silence_duration": 0.59288,
        "silence_start": 8.53256
    },
    {
        "id": 4,
        "silence_end": 10.7276,
        "silence_duration": 1.0805,
        "silence_start": 9.64712
    },
    {
        "id": 5,
        "silence_end": 13.6998,
        "silence_duration": 1.03406,
        "silence_start": 12.6657
    },
    {
        "id": 6,
        "silence_end": 20.1317,
        "silence_duration": 0.871519,
        "silence_start": 19.2602
    },
    {
        "id": 7,
        "silence_end": 22.4305,
        "silence_duration": 0.801859,
        "silence_start": 21.6286
    },
    ...
]
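Since the question asks for a JSON file rather than just printed output, the result list from any of these approaches can be written out with json.dump (the filename here is just an example):

```python
import json

# "result" stands for the parsed list built above; shown with one entry
result = [{"id": 1, "silence_start": -0.012381,
           "silence_end": 2.2059, "silence_duration": 2.21828}]

# Serialize straight to a file instead of building a string first
with open("silences.json", "w") as f:
    json.dump(result, f, indent=4)
```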

Answer 2 (score: 1)

Assuming your data is well ordered, you can simply parse it as a stream; there is no need for regex or for loading the whole file at all:

import json

parsed = []  # a list to hold our parsed values
with open("entries.dat", "r") as f:  # open the file for reading
    current_id = 1  # holds our ID
    entry = None  # holds the current parsed entry
    for line in f:  # ... go through the file line by line
        if line[:14] == "[silencedetect":  # parse the lines starting with [silencedetect
            if entry:  # we already picked up silence_start
                index = line.find("silence_end:")  # find where silence_end starts
                value = line[index + 12:line.find("|", index)].strip()  # the number after it
                entry["silence_end"] = float(value)  # store the silence_end
                # the following step is optional, instead of parsing you can just calculate
                # the silence_duration yourself with:
                # entry["silence_duration"] = entry["silence_end"] - entry["silence_start"]
                index = line.find("silence_duration:")  # find where silence_duration starts
                value = line[index + 17:].strip()  # grab the number after it
                entry["silence_duration"] = float(value)  # store the silence_duration
                # and now that we have everything...
                parsed.append(entry)  # add the entry to our parsed list
                entry = None  # blank out the entry for the next step
            else:  # find silence_start first
                index = line.find("silence_start:")  # find where silence_start, well, starts
                value = line[index + 14:].strip()  # grab the number after it
                entry = {"id": current_id}  # store the current ID...
                entry["silence_start"] = float(value)  # ... and the silence_start
                current_id += 1  # increase our ID value for the next entry

# Now that we have our data, we can easily turn it into JSON and print it out if needed
your_json = json.dumps(parsed, indent=4)  # holds the JSON, pretty-printed
print(your_json)  # let's print it...

You get:

[
    {
        "silence_end": 2.2059, 
        "silence_duration": 2.21828, 
        "id": 1, 
        "silence_start": -0.012381
    }, 
    {
        "silence_end": 6.91955, 
        "silence_duration": 1.12694, 
        "id": 2, 
        "silence_start": 5.79261
    }, 
    {
        "silence_end": 9.12544, 
        "silence_duration": 0.59288, 
        "id": 3, 
        "silence_start": 8.53256
    }, 
    {
        "silence_end": 10.7276, 
        "silence_duration": 1.0805, 
        "id": 4, 
        "silence_start": 9.64712
    }, 
    # 
    # etc.
    # 
    {
        "silence_end": 795.516, 
        "silence_duration": 0.68576, 
        "id": 189, 
        "silence_start": 794.83
    }
]

Keep in mind that JSON doesn't prescribe data order (and neither did Python's dict before v3.5), so id won't necessarily appear first, but the data validity is the same.

I deliberately kept the initial entry creation separate so that you can use collections.OrderedDict as a drop-in replacement (i.e. entry = collections.OrderedDict({"id": current_id})) to preserve the order, if that's what you want.
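For instance, a minimal sketch of that OrderedDict variant (the values are just the first entry from the sample data):

```python
import collections
import json

entry = collections.OrderedDict({"id": 1})  # id goes in first
entry["silence_start"] = -0.012381
entry["silence_end"] = 2.2059
entry["silence_duration"] = 2.21828

# json.dumps preserves the insertion order of an OrderedDict
print(json.dumps(entry))
```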

Answer 3 (score: 0)

导入重新 导入json

以open('volume_data.csv','r')作为myfile:     data = myfile.read()

matcher = re.compile('(?P<g1>[silencedetect @ \w+?\])\s+?silence_start:\s+?(?P<g2>-?\d+?\.\d+?).*?\n([^\[]+?\n)?(?P=g1)\s+?silence_end:\s+?(?P<g3>-?\d+?\.\d+?).+?\|\s+?silence_duration:\s+?(?P<g4>-?\d+?\.\d+?).*?\n')
matchiter= matcher.findall(data)
#(1) (2)
string=""
for i, matchediter in enumerate( matchiter):
    string+= '{"id": {},\n, "silence_start":{},\n"silence_end": {},\n"silence_duration":{}}'. format(i, matchediter.group(g2),matchediter.group(g3),matchediter.group(g4)).

json.dumps(string)

(1) You may want to pass some flags, such as re.IGNORECASE, to make your script immune to such changes.

(2) I prefer non-greedy sequence recognition patterns; this may have an impact on both recognition and speed. Using named groups is a matter of personal taste. They are useful if you decide to reformat the read() text at once with a matcher.sub operation instead of rebuilding the file text with an iteration; I can add the substitution string if you can't figure it out. Also, I prefer to work with match objects through .group, which is made for this, and lets you use names of your choice instead of g1, g2, g3, g4.

Overall, I prefer using finditer, because it is basically made for this kind of operation; findall yields tuples of the captured groups, which is fine, but you may sometimes want information relative to the full match, the pattern, position indices in the text, etc.
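To illustrate the findall-vs-finditer point on a toy pattern (the text and group name here are made up for the demo):

```python
import re

text = "silence_start: 1.5 ... silence_start: 2.5"
pat = re.compile(r'silence_start: (?P<val>-?\d+\.\d+)')

# findall returns only the captured group text
print(pat.findall(text))  # ['1.5', '2.5']

# finditer yields full match objects with named groups and positions
for m in pat.finditer(text):
    print(m.group('val'), m.start(), m.end())
```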

Edit: I set the regex to allow any string appended after the duration digits, as well as multiple whitespace characters. I also accounted for inserted lines, which you can capture through named groups if you need them. It captures 189 occurrences; there are 190 "silence_start"s, but the last one has no end and duration info.