Reg Expression to reduce a huge file size?

Date: 2018-04-05 11:52:21

Tags: json regex vba extract large-data

I have a series of gigantic (40-80mb) exported Google Location History JSON files, and I've been tasked with analyzing select activity data. Unfortunately, Google offers no parameter or option on its download site to select anything other than one giant JSON containing everything, ever. (The KML option is twice as big.)

The obvious choices, like a JSON converter (the laexcel-test incarnation of VBA-JSON), parsing line-by-line with VBA, or even Notepad++, all crash and burn. I'm thinking RegEx may be the answer.

  1. This Python script can extract the timestamps and locations from a 40mb file in two seconds (using RegEx?). How is Python doing it so fast? (Would it be that fast in VBA?)

  2. I could extract everything I need, bit by bit, if only I had a magic block of RegEx, perhaps with logic like this:

    • Delete everything EXCEPT:
      whenever timestampMs and WALKING appear between the same set of [square brackets]:

      • I need the 13-digit number that follows timestampMs,
      • and the 1-to-3-digit number that follows WALKING.
  3. If it's simpler to include more data, like "all of the timestamps" or "all of the activities", I can easily sift through it afterwards. My goal is to get the file small enough that I can manipulate it without needing to rent a supercomputer, lol.

    I've tried adapting existing RegEx's, but I have a serious problem with both RegEx and musical instruments: no matter how hard I try, I just can't seem to wrap my head around either one. So, this really is a "please write the code for me" question, but it's only one expression, and I'm paying it forward by writing code for others today! Thanks...

      }, {
        "timestampMs" : "1515564666086",    ◁― (Don't need this but it won't hurt)
        "latitudeE7" : -6857630899, 
        "longitudeE7" : -1779694452999,
        "activity" : [ {
          "timestampMs" : "1515564665992",  ◁― EXAMPLE: I want only this, and...
          "activity" : [ {
            "type" : "STILL",
            "confidence" : 65
          }, {                                              ↓
            "type" : "TILTING",
            "confidence" : 4
          }, {
            "type" : "IN_RAIL_VEHICLE",
            "confidence" : 20                               ↓
          }, {
            "type" : "IN_ROAD_VEHICLE",
            "confidence" : 5
          }, {
            "type" : "ON_FOOT",                             ↓
            "confidence" : 3
          }, {
            "type" : "UNKNOWN",
            "confidence" : 3
          }, {
            "type" : "WALKING",             ◁―┬━━ ...AND, I also want this.
            "confidence" : 3                ◁―┘
          } ]
        } ]
      }, {
        "timestampMs" : "1515564662594",    ◁― (Don't need this but it won't hurt)
        "latitudeE7" : -6857630899, 
        "longitudeE7" : -1779694452999,
        "altitude" : 42
      }, {
    

    EDIT:

    For testing purposes, I made a sample file that is representative of the original (except for size). The raw JSON can be downloaded directly from this Pastebin link, downloaded as a local copy from this TinyUpload link, or copied as "one long line" below:

    {"locations" : [ {"timestampMs" : "1515565441334","latitudeE7" : 123456789,"longitudeE7" : -123456789,"accuracy" : 2299}, {"timestampMs" : "1515565288606","latitudeE7" : 123456789,"longitudeE7" : -123456789,"accuracy" : 12,"velocity" : 0,"heading" : 350,"altitude" : 42,"activity" : [ {"timestampMs" : "1515565288515","activity" : [ {"type" : "STILL","confidence" : 98}, {"type" : "ON_FOOT","confidence" : 1}, {"type" : "UNKNOWN","confidence" : 1}, {"type" : "WALKING","confidence" : 1} ]} ]}, {"timestampMs" : "1515565285131","latitudeE7" : 123456789,"longitudeE7" : -123456789,"accuracy" : 12,"velocity" : 0,"heading" : 350,"altitude" : 42}, {"timestampMs" : "1513511490011","latitudeE7" : 123456789,"longitudeE7" : -123456789,"accuracy" : 25,"altitude" : -9,"verticalAccuracy" : 2}, {"timestampMs" : "1513511369962","latitudeE7" : 123456789,"longitudeE7" : -123456789,"accuracy" : 25,"altitude" : -9,"verticalAccuracy" : 2}, {"timestampMs" : "1513511179720","latitudeE7" : 123456789,"longitudeE7" : -123456789,"accuracy" : 16,"altitude" : -12,"verticalAccuracy" : 2}, {"timestampMs" : "1513511059677","latitudeE7" : 123456789,"longitudeE7" : -123456789,"accuracy" : 16,"altitude" : -12,"verticalAccuracy" : 2}, {"timestampMs" : "1513510928842","latitudeE7" : 123456789,"longitudeE7" : -123456789,"accuracy" : 16,"altitude" : -12,"verticalAccuracy" : 2,"activity" : [ {"timestampMs" : "1513510942911","activity" : [ {"type" : "STILL","confidence" : 100} ]} ]}, {"timestampMs" : "1513510913776","latitudeE7" : 123456789,"longitudeE7" : -123456789,"accuracy" : 15,"altitude" : -11,"verticalAccuracy" : 2,"activity" : [ {"timestampMs" : "1513507320258","activity" : [ {"type" : "TILTING","confidence" : 100} ]} ]}, {"timestampMs" : "1513510898735","latitudeE7" : 123456789,"longitudeE7" : -123456789,"accuracy" : 16,"altitude" : -12,"verticalAccuracy" : 2}, {"timestampMs" : "1513510874140","latitudeE7" : 123456789,"longitudeE7" : -123456789,"accuracy" : 19,"altitude" : -12,"verticalAccuracy" : 2,"activity" : [ {"timestampMs" : "1513510874245","activity" : [ {"type" : "STILL","confidence" : 100} ]} ]} ]}
    

    The file has been validated as valid JSON by JSONLint and FreeFormatter.

3 Answers:

Answer 0: (Score: 2)


The obvious choices...

The obvious choice here is a JSON-aware tool that can handle large files quickly. In the following, I'll use jq, which can easily and quickly handle gigabyte-sized files so long as there is sufficient RAM to hold the file in memory, and which can also process very large files even when there is not enough RAM to hold the JSON in memory.

First, let's assume the file consists of an array of JSON objects of the form shown, and that the goal is to extract the two values for each admissible sub-object.

Here is a jq program that will do the job:

.[].activity[]
| .timestampMs as $ts
| .activity[]
| select(.type == "WALKING")
| [$ts, .confidence]

For the given input, this would produce:

["1515564665992",3]

More specifically, assuming the above program is in a file named program.jq and that the input file is input.json, a suitable invocation of jq would be the following:

jq -cf program.jq input.json

It should be easy to modify the jq program given above to handle other cases, e.g. if the JSON schema is more complex than has been assumed above. For example, if there is some irregularity in the schema, try sprinkling in some postfix ?s, e.g.:

.[].activity[]?
| .timestampMs as $ts
| .activity[]?
| select(.type? == "WALKING")
| [$ts, .confidence]
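Since the end goal is to work with the extracted data in Excel/VBA, one option (not part of the answer above; all file names here are assumptions) is to redirect the compact jq output into a file, e.g. jq -cf program.jq input.json > walking.json, and then turn those one-array-per-line records into a CSV with a short Python script:

import csv
import json

# Convert jq -c output (one JSON array per line, e.g. ["1515564665992",3])
# into a two-column CSV that Excel/VBA can open directly.
# "walking.json" and "walking.csv" are assumed file names.
with open("walking.json") as src, open("walking.csv", "w", newline="") as dst:
    writer = csv.writer(dst)
    writer.writerow(["timestampMs", "confidence"])
    for line in src:
        line = line.strip()
        if not line:
            continue
        timestamp, confidence = json.loads(line)
        writer.writerow([timestamp, confidence])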

Answer 1: (Score: 1)

You could try this:

(?s)^.*?\"longitude[^\[]*?\"activity[^\[]*\[[^\]]*?timestampMs\"[^\"\]]*\"(\d+)\"[^\]]*WALKING[^\]]*?confidence\"\s*:\s*(\b\d{1,3}\b)[^\]]*?\].*$

Regex Demo ,,, where I search via the keywords "longitude", "activity", etc. and close in on the target capture values (timestampMs, WALKING, confidence).

Python script

import re

ss = """ copy & paste the file contents' strings (above sample text) in this area """

regx = re.compile(r"(?s)^.*?\"longitude[^\[]*?\"activity[^\[]*\[[^\]]*?timestampMs\"[^\"\]]*\"(\d+)\"[^\]]*WALKING[^\]]*?confidence\"\s*:\s*(\b\d{1,3}\b)[^\]]*?\].*$")
matching = regx.match(ss)

# method 1: using the match() function's capturing groups
timestamp = matching.group(1)
walkingval = matching.group(2)
print("\ntimestamp is %s\nwalking value is %s" % (timestamp, walkingval))

# another method: using the sub() function
print("\n" + regx.sub(r'\1 \2', ss))

Output

timestamp is 1515564665992
walking value is 3

1515564665992 3

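As written, the expression above is anchored with ^ and $ (under (?s)), so matching against the pasted text yields a single timestamp/confidence pair. A possible variation, offered only as a sketch and not part of the answer above, drops the anchors and scans an entire file with re.finditer to collect every WALKING pair:

import re

# Unanchored variant of the expression above: one match per activity block
# that contains a WALKING entry. Group 1 is the 13-digit timestampMs value,
# group 2 is the 1-to-3-digit WALKING confidence.
pattern = re.compile(
    r'"activity"[^\[]*\[[^\]]*?"timestampMs"[^"\]]*"(\d+)"'
    r'[^\]]*"WALKING"[^\]]*?"confidence"\s*:\s*(\d{1,3})'
)

# "history.json" is an assumed name for the exported location-history file.
with open("history.json", encoding="utf-8") as f:
    text = f.read()

pairs = [(m.group(1), m.group(2)) for m in pattern.finditer(text)]
print(pairs)  # list of (timestampMs, confidence) string pairs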

Answer 2: (Score: 1)

Unfortunately, it seems you have chosen a language that doesn't have a high-performance JSON parser available.

With Python you could do:

#!/usr/bin/env python3
import time
import json

def get_history(filename):
    with open(filename) as history_file:
        return json.load(history_file)

def walking_confidence(history):
    for location in history["locations"]:
        if "activity" not in location:
            continue

        for outer_activity in location["activity"]:
            confidence = extract_walking_confidence(outer_activity)
            if confidence:
                timestampMs = int(outer_activity["timestampMs"])
                yield (timestampMs, confidence)

def extract_walking_confidence(outer_activity):
    for inner_activity in outer_activity["activity"]:
        if inner_activity["type"] == "WALKING":
            return inner_activity["confidence"]

if __name__ == "__main__":
    start = time.clock()
    history = get_history("history.json")

    middle = time.clock()
    wcs = list(walking_confidence(history))

    end = time.clock()
    print("load json: " + str(middle - start) + "s")
    print("loop json: " + str(end - middle) + "s")

Printing this out on my 98MB JSON history:


load json: 3.10292s
loop json: 0.338841s

That's not terribly efficient, but certainly not bad.
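For the 40-80MB files in the question, memory rather than speed may become the constraint, since json.load keeps the whole structure in RAM. As a minimal streaming sketch, assuming the third-party ijson package is installed (pip install ijson), the "locations" array can be walked one record at a time without loading the entire file:

import ijson

def walking_confidence_streaming(filename):
    # Iterate over the objects in the top-level "locations" array one at a
    # time, so only a single location record is held in memory at once.
    with open(filename, "rb") as history_file:
        for location in ijson.items(history_file, "locations.item"):
            for outer_activity in location.get("activity", []):
                confidence = extract_walking_confidence(outer_activity)
                if confidence:
                    yield (int(outer_activity["timestampMs"]), confidence)

This reuses extract_walking_confidence from the script above and trades a slower parse for a much smaller memory footprint.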