How to extract an unstructured data file into a JSON object

Time: 2019-05-09 07:12:33

Tags: python json string text

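I need some advice here. I have a text file containing information that needs to be extracted and saved as a JSON file. The file is not formally structured, but it is organized in blocks. Please see the sample below.

How can I achieve this? I just don't know where to start. I have the idea of finding Type: Router, but how do I iterate over each block and pick out only the P-2-P link details? Thanks for any advice.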

Type      : Router
  Ls id     : 1.1.1.2
  Adv rtr   : 1.1.1.2  
  Ls age    : 201 
  Len       : 84   
  Link count: 5
   * Link ID: 1.1.1.2    
     Data   : 255.255.255.255 
     Link Type: StubNet      
     Metric : 1 
     Priority : Medium
   * Link ID: 1.1.1.4    
     Data   : 192.168.100.34  
     Link Type: P-2-P        
     Metric : 1
   * Link ID: 192.168.100.33  
     Data   : 255.255.255.255 
     Link Type: StubNet      
     Metric : 1 
     Priority : Medium
   * Link ID: 1.1.1.1    
     Data   : 192.168.100.53  
     Link Type: P-2-P        
     Metric : 1
   * Link ID: 192.168.100.54  
     Data   : 255.255.255.255 
     Link Type: StubNet      
     Metric : 1 
     Priority : Medium

  Type      : Router
  Ls id     : 1.1.1.1
  Adv rtr   : 1.1.1.1  
  Ls age    : 1699 
  Len       : 96 
  Options   :  ASBR  E  
  seq#      : 80008d72 
  chksum    : 0x16fc
  Link count: 6
   * Link ID: 1.1.1.1    
     Data   : 255.255.255.255 
     Link Type: StubNet      
     Metric : 1 
     Priority : Medium
   * Link ID: 1.1.1.1    
     Data   : 255.255.255.255 
     Link Type: StubNet      
     Metric : 12 
     Priority : Medium
   * Link ID: 1.1.1.3    
     Data   : 192.168.100.26  
     Link Type: P-2-P        
     Metric : 10
   * Link ID: 192.168.100.25  
     Data   : 255.255.255.255 
     Link Type: StubNet      
     Metric : 10 
     Priority : Medium
   * Link ID: 1.1.1.2    
     Data   : 192.168.100.54  
     Link Type: P-2-P        
     Metric : 10
   * Link ID: 192.168.100.53  
     Data   : 255.255.255.255 
     Link Type: StubNet      
     Metric : 10 
     Priority : Medium

Extract only the blocks that have Type: Router. Within each such block, the information to capture is:

(1)Ls id  : 1.1.1.2
and, under Link count, only the link blocks whose Link Type is P-2-P:
(a)Link ID: 1.1.1.4
(b)Data   : 192.168.100.34
(c)Link Type: P-2-P
(d)Metric : 1

(a)Link ID: 1.1.1.1
(b)Data   : 192.168.100.53
(c)Link Type: P-2-P
(d)Metric : 1

Then, for the next Type: Router block, capture:
(2)Ls id  : 1.1.1.1
and, under Link count, only the link blocks whose Link Type is P-2-P:
(a)Link ID: 1.1.1.3   
(b)Data   : 192.168.100.26 
(c)Link Type: P-2-P 
(d)Metric : 10

(a)Link ID: 1.1.1.2    
(b)Data   : 192.168.100.54  
(c)Link Type: P-2-P    
(d)Metric : 10

**There is another Link Type (StubNet), but I am only interested in capturing the blocks that have Link Type: P-2-P.**

The expected JSON is as follows:

{
  "oppf": [
    {
      "Sid": "1.1.1.2",
      "Did": "1.1.1.4",
      "Sport": "192.168.100.34",
      "Netype": "P-2-P",
      "Metric": "1"
    },
    {
      "Sid": "1.1.1.2",
      "Did": "1.1.1.1",
      "Sport": "192.168.100.53",
      "Netype": "P-2-P",
      "Metric": "1"
    },
    {
      "Sid": "1.1.1.1",
      "Did": "1.1.1.3",
      "Sport": "192.168.100.26",
      "Netype": "P-2-P",
      "Metric": "10"
    },
    {
      "Sid": "1.1.1.1",
      "Did": "1.1.1.2",
      "Sport": "192.168.100.54",
      "Netype": "P-2-P",
      "Metric": "10"
    }
  ]
}

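2 answers:

Answer 0 (score: 1)

To me, the file is reasonably well structured. It uses different indentation levels to identify sub-items, an asterisk (*) marks the start of a new dictionary, and an empty line marks the start of a new Router block. Each line also contains a colon (:), which can be used to split it into a key and a value.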

data = '''  Type      : Router
  Ls id     : 1.1.1.2
  Adv rtr   : 1.1.1.2  
  Ls age    : 201 
  Len       : 84   
  Link count: 5
   * Link ID: 1.1.1.2    
     Data   : 255.255.255.255 
     Link Type: StubNet      
     Metric : 1 
     Priority : Medium
   * Link ID: 1.1.1.4    
     Data   : 192.168.100.34  
     Link Type: P-2-P        
     Metric : 1
   * Link ID: 192.168.100.33  
     Data   : 255.255.255.255 
     Link Type: StubNet      
     Metric : 1 
     Priority : Medium
   * Link ID: 1.1.1.1    
     Data   : 192.168.100.53  
     Link Type: P-2-P        
     Metric : 1
   * Link ID: 192.168.100.54  
     Data   : 255.255.255.255 
     Link Type: StubNet      
     Metric : 1 
     Priority : Medium

  Type      : Router
  Ls id     : 1.1.1.1
  Adv rtr   : 1.1.1.1  
  Ls age    : 1699 
  Len       : 96 
  Options   :  ASBR  E  
  seq#      : 80008d72 
  chksum    : 0x16fc
  Link count: 6
   * Link ID: 1.1.1.1    
     Data   : 255.255.255.255 
     Link Type: StubNet      
     Metric : 1 
     Priority : Medium
   * Link ID: 1.1.1.1    
     Data   : 255.255.255.255 
     Link Type: StubNet      
     Metric : 12 
     Priority : Medium
   * Link ID: 1.1.1.3    
     Data   : 192.168.100.26  
     Link Type: P-2-P        
     Metric : 10
   * Link ID: 192.168.100.25  
     Data   : 255.255.255.255 
     Link Type: StubNet      
     Metric : 10 
     Priority : Medium
   * Link ID: 1.1.1.2    
     Data   : 192.168.100.54  
     Link Type: P-2-P        
     Metric : 10
   * Link ID: 192.168.100.53  
     Data   : 255.255.255.255 
     Link Type: StubNet      
     Metric : 10 
     Priority : Medium'''

results = []
group = {'items': []}
subgroup = None

for line in data.split('\n'):
    if not line.strip():
        # A blank line ends the current Router block: flush the pending link first
        if subgroup:
            group['items'].append(subgroup)
        results.append(group)
        group = {'items': []}
        subgroup = None
    elif not line.startswith('   '):
        # Router-level field (indented by two spaces)
        key, val = line.split(':', 1)
        group[key.strip()] = val.strip()
    else:
        # Link-level field; '*' marks the start of a new link
        if '*' in line:
            if subgroup:
                group['items'].append(subgroup)
            subgroup = {}
        key, val = line.split(':', 1)
        key = key.replace('*', '').strip()
        subgroup[key] = val.strip()

# Flush the last link and the last Router block
if subgroup:
    group['items'].append(subgroup)
results.append(group)

print(results)

To display it nicely:

import json
print(json.dumps(results, indent=2))

Result:

[
  {
    "items": [
      {
        "Link ID": "1.1.1.2",
        "Data": "255.255.255.255",
        "Link Type": "StubNet",
        "Metric": "1",
        "Priority": "Medium"
      },
      {
        "Link ID": "1.1.1.4",
        "Data": "192.168.100.34",
        "Link Type": "P-2-P",
        "Metric": "1"
      },
      {
        "Link ID": "192.168.100.33",
        "Data": "255.255.255.255",
        "Link Type": "StubNet",
        "Metric": "1",
        "Priority": "Medium"
      },
      {
        "Link ID": "1.1.1.1",
        "Data": "192.168.100.53",
        "Link Type": "P-2-P",
        "Metric": "1"
      },
      {
        "Link ID": "192.168.100.54",
        "Data": "255.255.255.255",
        "Link Type": "StubNet",
        "Metric": "1",
        "Priority": "Medium"
      }
    ],
    "Type": "Router",
    "Ls id": "1.1.1.2",
    "Adv rtr": "1.1.1.2",
    "Ls age": "201",
    "Len": "84",
    "Link count": "5"
  },
  {
    "items": [
      {
        "Link ID": "1.1.1.1",
        "Data": "255.255.255.255",
        "Link Type": "StubNet",
        "Metric": "1",
        "Priority": "Medium"
      },
      {
        "Link ID": "1.1.1.1",
        "Data": "255.255.255.255",
        "Link Type": "StubNet",
        "Metric": "12",
        "Priority": "Medium"
      },
      {
        "Link ID": "1.1.1.3",
        "Data": "192.168.100.26",
        "Link Type": "P-2-P",
        "Metric": "10"
      },
      {
        "Link ID": "192.168.100.25",
        "Data": "255.255.255.255",
        "Link Type": "StubNet",
        "Metric": "10",
        "Priority": "Medium"
      },
      {
        "Link ID": "1.1.1.2",
        "Data": "192.168.100.54",
        "Link Type": "P-2-P",
        "Metric": "10"
      },
      {
        "Link ID": "192.168.100.53",
        "Data": "255.255.255.255",
        "Link Type": "StubNet",
        "Metric": "10",
        "Priority": "Medium"
      }
    ],
    "Type": "Router",
    "Ls id": "1.1.1.1",
    "Adv rtr": "1.1.1.1",
    "Ls age": "1699",
    "Len": "96",
    "Options": "ASBR  E",
    "seq#": "80008d72",
    "chksum": "0x16fc",
    "Link count": "6"
  }
]

Now that you have a Python structure, you can pull out whatever you need from it.
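
As a minimal sketch of that last step, assuming results has the shape printed above, the P-2-P links could be filtered and renamed to the keys the question asks for. The helper name to_oppf is hypothetical, and the small results list below is a trimmed sample so that the snippet runs on its own:

```python
import json

# Trimmed sample with the same shape as `results` above
# (assumption: the real list holds more links and more router fields).
results = [
    {
        "items": [
            {"Link ID": "1.1.1.2", "Data": "255.255.255.255",
             "Link Type": "StubNet", "Metric": "1", "Priority": "Medium"},
            {"Link ID": "1.1.1.4", "Data": "192.168.100.34",
             "Link Type": "P-2-P", "Metric": "1"},
        ],
        "Type": "Router",
        "Ls id": "1.1.1.2",
    },
]


def to_oppf(groups):
    """Hypothetical helper: keep only the P-2-P links of Router blocks."""
    records = []
    for group in groups:
        if group.get("Type") != "Router":
            continue
        for item in group["items"]:
            if item.get("Link Type") == "P-2-P":
                # Map the parsed field names to the keys used in the question
                records.append({
                    "Sid": group["Ls id"],
                    "Did": item["Link ID"],
                    "Sport": item["Data"],
                    "Netype": item["Link Type"],
                    "Metric": item["Metric"],
                })
    return {"oppf": records}


oppf = to_oppf(results)
print(json.dumps(oppf, indent=2))
```

Keeping the parsing and the filtering separate means the same results list can be reused if other link types are needed later.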

Answer 1 (score: 1)

To get only the P-2-P type:

data = "..."

import json

result = {}
links = []
for block in data.split("\n\n"):
    if block:
        # parts[0] holds the router-level fields; each later part is one link
        parts = block.split("*")
        for x in parts[0].split("\n"):
            if x and "Ls id" in x:
                ls_id, ip = x.split(": ")
                ls_id = ls_id.strip()
                ip = ip.strip()
        for y in parts[1:]:
            if y and "P-2-P" in y:
                # Start each record with the router's "Ls id"
                temp = {ls_id: ip}
                for item in y.split("\n"):
                    try:
                        key, value = item.split(": ")
                        temp[key.strip()] = value.strip()
                    except ValueError:
                        # Skip lines that are not "key: value" pairs
                        pass
                links.append(temp)
result["oppf"] = links
print(json.dumps(result, indent=2))
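
The records built this way keep the original field names (Ls id, Link ID, Data, and so on) rather than the Sid/Did/Sport/Netype keys from the question, and they are only printed. A sketch of renaming the keys and saving the result to a file follows. The KEY_MAP table, the rename_keys helper, and the output file name oppf.json are illustrative assumptions, and the one-record result is a sample so that the snippet runs standalone:

```python
import json

# Assumed mapping from the parsed field names to the keys in the question
KEY_MAP = {
    "Ls id": "Sid",
    "Link ID": "Did",
    "Data": "Sport",
    "Link Type": "Netype",
    "Metric": "Metric",
}

# Sample with the same shape as `result` above
result = {
    "oppf": [
        {"Ls id": "1.1.1.2", "Link ID": "1.1.1.4",
         "Data": "192.168.100.34", "Link Type": "P-2-P", "Metric": "1"},
    ]
}


def rename_keys(record, key_map=KEY_MAP):
    """Hypothetical helper: rename known keys and drop any others."""
    return {key_map[k]: v for k, v in record.items() if k in key_map}


renamed = {"oppf": [rename_keys(r) for r in result["oppf"]]}

# Write the JSON document to disk (the file name is an assumption)
with open("oppf.json", "w") as fh:
    json.dump(renamed, fh, indent=2)
```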