Filtering JSON documents using pandas

Time: 2017-10-14 19:52:36

Tags: python json pandas

Suppose there is a JSON document: {"glossary":{"GlossDiv":{"GlossList":{"GlossEntry":{"GlossDef":{"GlossSeeAlso":["XML","XLS"]}}}}}}

Say I use pandas.io.json.json_normalize to flatten it into a DataFrame. How can I then check whether the DataFrame has any rows matching a JSON query such as: {"glossary":{"GlossDiv":{"GlossList":{"GlossEntry":{"GlossDef":{"GlossSeeAlso":["XLS"]}}}}}}
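To make the flattening step concrete, here is a minimal sketch of what json_normalize produces for a nested document like the one above. It assumes pandas 1.0 or later, where the function is available as pd.json_normalize; the variable names record and flat are illustrative only.

```python
import pandas as pd

# A nested document shaped like the example above
record = {"glossary": {"GlossDiv": {"GlossList": {"GlossEntry": {
    "GlossDef": {"GlossSeeAlso": ["XML", "XLS"]}}}}}}

# Nested dict keys are joined with "." into a single column name;
# the list itself is not expanded and stays as one cell value.
flat = pd.json_normalize([record])

print(list(flat.columns))
print(flat.iloc[0, 0])
```

Because the list survives as a single cell, matching a query such as ["XLS"] against it comes down to a subset test on that cell rather than a plain equality comparison.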

file1.json:

[{
    "glossary": {
        "title": "example glossary",
        "GlossDiv": {
            "title": "S",
            "GlossList": {
                "GlossEntry": {
                    "ID": "SGML",
                    "SortAs": "SGML",
                    "GlossTerm": "Standard Generalized Markup Language",
                    "Acronym": "SGML",
                    "Abbrev": "ISO 8879:1986",
                    "GlossDef": {
                        "para": "A meta-markup language, used to create markup languages such as DocBook.",
                        "GlossSeeAlso": ["GML", "XML"]
                    },
                    "GlossSee": "markup"
                }
            }
        }
    }
},
{
    "glossary": {
        "title": "example glossary",
        "GlossDiv": {
            "title": "S",
            "GlossList": {
                "GlossEntry": {
                    "ID": "SGML",
                    "SortAs": "SGML",
                    "GlossTerm": "Standard Generalized Markup Language",
                    "Acronym": "SGML",
                    "Abbrev": "ISO 8879:1986",
                    "GlossDef": {
                        "para": "A meta-markup language, used to create markup languages such as DocBook.",
                        "GlossSeeAlso": ["GML", "XML","DSG"]
                    },
                    "GlossSee": "markup"
                }
            }
        }
    }
}]

file2.json:

{"glossary":{"GlossDiv":{"GlossList":{"GlossEntry":{"GlossDef":{"GlossSeeAlso":["DSG"]}}}}}}

Expected output: 1 row.

How would I do this? Assume file1.json has multiple records that must be filtered based on the single JSON record in file2.json.

import json

import pandas as pd
from pandas.io.json import json_normalize  # in pandas >= 1.0, use pd.json_normalize

# Load the records to filter and the single query record
with open('file1.json') as file1:
    records = json.load(file1)
with open('file2.json') as file2:
    filter_record = json.load(file2)

df = json_normalize(records)

# Need to filter df so that only rows satisfying the column values in filter_record remain
# Code:

1 Answer:

Answer 0: (score: 1)

Use:

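import json
import numpy as np
from pandas.io.json import json_normalize

# assumed here: a holds the records from file1.json, b the query record from file2.json
with open('file1.json') as f:
    a = json.load(f)
with open('file2.json') as f:
    b = json.load(f)

df1 = json_normalize(a)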
#print(df1)

df2 = json_normalize(b)
#print(df2)

#keep only the columns of df1 that also appear in df2
df = df1[df2.columns.intersection(df1.columns)]
#print (df)

#convert the list cells to sets (a and b are reused here for the set arrays)
a = np.array([set(x) for x in df.iloc[:, 0].tolist()])
b = np.array([set(x) for x in df2.iloc[:, 0].tolist()])
print (a)
[{'XML', 'GML'} {'XML', 'DSG', 'GML'}]
print (b)
[{'DSG'}]

#test whether the query set is a subset of each row's set (<= on sets means "is subset of")
matches = (b[:, None] <= a)
print (matches)
[[False  True]]

#flattening: take the row of matches for the single query record
any_ = matches[0]

#exclude rows whose value is missing (NaN)
nul_ = df.iloc[:, 0].notnull().values
mask = any_ & nul_
print (mask)
[False  True]
#boolean indexing
df1 = df1[mask]
print (df1)

  glossary.GlossDiv.GlossList.GlossEntry.Abbrev  \
1                                 ISO 8879:1986   

  glossary.GlossDiv.GlossList.GlossEntry.Acronym  \
1                                           SGML   

  glossary.GlossDiv.GlossList.GlossEntry.GlossDef.GlossSeeAlso  \
1                                    [GML, XML, DSG]             

  glossary.GlossDiv.GlossList.GlossEntry.GlossDef.para  \
1  A meta-markup language, used to create markup ...     

  glossary.GlossDiv.GlossList.GlossEntry.GlossSee  \
1                                          markup   

  glossary.GlossDiv.GlossList.GlossEntry.GlossTerm  \
1             Standard Generalized Markup Language   

  glossary.GlossDiv.GlossList.GlossEntry.ID  \
1                                      SGML   

  glossary.GlossDiv.GlossList.GlossEntry.SortAs glossary.GlossDiv.title  \
1                                          SGML                       S   

     glossary.title  
1  example glossary
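
As a variation on the approach above, the following sketch handles a query with several columns and mixes list-valued and scalar-valued fields, using a row-wise check instead of NumPy broadcasting. It is only an illustration under stated assumptions: matches_query is a hypothetical helper (not part of pandas), the records are shortened inline copies of file1.json and file2.json, and pd.json_normalize requires pandas 1.0 or later.

```python
import pandas as pd

def matches_query(row, query):
    """Hypothetical helper: True if the row satisfies every query column."""
    for col, wanted in query.items():
        if col not in row.index:
            return False
        have = row[col]
        if isinstance(wanted, list):
            # List fields match when the query list is a subset of the row's list
            if not isinstance(have, list) or not set(wanted) <= set(have):
                return False
        elif have != wanted:
            # Scalar fields must be equal (a NaN in the row never matches)
            return False
    return True

def entry(see_also):
    # Builds a shortened record with the same nesting as file1.json
    return {"glossary": {"title": "example glossary", "GlossDiv": {"GlossList": {
        "GlossEntry": {"GlossDef": {"GlossSeeAlso": see_also}}}}}}

records = [entry(["GML", "XML"]), entry(["GML", "XML", "DSG"])]
filter_record = {"glossary": {"GlossDiv": {"GlossList": {
    "GlossEntry": {"GlossDef": {"GlossSeeAlso": ["DSG"]}}}}}}

df = pd.json_normalize(records)
# Flatten the query the same way and turn its single row into {column: value}
query = pd.json_normalize(filter_record).iloc[0].to_dict()

mask = df.apply(lambda row: matches_query(row, query), axis=1)
result = df[mask]
print(result.index.tolist())  # [1]
```

The row-wise apply is slower than the broadcasting version on large frames, but it avoids calling set() on missing values and does not depend on the position of the list column.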