Suppose I have a JSON document: {"glossary": {"GlossDiv": {"GlossList": {"GlossEntry": {"GlossDef": {"GlossSeeAlso": ["XML", "XLS"]}}}}}}
If I flatten it into a DataFrame with pandas.io.json.json_normalize, how can I then check whether the DataFrame has any rows matching a JSON query such as: {"glossary": {"GlossDiv": {"GlossList": {"GlossEntry": {"GlossDef": {"GlossSeeAlso": ["XLS"]}}}}}}
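To make the flattening step concrete, here is a small hand-written sketch of what it produces for the query above. The helper `flatten` is hypothetical and only mimics the dot-joined column names that appear after normalizing; it is not pandas' implementation.

```python
def flatten(obj, prefix=""):
    """Flatten nested dicts into one dict with dot-joined keys.

    Lists are left untouched and stored as values.
    """
    flat = {}
    for key, value in obj.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # descend into nested objects, extending the dotted path
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


query = {"glossary": {"GlossDiv": {"GlossList": {"GlossEntry": {
    "GlossDef": {"GlossSeeAlso": ["XML", "XLS"]}}}}}}

flat_query = flatten(query)
print(flat_query)
# {'glossary.GlossDiv.GlossList.GlossEntry.GlossDef.GlossSeeAlso': ['XML', 'XLS']}
```

So the whole query collapses to a single column whose value is a list, which is why the matching problem becomes a list-containment test on that column.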
file1.json:
[{
"glossary": {
"title": "example glossary",
"GlossDiv": {
"title": "S",
"GlossList": {
"GlossEntry": {
"ID": "SGML",
"SortAs": "SGML",
"GlossTerm": "Standard Generalized Markup Language",
"Acronym": "SGML",
"Abbrev": "ISO 8879:1986",
"GlossDef": {
"para": "A meta-markup language, used to create markup languages such as DocBook.",
"GlossSeeAlso": ["GML", "XML"]
},
"GlossSee": "markup"
}
}
}
}
},
{
"glossary": {
"title": "example glossary",
"GlossDiv": {
"title": "S",
"GlossList": {
"GlossEntry": {
"ID": "SGML",
"SortAs": "SGML",
"GlossTerm": "Standard Generalized Markup Language",
"Acronym": "SGML",
"Abbrev": "ISO 8879:1986",
"GlossDef": {
"para": "A meta-markup language, used to create markup languages such as DocBook.",
"GlossSeeAlso": ["GML", "XML","DSG"]
},
"GlossSee": "markup"
}
}
}
}
}]
file2.json:
{"glossary": {"GlossDiv": {"GlossList": {"GlossEntry": {"GlossDef": {"GlossSeeAlso": ["DSG"]}}}}}}
Expected output: 1 row.
How would I do this? Assume file1.json contains multiple records that must be filtered based on the single JSON record in file2.json.
import json
import pandas as pd
from pandas.io.json import json_normalize

# Load the records to filter and the single filter record
with open('file1.json') as file1:
    records = json.load(file1)
with open('file2.json') as file2:
    filter_record = json.load(file2)

df = json_normalize(records)
# Need to filter df so that only the rows satisfying the column values in filter_record remain
# Code:
Answer 0 (score: 1)
Use:
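import numpy as np

# records and filter_record are the parsed contents of file1.json and file2.json
df1 = json_normalize(records)
#print(df1)
df2 = json_normalize(filter_record)
#print(df2)
# keep only the columns of df1 that also appear in df2
df = df1[df2.columns.intersection(df1.columns)]
#print (df)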
# convert the list values of the first shared column to sets
a = np.array([set(x) for x in df.iloc[:, 0].tolist()])
b = np.array([set(x) for x in df2.iloc[:, 0].tolist()])
print (a)
[{'XML', 'GML'} {'XML', 'DSG', 'GML'}]
print (b)
[{'DSG'}]
# test for a match: <= on sets checks whether the filter set is a subset of each row's set
matches = (b[:, None] <= a)
print (matches)
[[False True]]
# flatten: there is a single filter record, so take its row of results
any_ = matches[0]
# test that the values are not NaN
nul_ = df.iloc[:, 0].notnull().values
mask = any_ & nul_
print (mask)
[False True]
#boolean indexing
df1 = df1[mask]
print (df1)
glossary.GlossDiv.GlossList.GlossEntry.Abbrev \
1 ISO 8879:1986
glossary.GlossDiv.GlossList.GlossEntry.Acronym \
1 SGML
glossary.GlossDiv.GlossList.GlossEntry.GlossDef.GlossSeeAlso \
1 [GML, XML, DSG]
glossary.GlossDiv.GlossList.GlossEntry.GlossDef.para \
1 A meta-markup language, used to create markup ...
glossary.GlossDiv.GlossList.GlossEntry.GlossSee \
1 markup
glossary.GlossDiv.GlossList.GlossEntry.GlossTerm \
1 Standard Generalized Markup Language
glossary.GlossDiv.GlossList.GlossEntry.ID \
1 SGML
glossary.GlossDiv.GlossList.GlossEntry.SortAs glossary.GlossDiv.title \
1 SGML S
glossary.title
1 example glossary
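The code above compares only the first shared column (`iloc[:, 0]`). If the filter record contains several fields, the same idea can be applied column by column. The following is a minimal sketch under some assumptions, not a definitive implementation: the function name `filter_by_query` is made up for illustration, list elements are assumed to be hashable, list fields are matched by subset, all other fields by equality, and `pd.json_normalize` assumes pandas 1.0 or later.

```python
import pandas as pd


def filter_by_query(records, query):
    """Keep the flattened rows that satisfy every field of the flattened query."""
    df = pd.json_normalize(records)
    # the single query record, flattened to a Series: column name -> wanted value
    wanted_by_col = pd.json_normalize(query).iloc[0]

    mask = pd.Series(True, index=df.index)
    for col, wanted in wanted_by_col.items():
        if col not in df.columns:
            # no record has this field at all, so nothing can match
            return df.iloc[0:0]
        if isinstance(wanted, list):
            wanted_set = set(wanted)
            # subset test; non-list cells (e.g. NaN) never match
            hit = df[col].apply(
                lambda v: isinstance(v, list) and wanted_set <= set(v)
            )
        else:
            hit = df[col] == wanted
        mask = mask & hit
    return df[mask]


def make_record(see_also):
    # hypothetical helper: builds a trimmed-down record shaped like file1.json
    return {"glossary": {"title": "example glossary", "GlossDiv": {"GlossList": {
        "GlossEntry": {"GlossDef": {"GlossSeeAlso": see_also}}}}}}


records = [make_record(["GML", "XML"]), make_record(["GML", "XML", "DSG"])]
query = {"glossary": {"GlossDiv": {"GlossList": {"GlossEntry": {
    "GlossDef": {"GlossSeeAlso": ["DSG"]}}}}}}

result = filter_by_query(records, query)
print(result.index.tolist())  # [1]
```

Only the second record contains "DSG", so one row remains, which matches the expected output in the question.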