我正在处理具有三种不同口味的json文件。
主要区别在于dmaKey
和dmaDescription
中的某些字段存在于其中一列的字典的外部或内部,或者根本不存在。
例如,将我的文件从json转换为pandas df时,此数据帧的输出正确。
file_2 = [{'weekStartDate': '2019-09-02',
'dmaKey': '803',
'dmaDescription': 'Los Angeles',
'scxRequiredFields': {'dmaPopulation': 5476830,
'systemPopulation': 3810053,
'strataMarketName': 'Los Angeles'}},
{'weekStartDate': '2019-09-16',
'dmaKey': '803',
'dmaDescription': 'Los Angeles',
'scxRequiredFields': {'dmaPopulation': 5476830,
'systemPopulation': 3810053,
'strataMarketName': 'Los Angeles'}}]
df2 = pd.DataFrame(file_2)
df2
我的两个问题是如何检查dmaKey
和dmaDescription
是否与上面相同的文件结构中不存在,如果是这种情况,我该如何匹配上面的文件输出,并从dmaKey
字典列中抓取dmaDescription
和scxRequiredFields
来创建自己的列?
file_1 =[{'weekStartDate': '2020-08-17',
'scxRequiredFields': {'dmaPopulation': 771210,
'systemPopulation': 10407,
'strataMarketName': 'Austin',
'dmaKey': '635',
'dmaDescription': 'Austin',
'eclipseRegionName': 'TEXAS'}},
{'weekStartDate': '2020-08-10',
'scxRequiredFields': {'dmaPopulation': 771210,
'systemPopulation': 10407,
'strataMarketName': 'Austin',
'dmaKey': '635',
'dmaDescription': 'Austin',
'eclipseRegionName': 'TEXAS'}}]
df1 = pd.DataFrame(file_1)
df1
有时候dmaKey
和dmaDescription
也完全丢失,所以这是我需要考虑的第三种情况。
file_3 =[{'weekStartDate': '2020-09-17',
'scxRequiredFields': {'dmaPopulation': 7712105432,
'systemPopulation': 10407123,
'strataMarketName': 'Austin',
'eclipseRegionName': 'TEXAS'}},
{'weekStartDate': '2020-09-10',
'scxRequiredFields': {'dmaPopulation': 7712101234,
'systemPopulation': 10407123,
'strataMarketName': 'Austin',
'eclipseRegionName': 'TEXAS'}}]
df3 = pd.DataFrame(file_3)
df3
答案 0 :(得分:1)
我认为这是在创建数据框之前处理数据的问题。这是一些执行此操作的代码:
def populateValues( json_file, keys ):
for row in json_file:
for k in ('dmaKey','dmaDescription'):
if k not in row:
if 'scxRequiredFields' in row:
row[k] = row['scxRequiredFields'].get( k , defaults(row,k) )
else:
row[k] = defaults(row,k)
这是调用该方法的方式(无论如何,这是我的测试代码):
keys = ('dmaKey','dmaDescription')
for test in ( file_2, file_3, file_1 ):
populateValues( test, keys )
json.dump( test, sys.stdout, indent = 2 )
print()
print()
调用populateValues()
的结果是对数据结构test
进行了处理,以便所有丢失的键都填充在顶级字典中。
如果字典或“ sceRequiredFields”子类中的值完全丢失,则您没有指定该怎么做。相反,我只是调用了一个名为defaults(row,k)
的函数,以便该函数可以根据行中的任何内容为给定键k
指定一个默认值。这是一个简单的函数,仅将键名与“默认”连接起来:
def defaults( r, k ) :
return f"Default: {k}"
这是测试循环的输出(我添加了一些注释,以便您可以看到不同的输入文件结果):
# file_2 - no fixes required
[
{
"weekStartDate": "2019-09-02",
"dmaKey": "803",
"dmaDescription": "Los Angeles",
"scxRequiredFields": {
"dmaPopulation": 5476830,
"systemPopulation": 3810053,
"strataMarketName": "Los Angeles"
}
},
{
"weekStartDate": "2019-09-16",
"dmaKey": "803",
"dmaDescription": "Los Angeles",
"scxRequiredFields": {
"dmaPopulation": 5476830,
"systemPopulation": 3810053,
"strataMarketName": "Los Angeles"
}
}
]
# file_3 - only dumb default values get inserted
[
{
"weekStartDate": "2020-09-17",
"scxRequiredFields": {
"dmaPopulation": 7712105432,
"systemPopulation": 10407123,
"strataMarketName": "Austin",
"eclipseRegionName": "TEXAS"
},
"dmaKey": "Default: dmaKey",
"dmaDescription": "Default: dmaDescription"
},
{
"weekStartDate": "2020-09-10",
"scxRequiredFields": {
"dmaPopulation": 7712101234,
"systemPopulation": 10407123,
"strataMarketName": "Austin",
"eclipseRegionName": "TEXAS"
},
"dmaKey": "Default: dmaKey",
"dmaDescription": "Default: dmaDescription"
}
]
# file_1 - Values are taken from scxRequiredFields
[
{
"weekStartDate": "2020-08-17",
"scxRequiredFields": {
"dmaPopulation": 771210,
"systemPopulation": 10407,
"strataMarketName": "Austin",
"dmaKey": "635",
"dmaDescription": "Austin",
"eclipseRegionName": "TEXAS"
},
"dmaKey": "635",
"dmaDescription": "Austin"
},
{
"weekStartDate": "2020-08-10",
"scxRequiredFields": {
"dmaPopulation": 771210,
"systemPopulation": 10407,
"strataMarketName": "Austin",
"dmaKey": "635",
"dmaDescription": "Austin",
"eclipseRegionName": "TEXAS"
},
"dmaKey": "635",
"dmaDescription": "Austin"
}
]