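Converting JSON data to a pandas DataFrame

Time: 2018-01-26 04:16:38

Tags: python json pandas nested-lists

I'm using the Python package censusgeocode to geocode street addresses and obtain the corresponding geographic IDs, which can then be used to merge in other census data.

I have a CSV file containing all the street addresses. This code works fine: it loads the packages, reads in the data, and loops over each address with the geocode function: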

#For geocoding: 
import censusgeocode as cg

#For data handling: 
import pandas as pd

addresses = pd.read_csv('addresslist.csv') 
geo_set = []
#just test it on the first two addresses (iloc[0:2] selects rows 0 and 1)
for index, row in addresses.iloc[0:2].iterrows():
    try:
        nextline = cg.address(str(row['residential_address']), city=str(row['mailing_city']), state=str(row['mailing_state']), zipcode=str(row['mailing_zip_code']))
        geo_set.append(nextline)
    except Exception:
        # skip addresses that fail to geocode
        pass

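That's the background; everything above works fine. What I'm struggling with is converting the resulting output into a pandas DataFrame. Here is my code: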

emptydata = pd.DataFrame({"fromAddress":[], "streetName":[], "suffixType":[], "state":[], "city":[], "zip":[]})
for p in geo_set:
    for i in p['addressComponents']:
        new_result = pd.DataFrame({
            "fromAddress":[i['fromAddress']],
            "streetName":[i['streetName']],
            "suffixType":[i['suffixType']],
            "state":[i['state']],
            "city":[i['city']],
            "zip":[i['zip']]
        })
        emptydata = emptydata.append(new_result)

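I've tried changing a million different things and keep getting error messages. Can anyone suggest where my code is going wrong? I'm pretty sure it has to do with how I'm understanding the nested structure. The error I get is: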

TypeError: list indices must be integers or slices, not str

Here is the data I'm trying to turn into a DataFrame:

[[{'addressComponents': {'city': 'BOULDER',
    'fromAddress': '1',
    'preDirection': 'E',
    'preQualifier': '',
    'preType': '',
    'state': 'CO',
    'streetName': 'REVEREND',
    'suffixDirection': '',
    'suffixQualifier': '',
    'suffixType': 'AVE',
    'toAddress': '99',
    'zip': '80211'},
   'coordinates': {'x': -135.98743, 'y': 43.714783},
   'geographies': {'2010 Census Blocks': [{'AREALAND': 21481,
      'AREAWATER': 0,
      'BASENAME': '4003',
      'BLKGRP': '4',
      'BLOCK': '4003',
      'CENTLAT': '+43.7156677',
      'CENTLON': '-135.9868842',
      'COUNTY': '031',
      'FUNCSTAT': 'S',
      'GEOID': '080300028024003',
      'INTPTLAT': '+43.7156677',
      'INTPTLON': '-135.9868842',
      'LSADC': 'BK',
      'LWBLKTYP': 'L',
      'MTFCC': 'G5040',
      'NAME': 'Block 4113',
      'OBJECTID': 6626210,
      'OID': 210403980440495,
      'STATE': '08',
      'SUFFIX': '',
      'TRACT': '002802'}],
    'Census Tracts': [{'status': 'Layer query encountered an error: java.lang.RuntimeException: Failed to return'}],
    'Counties': [{'AREALAND': 397083755,
      'AREAWATER': 4237705,
      'BASENAME': 'Boulder',
      'CENTLAT': '+43.7621497',
      'CENTLON': '-135.8760655',
      'COUNTY': '033',
      'COUNTYCC': 'H6',
      'COUNTYNS': '00198131',
      'FUNCSTAT': 'C',
      'GEOID': '08033',
      'INTPTLAT': '+43.7618502',
      'INTPTLON': '-135.8811054',
      'LSADC': '06',
      'MTFCC': 'G4020',
      'NAME': 'Boulder County',
      'OBJECTID': 625,
      'OID': 27590700234321,
      'STATE': '08'}],
    'States': [{'AREALAND': 268426005696,
      'AREAWATER': 1178507593,
      'BASENAME': 'Colorado',
      'CENTLAT': '+38.9976179',
      'CENTLON': '-105.5478280',
      'DIVISION': '8',
      'FUNCSTAT': 'A',
      'GEOID': '08',
      'INTPTLAT': '+38.9938482',
      'INTPTLON': '-105.5083165',
      'LSADC': '00',
      'MTFCC': 'G4000',
      'NAME': 'Colorado',
      'OBJECTID': 27,
      'OID': 2749086215995,
      'REGION': '4',
      'STATE': '08',
      'STATENS': '01779779',
      'STUSAB': 'CO'}]},
   'matchedAddress': '1 E BAYAUD AVE, DENVER, CO, 80209',
   'tigerLine': {'side': 'L', 'tigerLineId': '177330882'}}],
 [{'addressComponents': {'city': 'DENVER',
    'fromAddress': '1',
    'preDirection': 'E',
    'preQualifier': '',
    'preType': '',
    'state': 'CO',
    'streetName': 'REVEREND',
    'suffixDirection': '',
    'suffixQualifier': '',
    'suffixType': 'AVE',
    'toAddress': '99',
    'zip': '80209'},
   'coordinates': {'x': -135.98743, 'y': 43.714783},
   'geographies': {'2010 Census Blocks': [{'AREALAND': 21481,
      'AREAWATER': 0,
      'BASENAME': '4003',
      'BLKGRP': '4',
      'BLOCK': '4003',
      'CENTLAT': '+43.7156677',
      'CENTLON': '-135.9868842',
      'COUNTY': '033',
      'FUNCSTAT': 'S',
      'GEOID': '080330028024113',
      'INTPTLAT': '+43.7156677',
      'INTPTLON': '-135.9868842',
      'LSADC': 'BK',
      'LWBLKTYP': 'L',
      'MTFCC': 'G5041',
      'NAME': 'Block 4233',
      'OBJECTID': 6626210,
      'OID': 210403980440495,
      'STATE': '08',
      'SUFFIX': '',
      'TRACT': '002802'}],
    'Census Tracts': [{'AREALAND': 886991,
      'AREAWATER': 0,
      'BASENAME': '32.02',
      'CENTLAT': '+43.7177365',
      'CENTLON': '-135.9841763',
      'COUNTY': '031',
      'FUNCSTAT': 'S',
      'GEOID': '08033002802',
      'INTPTLAT': '+43.7177365',
      'INTPTLON': '-135.9841763',
      'LSADC': 'CT',
      'MTFCC': 'G5020',
      'NAME': 'Census Tract 41.02',
      'OBJECTID': 65498,
      'OID': 20790703831619,
      'STATE': '08',
      'TRACT': '002802'}],
    'Counties': [{'AREALAND': 397083755,
      'AREAWATER': 4237705,
      'BASENAME': 'Boulder',
      'CENTLAT': '+43.7621497',
      'CENTLON': '-135.8760655',
      'COUNTY': '033',
      'COUNTYCC': 'H6',
      'COUNTYNS': '00198133',
      'FUNCSTAT': 'C',
      'GEOID': '08033',
      'INTPTLAT': '+43.7618502',
      'INTPTLON': '-135.8811054',
      'LSADC': '06',
      'MTFCC': 'G4020',
      'NAME': 'Boulder County',
      'OBJECTID': 625,
      'OID': 27590700234321,
      'STATE': '08'}],
    'States': [{'AREALAND': 268426005696,
      'AREAWATER': 1178507593,
      'BASENAME': 'Colorado',
      'CENTLAT': '+43.9976179',
      'CENTLON': '-135.5478280',
      'DIVISION': '8',
      'FUNCSTAT': 'A',
      'GEOID': '08',
      'INTPTLAT': '+43.9938482',
      'INTPTLON': '-135.5083165',
      'LSADC': '00',
      'MTFCC': 'G4000',
      'NAME': 'Colorado',
      'OBJECTID': 27,
      'OID': 2749086215995,
      'REGION': '4',
      'STATE': '08',
      'STATENS': '01779779',
      'STUSAB': 'CO'}]},
   'matchedAddress': '1 E REVEREND AVE, BOULDER, CO, 88090',
   'tigerLine': {'side': 'L', 'tigerLineId': '177330882'}}]]

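Addition to the original post

I'm now trying to pull additional variables from a different part of the JSON. They are all in the '2010 Census Blocks' part of the tree. By running this code (adapted from what you shared with me):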

emptydata = pd.DataFrame({"fromAddress":[], "streetName":[], "suffixType":[], "state":[], "city":[], "zip":[], "BASENAME": [], "CENTLAT": [], "COUNTY":[], "GEOID":[], "NAME":[], "BLKGRP":[], "BLOCK":[]})
for p in geo_set:
    for i in p:
        d = i['addressComponents']
        e = i['geographies']
        for w in e:
            g = e['2010 Census Blocks']
            print(g)

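I can print all the extra parts of the tree that I want. But when I try to integrate this into the part that extracts the variables and appends them to my DataFrame, I get the same TypeError as before.

Here is my code: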

emptydata = pd.DataFrame({"fromAddress":[], "streetName":[], "suffixType":[], "state":[], "city":[], "zip":[], "BASENAME": [], "CENTLAT": [], "COUNTY":[], "GEOID":[], "NAME":[], "BLKGRP":[], "BLOCK":[]})
for p in geo_set:
    for i in p:
        d = i['addressComponents']
        e = i['geographies']
        for w in e:
            g = e['2010 Census Blocks']
            new_result = pd.DataFrame({
                "fromAddress":[d['fromAddress']],
                "streetName":[d['streetName']],
                "suffixType":[d['suffixType']],
                "state":[d['state']],
                "city":[d['city']],
                "zip":[d['zip']],
                "BASENAME":[g['BASENAME']],
                "CENTLAT":[g['CENTLAT']], 
                "COUNTY":[g['COUNTY']], 
                "GEOID":[g['GEOID']], 
                "NAME":[g['NAME']], 
                "BLKGRP":[g['BLKGRP']], 
                "BLOCK":[g['BLOCK']] 
            })
            emptydata = emptydata.append(new_result)

2 answers:

Answer 0 (score: 1)

The problem here is the depth of the nesting: your nested for loops never reach the inner layer. Your output is a list of lists of nested dictionaries. When you go one level deep into geo_set and try p['addressComponents'], it fails because p is a list of nested dictionaries, not the dictionary you expected. You need to iterate through p once more to reach the dictionary i, which contains the key 'addressComponents' with all the items you want to retrieve:

emptydata = pd.DataFrame({"fromAddress":[], "streetName":[], "suffixType":[], "state":[], "city":[], "zip":[], "BASENAME": [], "CENTLAT": [], "COUNTY":[], "GEOID":[], "NAME":[], "BLKGRP":[], "BLOCK":[]})
for p in geo_set:
    for i in p:
        add_comp = i['addressComponents']
        # '2010 Census Blocks' holds a list of dicts, so take its first element
        census_block = i['geographies']['2010 Census Blocks'][0]
        new_result = pd.DataFrame({
            "fromAddress":[add_comp['fromAddress']],
            "streetName":[add_comp['streetName']],
            "suffixType":[add_comp['suffixType']],
            "state":[add_comp['state']],
            "city":[add_comp['city']],
            "zip":[add_comp['zip']],
            "BASENAME": [census_block['BASENAME']],
            "CENTLAT": [census_block['CENTLAT']],
            "COUNTY": [census_block['COUNTY']],
            "GEOID": [census_block['GEOID']],
            "NAME": [census_block['NAME']],
            "BLKGRP": [census_block['BLKGRP']],
            "BLOCK": [census_block['BLOCK']]
        })
        # DataFrame.append was removed in pandas 2.0; there, use pd.concat([emptydata, new_result])
        emptydata = emptydata.append(new_result)

Output of emptydata:

  BASENAME BLKGRP BLOCK      CENTLAT COUNTY            GEOID        NAME  \
0     4003      4  4003  +43.7156677    031  080300028024003  Block 4113   
0     4003      4  4003  +43.7156677    033  080330028024113  Block 4233   

      city fromAddress state streetName suffixType    zip  
0  BOULDER           1    CO   REVEREND        AVE  80211  
0   DENVER           1    CO   REVEREND        AVE  80209

For reference, errors like this are easy to debug. The message you received,

TypeError: list indices must be integers or slices, not str

is a good hint that something went wrong with indexing. Since slicing uses the [] syntax, what else uses the same syntax? Dictionary keys, i.e. p['addressComponents']. If you had tried:

for p in geo_set:
    print(p['addressComponents'])

you would have received the same error. At that point you have narrowed down the source of the error and can work your way through the data step by step.

Alternative solution:

If you don't want your code to be so bulky, here is a dictionary-driven approach:

df_dict = {}
df_cols = ["fromAddress", "streetName", "suffixType", "state", "city", "zip", "BASENAME", "CENTLAT", "COUNTY", "GEOID", "NAME", "BLKGRP", "BLOCK"]
for p in geo_set:
    for i in p:
        # collect the wanted address fields, one list per column
        for key, item in i['addressComponents'].items():
            if key in df_cols:
                df_dict.setdefault(key,[]).append(item)
        # collect the wanted census block fields the same way
        for d in i['geographies']['2010 Census Blocks']:
            for key, item in d.items():
                if key in df_cols:
                    df_dict.setdefault(key,[]).append(item)
emptydata = pd.DataFrame.from_dict(df_dict)

The resulting data is the same (apart from the index), and you don't end up creating so many temporary DataFrame objects. The caveat is that the DataFrame setup is now less readable.

Again, keep track of what is a list and what is a dictionary in your data, and iterate accordingly.
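One way to keep track of this is to print the container type at each level before writing any loops. The following is only a minimal sketch: describe is a made-up helper and sample_geo_set is a trimmed stand-in with the same nesting as the data in the question; neither is part of censusgeocode or pandas.

```python
def describe(obj, depth=0, max_depth=3):
    """Return one line per nesting level, naming the container type found there."""
    lines = []
    pad = "  " * depth
    if isinstance(obj, list):
        lines.append(f"{pad}list of {len(obj)} item(s)")
        # Only the first element is inspected, assuming the items share one shape
        if obj and depth < max_depth:
            lines.extend(describe(obj[0], depth + 1, max_depth))
    elif isinstance(obj, dict):
        lines.append(f"{pad}dict with keys {sorted(obj)}")
    else:
        lines.append(f"{pad}{type(obj).__name__}")
    return lines


# Hypothetical, trimmed stand-in for geo_set (same nesting as in the question)
sample_geo_set = [
    [{"addressComponents": {"city": "BOULDER", "zip": "80211"},
      "geographies": {"2010 Census Blocks": [{"GEOID": "080300028024003"}]}}],
]

summary = describe(sample_geo_set)
print("\n".join(summary))
```

Two "list" lines before the first "dict" line mean two levels of iteration (for p in geo_set, then for i in p) are needed before indexing with a string key.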

Answer 1 (score: 0)

You can do this:

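# Each p in geo_set is a list of dicts, so iterate over p itself and
# read 'addressComponents' from each dict i; one dict becomes one row.
emptydata = pd.DataFrame([{
        "fromAddress": i['addressComponents']['fromAddress'],
        "streetName": i['addressComponents']['streetName'],
        "suffixType": i['addressComponents']['suffixType'],
        "state": i['addressComponents']['state'],
        "city": i['addressComponents']['city'],
        "zip": i['addressComponents']['zip']
    } for p in geo_set for i in p])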
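The same idea can be made a bit more tolerant of incomplete results, such as a layer that comes back without the expected keys. This is a rough sketch under assumptions, not a definitive implementation: flatten_results, ADDRESS_KEYS, BLOCK_KEYS and sample_geo_set are names invented here, and it assumes an address with no match shows up as an empty list in geo_set.

```python
ADDRESS_KEYS = ["fromAddress", "streetName", "suffixType", "state", "city", "zip"]
BLOCK_KEYS = ["BASENAME", "CENTLAT", "COUNTY", "GEOID", "NAME", "BLKGRP", "BLOCK"]


def flatten_results(geo_set):
    """Build one flat dict per geocoder match; missing fields become None."""
    rows = []
    for matches in geo_set:        # one list of matches per address
        for match in matches:      # each match is a nested dict
            comps = match.get("addressComponents", {})
            blocks = match.get("geographies", {}).get("2010 Census Blocks", [])
            block = blocks[0] if blocks else {}   # first block, if any
            row = {key: comps.get(key) for key in ADDRESS_KEYS}
            row.update({key: block.get(key) for key in BLOCK_KEYS})
            rows.append(row)
    return rows


# Hypothetical, trimmed stand-in for geo_set
sample_geo_set = [
    [{"addressComponents": {"city": "BOULDER", "fromAddress": "1", "state": "CO",
                            "streetName": "REVEREND", "suffixType": "AVE",
                            "zip": "80211"},
      "geographies": {"2010 Census Blocks": [{"BASENAME": "4003",
                                              "GEOID": "080300028024003"}]}}],
    [],  # assumed shape of an address with no match: contributes no rows
]

rows = flatten_results(sample_geo_set)
print(rows[0]["city"], rows[0]["GEOID"], rows[0]["BLKGRP"])
```

The resulting list of dicts can be passed straight to pd.DataFrame(rows), which builds the frame in one step instead of appending row by row.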