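I'm using the Python package censusgeocode to geocode street addresses and get the corresponding geographic IDs, which I can then use to merge in other census data.
I have a CSV file containing all the street addresses. The code below works fine: it loads the packages, reads in the data, and loops through each address with the geocode function: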
#For geocoding:
import censusgeocode as cg
#For data handling:
import pandas as pd
addresses = pd.read_csv('addresslist.csv')
geo_set = []
#just test it on the first two addresses (iloc[0:2] selects rows 0 and 1)
for index, row in addresses.iloc[0:2].iterrows():
    try:
        nextline = cg.address(str(row['residential_address']), city=str(row['mailing_city']), state=str(row['mailing_state']), zipcode=str(row['mailing_zip_code']))
        nextline
        geo_set.append(nextline)
    except:
        pass
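That's the background; everything above works fine. What I'm struggling with is converting the resulting output into a pandas DataFrame. Here is my code: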
emptydata = pd.DataFrame({"fromAddress":[], "streetName":[], "suffixType":[], "state":[], "city":[], "zip":[]})
for p in geo_set:
    for i in p['addressComponents']:
        new_result = pd.DataFrame({
            "fromAddress":[i['fromAddress']],
            "streetName":[i['streetName']],
            "suffixType":[i['suffixType']],
            "state":[i['state']],
            "city":[i['city']],
            "zip":[i['zip']]
        })
        emptydata = emptydata.append(new_result)
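I've tried changing a million different things and keep getting error messages. Can anyone suggest where my code is going wrong? I'm pretty sure it has to do with how I understand the nested structure. The error I get is: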
TypeError: list indices must be integers or slices, not str
Here is the data I'm trying to turn into a DataFrame:
[[{'addressComponents': {'city': 'BOULDER',
'fromAddress': '1',
'preDirection': 'E',
'preQualifier': '',
'preType': '',
'state': 'CO',
'streetName': 'REVEREND',
'suffixDirection': '',
'suffixQualifier': '',
'suffixType': 'AVE',
'toAddress': '99',
'zip': '80211'},
'coordinates': {'x': -135.98743, 'y': 43.714783},
'geographies': {'2010 Census Blocks': [{'AREALAND': 21481,
'AREAWATER': 0,
'BASENAME': '4003',
'BLKGRP': '4',
'BLOCK': '4003',
'CENTLAT': '+43.7156677',
'CENTLON': '-135.9868842',
'COUNTY': '031',
'FUNCSTAT': 'S',
'GEOID': '080300028024003',
'INTPTLAT': '+43.7156677',
'INTPTLON': '-135.9868842',
'LSADC': 'BK',
'LWBLKTYP': 'L',
'MTFCC': 'G5040',
'NAME': 'Block 4113',
'OBJECTID': 6626210,
'OID': 210403980440495,
'STATE': '08',
'SUFFIX': '',
'TRACT': '002802'}],
'Census Tracts': [{'status': 'Layer query encountered an error: java.lang.RuntimeException: Failed to return'}],
'Counties': [{'AREALAND': 397083755,
'AREAWATER': 4237705,
'BASENAME': 'Boulder',
'CENTLAT': '+43.7621497',
'CENTLON': '-135.8760655',
'COUNTY': '033',
'COUNTYCC': 'H6',
'COUNTYNS': '00198131',
'FUNCSTAT': 'C',
'GEOID': '08033',
'INTPTLAT': '+43.7618502',
'INTPTLON': '-135.8811054',
'LSADC': '06',
'MTFCC': 'G4020',
'NAME': 'Boulder County',
'OBJECTID': 625,
'OID': 27590700234321,
'STATE': '08'}],
'States': [{'AREALAND': 268426005696,
'AREAWATER': 1178507593,
'BASENAME': 'Colorado',
'CENTLAT': '+38.9976179',
'CENTLON': '-105.5478280',
'DIVISION': '8',
'FUNCSTAT': 'A',
'GEOID': '08',
'INTPTLAT': '+38.9938482',
'INTPTLON': '-105.5083165',
'LSADC': '00',
'MTFCC': 'G4000',
'NAME': 'Colorado',
'OBJECTID': 27,
'OID': 2749086215995,
'REGION': '4',
'STATE': '08',
'STATENS': '01779779',
'STUSAB': 'CO'}]},
'matchedAddress': '1 E BAYAUD AVE, DENVER, CO, 80209',
'tigerLine': {'side': 'L', 'tigerLineId': '177330882'}}],
[{'addressComponents': {'city': 'DENVER',
'fromAddress': '1',
'preDirection': 'E',
'preQualifier': '',
'preType': '',
'state': 'CO',
'streetName': 'REVEREND',
'suffixDirection': '',
'suffixQualifier': '',
'suffixType': 'AVE',
'toAddress': '99',
'zip': '80209'},
'coordinates': {'x': -135.98743, 'y': 43.714783},
'geographies': {'2010 Census Blocks': [{'AREALAND': 21481,
'AREAWATER': 0,
'BASENAME': '4003',
'BLKGRP': '4',
'BLOCK': '4003',
'CENTLAT': '+43.7156677',
'CENTLON': '-135.9868842',
'COUNTY': '033',
'FUNCSTAT': 'S',
'GEOID': '080330028024113',
'INTPTLAT': '+43.7156677',
'INTPTLON': '-135.9868842',
'LSADC': 'BK',
'LWBLKTYP': 'L',
'MTFCC': 'G5041',
'NAME': 'Block 4233',
'OBJECTID': 6626210,
'OID': 210403980440495,
'STATE': '08',
'SUFFIX': '',
'TRACT': '002802'}],
'Census Tracts': [{'AREALAND': 886991,
'AREAWATER': 0,
'BASENAME': '32.02',
'CENTLAT': '+43.7177365',
'CENTLON': '-135.9841763',
'COUNTY': '031',
'FUNCSTAT': 'S',
'GEOID': '08033002802',
'INTPTLAT': '+43.7177365',
'INTPTLON': '-135.9841763',
'LSADC': 'CT',
'MTFCC': 'G5020',
'NAME': 'Census Tract 41.02',
'OBJECTID': 65498,
'OID': 20790703831619,
'STATE': '08',
'TRACT': '002802'}],
'Counties': [{'AREALAND': 397083755,
'AREAWATER': 4237705,
'BASENAME': 'Boulder',
'CENTLAT': '+43.7621497',
'CENTLON': '-135.8760655',
'COUNTY': '033',
'COUNTYCC': 'H6',
'COUNTYNS': '00198133',
'FUNCSTAT': 'C',
'GEOID': '08033',
'INTPTLAT': '+43.7618502',
'INTPTLON': '-135.8811054',
'LSADC': '06',
'MTFCC': 'G4020',
'NAME': 'Boulder County',
'OBJECTID': 625,
'OID': 27590700234321,
'STATE': '08'}],
'States': [{'AREALAND': 268426005696,
'AREAWATER': 1178507593,
'BASENAME': 'Colorado',
'CENTLAT': '+43.9976179',
'CENTLON': '-135.5478280',
'DIVISION': '8',
'FUNCSTAT': 'A',
'GEOID': '08',
'INTPTLAT': '+43.9938482',
'INTPTLON': '-135.5083165',
'LSADC': '00',
'MTFCC': 'G4000',
'NAME': 'Colorado',
'OBJECTID': 27,
'OID': 2749086215995,
'REGION': '4',
'STATE': '08',
'STATENS': '01779779',
'STUSAB': 'CO'}]},
'matchedAddress': '1 E REVEREND AVE, BOULDER, CO, 88090',
'tigerLine': {'side': 'L', 'tigerLineId': '177330882'}}]]
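Addition to the original post
I'm now trying to extract more variables from a different part of the JSON. They are all in the '2010 Census Blocks' part of the tree. By running this code (adapted from what you shared with me):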
emptydata = pd.DataFrame({"fromAddress":[], "streetName":[], "suffixType":[], "state":[], "city":[], "zip":[], "BASENAME": [], "CENTLAT": [], "COUNTY":[], "GEOID":[], "NAME":[], "BLKGRP":[], "BLOCK":[]})
for p in geo_set:
    for i in p:
        d = i['addressComponents']
        e = i['geographies']
        for w in e:
            g = e['2010 Census Blocks']
            print(g)
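I can print all the extra parts of the tree that I want. But when I try to integrate this into the part that extracts the variables and appends them to my DataFrame, I get the same TypeError message as before.
Here is my code: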
emptydata = pd.DataFrame({"fromAddress":[], "streetName":[], "suffixType":[], "state":[], "city":[], "zip":[], "BASENAME": [], "CENTLAT": [], "COUNTY":[], "GEOID":[], "NAME":[], "BLKGRP":[], "BLOCK":[]})
for p in geo_set:
    for i in p:
        d = i['addressComponents']
        e = i['geographies']
        for w in e:
            g = e['2010 Census Blocks']
            new_result = pd.DataFrame({
                "fromAddress":[d['fromAddress']],
                "streetName":[d['streetName']],
                "suffixType":[d['suffixType']],
                "state":[d['state']],
                "city":[d['city']],
                "zip":[d['zip']],
                "BASENAME":[g['BASENAME']],
                "CENTLAT":[g['CENTLAT']],
                "COUNTY":[g['COUNTY']],
                "GEOID":[g['GEOID']],
                "NAME":[g['NAME']],
                "BLKGRP":[g['BLKGRP']],
                "BLOCK":[g['BLOCK']]
            })
            emptydata = emptydata.append(new_result)
Answer 0 (score: 1)
The problem here is the depth of the nesting: your nested for loops never reach the inner layer. Your output is a list whose elements are themselves lists of nested dictionaries. When you loop over geo_set and try p['addressComponents'], it fails because p is a list of nested dictionaries, not the dictionary you expected. You need to loop over p once more to reach each dictionary i, which holds the key 'addressComponents' and contains all the items you want to retrieve:
emptydata = pd.DataFrame({"fromAddress":[], "streetName":[], "suffixType":[], "state":[], "city":[], "zip":[], "BASENAME": [], "CENTLAT": [], "COUNTY":[], "GEOID":[], "NAME":[], "BLKGRP":[], "BLOCK":[]})
for p in geo_set:
    for i in p:
        add_comp = i['addressComponents']
        # '2010 Census Blocks' is a list holding one dictionary, hence the [0]
        census_block = i['geographies']['2010 Census Blocks'][0]
        new_result = pd.DataFrame({
            "fromAddress":[add_comp['fromAddress']],
            "streetName":[add_comp['streetName']],
            "suffixType":[add_comp['suffixType']],
            "state":[add_comp['state']],
            "city":[add_comp['city']],
            "zip":[add_comp['zip']],
            "BASENAME": [census_block['BASENAME']],
            "CENTLAT": [census_block['CENTLAT']],
            "COUNTY": [census_block['COUNTY']],
            "GEOID": [census_block['GEOID']],
            "NAME": [census_block['NAME']],
            "BLKGRP": [census_block['BLKGRP']],
            "BLOCK": [census_block['BLOCK']]
        })
        # Note: DataFrame.append was removed in pandas 2.0; on newer versions
        # use pd.concat([emptydata, new_result]) instead.
        emptydata = emptydata.append(new_result)
Output of emptydata:
   BASENAME BLKGRP BLOCK      CENTLAT COUNTY            GEOID        NAME  \
0      4003      4  4003  +43.7156677    031  080300028024003  Block 4113
0      4003      4  4003  +43.7156677    033  080330028024113  Block 4233

      city fromAddress state streetName suffixType    zip
0  BOULDER           1    CO   REVEREND        AVE  80211
0   DENVER           1    CO   REVEREND        AVE  80209
For reference, errors like this are fairly easy to debug. The message you received,
TypeError: list indices must be integers or slices, not str
is a good hint that an indexing operation went wrong. Indexing and slicing a list use the [] syntax, so what else uses the same syntax? Dictionary keys, i.e. p['addressComponents']. If you had tried:
for p in geo_set:
    print(p['addressComponents'])
you would have received the same error. At that point you have narrowed down the source of the error and can work through the data step by step from there.
If you don't want your code to grow this large, here is a dictionary-driven approach:
df_dict = {}
df_cols = ["fromAddress", "streetName", "suffixType", "state", "city", "zip", "BASENAME", "CENTLAT", "COUNTY", "GEOID", "NAME", "BLKGRP", "BLOCK"]
for p in geo_set:
    for i in p:
        # collect the wanted address fields, one list per column
        for key, item in i['addressComponents'].items():
            if key in df_cols:
                df_dict.setdefault(key,[]).append(item)
        # collect the wanted census block fields the same way
        for d in i['geographies']['2010 Census Blocks']:
            for key, item in d.items():
                if key in df_cols:
                    df_dict.setdefault(key,[]).append(item)
emptydata = pd.DataFrame.from_dict(df_dict)
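The output is the same, and you don't end up creating so many temporary DataFrame objects. The caveat is that the DataFrame setup is now less readable.
Again, keep track of what is a list and what is a dictionary in your data, and iterate accordingly.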
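One way to keep track of which level is a list and which is a dictionary is to print the container types level by level before writing any loops. The helper below is only a sketch: the name describe and the trimmed sample (standing in for geo_set) are made up for illustration and are not part of censusgeocode or pandas.

```python
def describe(obj, depth=0, max_depth=3):
    """Return lines summarising the container type at each nesting level."""
    pad = "  " * depth
    if isinstance(obj, dict):
        lines = [f"{pad}dict with keys {sorted(obj)}"]
        children = list(obj.values())
    elif isinstance(obj, list):
        lines = [f"{pad}list of {len(obj)} item(s)"]
        children = obj[:1]  # assume the first element is representative
    else:
        return [f"{pad}{type(obj).__name__}"]
    if depth < max_depth:
        for child in children:
            # only descend into containers; scalars are not listed
            if isinstance(child, (dict, list)):
                lines.extend(describe(child, depth + 1, max_depth))
    return lines

# Hypothetical, heavily trimmed stand-in for geo_set from the question.
sample = [[{
    "addressComponents": {"city": "BOULDER", "zip": "80211"},
    "geographies": {"2010 Census Blocks": [{"GEOID": "080300028024003"}]},
}]]

summary = describe(sample)
print("\n".join(summary))
```

The first two lines of the summary both report a list, which is exactly why p['addressComponents'] fails: one more level of iteration is needed before a dictionary appears.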
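Answer 1 (score: 0)
You can do it like this:
emptydata = pd.DataFrame([{
    "fromAddress": i['addressComponents']['fromAddress'],
    "streetName": i['addressComponents']['streetName'],
    "suffixType": i['addressComponents']['suffixType'],
    "state": i['addressComponents']['state'],
    "city": i['addressComponents']['city'],
    "zip": i['addressComponents']['zip']
} for p in geo_set for i in p])  # p is a list, so iterate it to reach each dictionary i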
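The same idea can be extended to the census block fields by collecting plain dictionaries and building the DataFrame once at the end, which also avoids DataFrame.append (removed in pandas 2.0). This is a sketch under assumptions: to_rows, ADDRESS_COLS and BLOCK_COLS are names invented here, and the sample data is a trimmed stand-in for geo_set. It uses dict.get so that a layer that came back with only a 'status' error message, like the 'Census Tracts' entry in the question's data, yields missing values instead of a KeyError.

```python
import pandas as pd

ADDRESS_COLS = ["fromAddress", "streetName", "suffixType", "state", "city", "zip"]
BLOCK_COLS = ["BASENAME", "CENTLAT", "COUNTY", "GEOID", "NAME", "BLKGRP", "BLOCK"]

def to_rows(geo_set):
    """Flatten the list-of-lists-of-dicts structure into one dict per match."""
    rows = []
    for matches in geo_set:          # outer list: one entry per address
        for match in matches:        # inner list: the matches for that address
            address = match.get("addressComponents", {})
            blocks = match.get("geographies", {}).get("2010 Census Blocks", [])
            block = blocks[0] if blocks else {}
            row = {col: address.get(col) for col in ADDRESS_COLS}
            row.update({col: block.get(col) for col in BLOCK_COLS})
            rows.append(row)
    return rows

# Hypothetical, trimmed stand-in for geo_set; the second record mimics a
# layer that returned only an error status.
sample = [
    [{"addressComponents": {"fromAddress": "1", "streetName": "REVEREND",
                            "suffixType": "AVE", "state": "CO",
                            "city": "BOULDER", "zip": "80211"},
      "geographies": {"2010 Census Blocks": [{"BASENAME": "4003", "BLKGRP": "4",
                                              "BLOCK": "4003", "COUNTY": "031",
                                              "GEOID": "080300028024003",
                                              "NAME": "Block 4113",
                                              "CENTLAT": "+43.7156677"}]}}],
    [{"addressComponents": {"fromAddress": "1", "streetName": "REVEREND",
                            "suffixType": "AVE", "state": "CO",
                            "city": "DENVER", "zip": "80209"},
      "geographies": {"2010 Census Blocks": [{"status": "Layer query encountered an error"}]}}],
]

df = pd.DataFrame(to_rows(sample), columns=ADDRESS_COLS + BLOCK_COLS)
print(df)
```

Building the frame in one call also gives a clean 0..n-1 index rather than the repeated 0 index that comes from appending single-row frames.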