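I am trying to write a function that iterates over a dataset and uses terms from the dataset to query an API. I have isolated the problem to this function. I need it to call the API with a given zip code at least twice before moving on to the next zip code in the same region. Once data has been pulled for at least 20 zip codes in a region, it should move on to the next region and start the process over. However, I can't figure out how to translate that logic into Python. Any help you can provide would be greatly appreciated.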
import pandas as pd

def get_zip(data):
    # Read zip codes as strings, left-padded with zeros to 5 characters
    df = pd.read_csv(data, converters={'zip': lambda x: '{0:0>5}'.format(x)})
    dfs = pd.DataFrame(df[['zip', 'region']])
    regions = dfs['region'].unique().tolist()
    i = 1
    while i < len(regions):
        # print(regions[i])
        zlen = len(df.zip[df.region == '%s' % regions[i]])
        print(zlen)
        print(i)
        if i in range(min(zlen, 20)):
            zipcode = df.zip[df.region == '%s' % regions[i]].iloc[i]
            i += 1
            return zipcode
        else:
            zipcode = df.zip[df.region == '%s' % regions[i]].iloc[i]
            return zipcode

get_zip('metro_data.csv')
metro_data.csv is structured as follows:
zip region
0 29831 Augusta-Richmond County, GA-SC Metro Area
1 29129 Augusta-Richmond County, GA-SC Metro Area
2 30808 Augusta-Richmond County, GA-SC Metro Area
3 29809 Augusta-Richmond County, GA-SC Metro Area
4 29137 Augusta-Richmond County, GA-SC Metro Area
5 29851 Augusta-Richmond County, GA-SC Metro Area
6 30816 Augusta-Richmond County, GA-SC Metro Area
7 30805 Augusta-Richmond County, GA-SC Metro Area
8 29105 Augusta-Richmond County, GA-SC Metro Area
9 30426 Augusta-Richmond County, GA-SC Metro Area
10 29856 Augusta-Richmond County, GA-SC Metro Area
11 29834 Augusta-Richmond County, GA-SC Metro Area
12 29828 Augusta-Richmond County, GA-SC Metro Area
13 30812 Augusta-Richmond County, GA-SC Metro Area
800 31721 Albany, GA Metro Area
801 39842 Albany, GA Metro Area
802 31763 Albany, GA Metro Area
803 31791 Albany, GA Metro Area
804 39870 Albany, GA Metro Area
805 31787 Albany, GA Metro Area
806 31781 Albany, GA Metro Area
813 27801 Rocky Mount, NC Metro Area
814 27804 Rocky Mount, NC Metro Area
815 27886 Rocky Mount, NC Metro Area
816 27803 Rocky Mount, NC Metro Area
817 27856 Rocky Mount, NC Metro Area
818 27891 Rocky Mount, NC Metro Area
819 27882 Rocky Mount, NC Metro Area
820 27809 Rocky Mount, NC Metro Area
821 27864 Rocky Mount, NC Metro Area
822 27557 Rocky Mount, NC Metro Area
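Answer 0 (score: 0)
Change every return to yield, so that the function becomes a generator and resumes where it left off on the next call: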
        if i in range(min(zlen, 20)):
            zipcode = df.zip[df.region == '%s' % regions[i]].iloc[i]
            i += 1
            yield zipcode
        else:
            zipcode = df.zip[df.region == '%s' % regions[i]].iloc[i]
            yield zipcode
Usage:
data = get_zip("metro_data.csv")
firstiter = next(data)
seconditer = next(data)
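Swapping return for yield does not by itself give the "each zip code twice, then the next zip code, then the next region after 20" ordering from the question. Below is a minimal sketch of one way to produce that ordering with a generator. It is not the asker's code: the name iter_zip_calls and its parameters are made up for illustration, it reads "at least 20" as "stop after 20", and it works on plain dict rows (for example from csv.DictReader or df.to_dict('records')) rather than on the DataFrame directly.

```python
def iter_zip_calls(rows, calls_per_zip=2, max_zips=20):
    """Yield (region, zip) pairs in the order the API should be called.

    Hypothetical helper: the name and parameters are illustrative only.
    """
    # Collect the distinct zip codes of each region, in file order,
    # keeping at most max_zips per region.
    zips_by_region = {}
    for row in rows:
        zips = zips_by_region.setdefault(row["region"], [])
        if row["zip"] not in zips and len(zips) < max_zips:
            zips.append(row["zip"])

    # Emit every kept zip code calls_per_zip times, region by region.
    for region, zips in zips_by_region.items():
        for zipcode in zips:
            for _ in range(calls_per_zip):
                yield region, zipcode


sample_rows = [
    {"zip": "29831", "region": "Augusta-Richmond County, GA-SC Metro Area"},
    {"zip": "29129", "region": "Augusta-Richmond County, GA-SC Metro Area"},
    {"zip": "30808", "region": "Augusta-Richmond County, GA-SC Metro Area"},
    {"zip": "31721", "region": "Albany, GA Metro Area"},
]

# max_zips=2 only keeps the demo short; the question asks for 20.
calls = list(iter_zip_calls(sample_rows, calls_per_zip=2, max_zips=2))
```

Each element of calls is a (region, zip) pair, so the caller can loop over the generator and issue one API request per element.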
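Answer 1 (score: 0)
You wrote that you want the zip codes that appear at least twice within the same region. These (more precisely, the df rows with these zip codes) can be obtained as follows: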
df2 = df.groupby(['region', 'zip']).filter(lambda gr: len(gr) > 1)
For demonstration purposes I changed your source data and repeated some zip codes, so the result (for my test data) is:
zip region
ind
0 29831 Augusta-Richmond County, GA-SC Metro Area
1 29831 Augusta-Richmond County, GA-SC Metro Area
2 30808 Augusta-Richmond County, GA-SC Metro Area
3 30808 Augusta-Richmond County, GA-SC Metro Area
4 29137 Augusta-Richmond County, GA-SC Metro Area
5 29137 Augusta-Richmond County, GA-SC Metro Area
7 30805 Augusta-Richmond County, GA-SC Metro Area
8 30805 Augusta-Richmond County, GA-SC Metro Area
802 31763 Albany, GA Metro Area
803 31763 Albany, GA Metro Area
805 31787 Albany, GA Metro Area
806 31787 Albany, GA Metro Area
814 27804 Rocky Mount, NC Metro Area
815 27804 Rocky Mount, NC Metro Area
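As you can see, the result contains 4 repeated zip codes in the Augusta-Richmond region, 2 in Albany and 1 in Rocky Mount. The other zip codes are not repeated.
For demonstration purposes I lowered the limit of zip codes per region to 3 (instead of your 20).
Then, to get the "limited" list of zip codes, you can write: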
df2.groupby(['region']).apply(
    lambda x: pd.Series(x.zip.unique()[:3]))\
    .reset_index(level=1, drop=True).rename('zip')
The result is:
region
Albany, GA Metro Area 31763
Albany, GA Metro Area 31787
Augusta-Richmond County, GA-SC Metro Area 29831
Augusta-Richmond County, GA-SC Metro Area 30808
Augusta-Richmond County, GA-SC Metro Area 29137
Rocky Mount, NC Metro Area 27804
Name: zip, dtype: object
As you can see, the number of zip codes for the Augusta-Richmond region has been reduced to 3.
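The same per-region cap can probably also be written without apply. The sketch below is an alternative, not the code used above: it rebuilds a small df2 by hand from the repeated rows shown earlier, drops the duplicate (region, zip) pairs and keeps the first 3 rows of each region with groupby().head(). It returns a DataFrame in the original row order rather than a Series sorted by region.

```python
import pandas as pd

augusta = "Augusta-Richmond County, GA-SC Metro Area"
albany = "Albany, GA Metro Area"
rocky = "Rocky Mount, NC Metro Area"

# Hand-built stand-in for df2 (the rows whose zip code is repeated).
df2 = pd.DataFrame({
    "zip": ["29831", "29831", "30808", "30808", "29137", "29137",
            "30805", "30805", "31763", "31763", "31787", "31787",
            "27804", "27804"],
    "region": [augusta] * 8 + [albany] * 4 + [rocky] * 2,
})

# One row per (region, zip) pair, then at most 3 rows per region.
limited = (df2.drop_duplicates(["region", "zip"])
              .groupby("region")
              .head(3))
```

With the real data the 3 would be replaced by 20.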
So now you have the list of zip codes and can process it however you like, e.g. call some API.
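As a rough sketch of that last step, the loop below calls a function twice for every zip code and groups the responses by zip code. Both query_api and collect_results are hypothetical placeholders, since the question does not say which API is used; the list of zip codes is assumed to come from the limited result above, for example via .tolist().

```python
def query_api(zipcode):
    """Hypothetical placeholder: replace the body with the real request."""
    return {"zip": zipcode}


def collect_results(zipcodes, fetch=query_api, calls_per_zip=2):
    """Call fetch calls_per_zip times per zip code and group the responses."""
    results = {}
    for zipcode in zipcodes:
        results[zipcode] = [fetch(zipcode) for _ in range(calls_per_zip)]
    return results


results = collect_results(["31763", "31787", "27804"])
```

Passing the fetch function as a parameter keeps the loop independent of the actual API, so it can be tried out with a stub first.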