I want to pull data from URLs that meet a specific date requirement shown in the URL structure, and then put that information into CSVs to use locally.
http://web.mta.info/developers/data/nyct/turnstile/turnstile_190629.txt
The six-digit sequence at the end of the URL is a year-month-day indicator: 190629
I'm collecting data from 2016 through 2019 (16-19), March through June (03-06). If a URL exists, I create a CSV for it, and then combine them all into one CSV to feed into a pandas dataframe.
This works, but it's suuuuuper slow, and I know it's not the most Pythonic way to do it.
import requests
import pandas as pd
import itertools
date_list = [['16', '17', '18', '19'],
             ['03', '04', '05', '06'],
             ['01', '02', '03', '04', '05', '06', '07', '08', '09', '10',
              '11', '12', '13', '14', '15', '16', '17', '18', '19', '20',
              '21', '22', '23', '24', '25', '26', '27', '28', '29', '30', '31']]
date_combo = []
# - create year-month-day combos
# - link: https://stackoverflow.com/questions/798854/all-combinations-of-a-list-of-lists
for sub_list in itertools.product(*date_list):
    date_combo.append(sub_list)
url_lead = 'http://web.mta.info/developers/data/nyct/turnstile/turnstile_'
url_list = []
# - this checks the url is valid and adds them to a list
for year, month, day in date_combo:
    concat_url = url_lead + year + month + day + '.txt'
    response = requests.get(concat_url)
    if response.status_code == 200:
        # ---- creates a list of active urls
        url_list.append(concat_url)
        # ---- this creates individual csvs ---- change path for saving locally
        # ---- filename is date
        df = pd.read_csv(concat_url, header=0, sep=',')
        df.to_csv(r'/Users/.../GitHub/' + year + month + day + '.csv')
# - this creates a master df for all urls
dfs = [pd.read_csv(url, header=0, sep=',') for url in url_list]
df = pd.concat(dfs, ignore_index=True)
df.to_csv(r'/Users/.../GitHub/seasonal_mta_data_01.csv')
My code is running as expected, but I'd welcome any clean-up suggestions you have!
Answer 0 (score: 1)
There isn't much I can think of. Here are a few things I would do differently:
# more concise construction of date_combo
date_list = [range(16, 20), range(3, 7), range(1, 32)]
date_combo = list(itertools.product(*date_list))
url_lead = 'http://web.mta.info/developers/data/nyct/turnstile/turnstile_'
url_list = []
dfs = []
# - this checks the url is valid and adds them to a list
for year, month, day in date_combo:
    # year, month, day are integers, so we use an f-string here
    concat_url = f'{url_lead}{year}{month:02}{day:02}.txt'
    response = requests.get(concat_url)
    if response.status_code == 200:
        url_list.append(concat_url)
        # append to dfs and save csv
        dfs.append(pd.read_csv(concat_url, header=0, sep=','))
        dfs[-1].to_csv(f'/Users/.../GitHub/{year}{month:02}{day:02}.csv')
# we don't need to request the txt files again:
df = pd.concat(dfs, ignore_index=True)
df.to_csv(r'/Users/.../GitHub/seasonal_mta_data_01.csv')
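One extra tweak on top of this (my addition, not part of the answer above): requests.get has already downloaded each file's body, so pandas can parse that text directly from memory instead of fetching the same URL a second time inside read_csv. A minimal sketch of the idea, also reusing a requests.Session so the TCP connection is kept alive across requests (url_lead and date_combo as defined above):
import io

session = requests.Session()  # one connection reused across all requests
url_list = []
dfs = []
for year, month, day in date_combo:
    concat_url = f'{url_lead}{year}{month:02}{day:02}.txt'
    response = session.get(concat_url)
    if response.status_code == 200:
        url_list.append(concat_url)
        # parse the body we already downloaded instead of re-fetching the url
        dfs.append(pd.read_csv(io.StringIO(response.text)))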
Answer 1 (score: 0)
@Quang Huang's answer is excellent. To be honest, I've never used anything like itertools.product(*date_list), so I'll generate the dates a different way.
d = pd.date_range(start='2016/03/01', end='2019/06/30').strftime('%Y%m%d')
# keep only March-June; the range above also covers July-February of the in-between years
dates = [i[2:] for i in d if i[4:6] in ('03', '04', '05', '06')]
# dates[:2]
# ['160301', '160302']
So then:
for date in dates:
    concat_url = f'{url_lead}{date}.txt'
    ...
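One last side note (again my addition, not from either answer): strftime accepts a lowercase %y, which formats the year without the century, so the [2:] slicing can be dropped. A short sketch combining that with the March-June restriction from the question:
d = pd.date_range(start='2016/03/01', end='2019/06/30')
# lowercase %y gives the two-digit year directly, so no slicing is needed
dates = d[d.month.isin([3, 4, 5, 6])].strftime('%y%m%d')
# dates[:2] -> Index(['160301', '160302'], dtype='object')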