How to iterate a web-scraping script over a daily time-series object, so as to build a daily time series from a web page

Date: 2019-04-06 22:16:43

Tags: python pandas beautifulsoup time-series

Thanks for taking a look at my question. I've used BeautifulSoup and Pandas to create a script that scrapes projection data from the Federal Reserve's website. Projections come out once a quarter (~3 months). I'd like to write a script that creates a daily time series and checks the Fed's website once a day: if a new projection has been posted, the script adds it to the time series; if there has been no update, the script simply appends the last valid projection to the time series.

From my initial digging, it seems there are external resources I could use to "trigger" the script each day, but I'd prefer to keep everything in pure Python.
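One pure-Python way to run a script once a day is a loop that sleeps until a fixed time of day. This is only a sketch: `run_scraper` stands in for whatever scraping function you write, and the 06:00 run time is an arbitrary choice.

```python
import datetime as dt
import time


def seconds_until(hour=6, minute=0, now=None):
    """Seconds from `now` until the next occurrence of hour:minute."""
    now = now or dt.datetime.now()
    target = now.replace(hour=hour, minute=minute, second=0, microsecond=0)
    if target <= now:
        # Already past today's run time, so schedule for tomorrow
        target += dt.timedelta(days=1)
    return (target - now).total_seconds()


# Main loop: run the scraper once a day at 06:00
# while True:
#     run_scraper()  # hypothetical: your scraping function
#     time.sleep(seconds_until(6, 0))
```

For anything more robust (surviving reboots, logging failures), an OS-level scheduler such as cron is still the usual choice, but the loop above keeps everything in Python.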

The code I've written to do the scraping is below:

from bs4 import BeautifulSoup
import requests
import re
import wget
import pandas as pd 

# Starting url and the indicator (key) for links of interest
url = "https://www.federalreserve.gov/monetarypolicy/fomccalendars.htm" 
key = '/monetarypolicy/fomcprojtabl'

# Cook the soup
page = requests.get(url)
data = page.text
soup = BeautifulSoup(data, 'html.parser')

# Collect the links to projection pages in a list
projections = []
for link in soup.find_all('a', href=re.compile(key)):
    projections.append(link["href"])

# Scrape each projection page into a DataFrame and collect them in a list
decfcasts = []
for i in projections:
    url = "https://www.federalreserve.gov{}".format(i)
    file = wget.download(url)
    df_list = pd.read_html(file)
    fcast = df_list[-1].iloc[:,0:2]
    fcast.columns = ['Target', 'Votes']
    fcast.fillna(0, inplace = True)
    decfcasts.append(fcast)

The code I've written so far puts everything in a list, but the data has no time/date index. I've been thinking about pseudocode for this, and I imagine it would look something like

Create daily time series object
    for each day in time series:
        if day in time series == day of a new projection link:
            run web scraper
        otherwise, append the last available observation to the time series

At least, that's what I have in mind. The final time series will probably end up looking fairly "chunky": many days will carry the same observation, then a "jump" will appear when a new projection comes out, followed by more repeats until the next projection is released.
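That "chunky" carry-forward pattern is exactly what pandas' forward fill produces. A minimal sketch, using made-up projection values and release dates: reindex the quarterly data onto a daily calendar with `method='ffill'`, and each day repeats the last released value until a new one appears.

```python
import pandas as pd

# Hypothetical quarterly projection values, indexed by release date
quarterly = pd.DataFrame(
    {'target': [2.375, 2.375, 2.875]},
    index=pd.to_datetime(['2018-09-26', '2018-12-19', '2019-03-20']))

# Reindex onto a daily calendar, carrying the last observation forward
daily = quarterly.reindex(
    pd.date_range('2018-09-26', '2019-03-25', freq='D'), method='ffill')
```

Every day between releases holds the previous projection, and the series "jumps" on each release date, matching the behavior described above.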

Obviously, any help is greatly appreciated. Either way, thanks in advance!

1 Answer:

Answer 0 (score: 1)

I've edited your code for you. It now gets the date from the URL, and the date is stored in the dataframe as a Period. A projection is processed and appended only if its date is not already present in the dataframe (restored from the pickle).

from bs4 import BeautifulSoup
import requests
import re
import wget
import pandas as pd

# Starting url and the indicator (key) for links of interest
url = "https://www.federalreserve.gov/monetarypolicy/fomccalendars.htm"
key = '/monetarypolicy/fomcprojtabl'

# Cook the soup
page = requests.get(url)
data = page.text
soup = BeautifulSoup(data, 'html.parser')

# Collect the links to projection pages in a list
projections = []
for link in soup.find_all('a', href=re.compile(key)):
    projections.append(link["href"])

# Load past results from pickle; if no pickle exists, initialize an empty dataframe
try:
    decfcasts = pd.read_pickle('decfcasts.pkl')
except FileNotFoundError:
    decfcasts = pd.DataFrame(columns=['target', 'votes', 'date'])


for i in projections:

    # parse date from url
    date = pd.Period(''.join(re.findall(r'\d+', i)), 'D')

    # process projection if it wasn't included in data from pickle
    if date not in decfcasts['date'].values:

        url = "https://www.federalreserve.gov{}".format(i)
        file = wget.download(url)
        df_list = pd.read_html(file)
        fcast = df_list[-1].iloc[:, 0:2]
        fcast.columns = ['target', 'votes']
        fcast.fillna(0, inplace=True)

        # set date time
        fcast.insert(2, 'date', date)
        decfcasts = pd.concat([decfcasts, fcast])

# save to pickle
pd.to_pickle(decfcasts, 'decfcasts.pkl')
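The date-parsing step above relies on the projection URLs ending in a YYYYMMDD stamp. A small standalone check of that parsing, using a hypothetical link of the form the scraper collects:

```python
import re
import pandas as pd

# Example projection link (hypothetical, but follows the site's URL pattern)
link = '/monetarypolicy/fomcprojtabl20190320.htm'

# Same parsing as in the answer: pull the digits out of the URL
# and interpret them as a daily-frequency Period
date = pd.Period(''.join(re.findall(r'\d+', link)), 'D')
```

Storing the date as a `Period` rather than a raw string is what makes the `date not in decfcasts['date'].values` membership test reliable across runs.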