Python: merging several DataFrames by date

Time: 2018-11-01 12:30:07

Tags: python pandas date dataframe

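The purpose of this code is to scrape a set of URLs and extract a data table from each one.

Each table is converted to a pandas DataFrame, the date is parsed, unneeded columns are dropped, and the remaining columns are renamed. The frames are then merged into one unified DataFrame indexed by date, so that the data can be sorted by date and events that happened at the same time end up in the same row. The raw data before concatenation: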

               Release Date Argentina Economic Activity YoY
0 2018-10-25 21:00:00+02:00                           -1.6%
1 2018-09-26 21:00:00+02:00                           -2.7%
2 2018-08-23 21:00:00+02:00                           -6.7%
3 2018-07-24 21:00:00+02:00                           -5.8%
4 2018-06-26 21:00:00+02:00                           -0.9%

               Release Date Argentina Gross Domestic Product (GDP) YoY
0 2018-09-19 22:00:00+02:00                                      -4.2%
1 2018-06-19 21:00:00+02:00                                       3.6%
2 2018-03-21 21:00:00+02:00                                       3.9%
3 2017-12-20 22:00:00+02:00                                       4.2%
4 2017-09-21 21:00:00+02:00                                       2.7%

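After concatenation, however, different dates end up in the same row. With three tables, say, I find three dates in the first row, three more in the second row, and so on.

Like this: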

(2018-01-24 22:00:00+02:00, 2016-06-29 21:00:00...                            3.9%                                       0.5%
(2018-02-28 22:00:00+02:00, 2016-09-22 21:00:00...                            2.0%                                      -3.4%
(2018-03-28 21:00:00+02:00, 2016-12-22 22:00:00...                            4.1%                                      -3.8%
(2018-04-24 21:00:00+02:00, 2017-03-21 21:00:00...                            5.1%                                      -2.1%

The code is as follows:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as ec
import pandas as pd
from datetime import datetime
from tzlocal import get_localzone
import time


class DataEngine:
    def __init__(self):
        self.urls = open(r"C:\Users\Sayed\Desktop\sample.txt").readlines()
        self.driver = webdriver.Chrome(r"D:\Projects\Tutorial\Driver\chromedriver.exe")
        self.wait = WebDriverWait(self.driver, 10)
        self.time = time.time()

    def title(self):
        names = []
        for url in self.urls:
            self.driver.get(url)
            title = self.driver.title
            names.append(title)
        return names

    def table(self):
        DataFrames = []
        for url in self.urls:
            self.driver.get(url)
            while True:
                try:
                    item = self.wait.until(
                        ec.visibility_of_element_located((By.XPATH, '//*[contains(@id,"showMoreHistory")]/a')))
                    self.driver.execute_script("arguments[0].click();", item)
                except Exception:
                    break

            df = pd.DataFrame(columns=['Release Date', 'Time', 'Actual', 'Forecast', 'Previous'])
            pos = 0
            for table in self.wait.until(
                    ec.visibility_of_all_elements_located((By.XPATH, '//*[contains(@id,"eventHistoryTable")]//tr'))):
                data = [item.text for item in table.find_elements_by_xpath(".//*[self::td]")]
                if data:
                    df.loc[pos] = data[0:5]
                    pos += 1
            df = df.head(10)
            DataFrames.append(df)
        return DataFrames

    def date(self):

        dfs = []
        tables = self.table()
        for df in tables:
            Dates = []
            df["Date"] = df["Release Date"].apply(lambda x: x[:12]) + " " + df["Time"]
            for date in df["Date"]:
                date = datetime.strptime(date.strip(), '%b %d, %Y %H:%M')
                Dates.append(date)
            df["Date"] = Dates
            df['Date'] = df['Date'].dt.tz_localize('US/Eastern').dt.tz_convert(get_localzone())
            df = df[['Date', 'Actual', 'Forecast', 'Previous', 'Release Date', 'Time']]
            df = df.drop(df.columns[-4:], axis=1).reset_index(drop=True)

            dfs.append(df)

        return dfs

    def rename(self):
        FinalDataFrames = []
        tables = self.date()
        names = self.title()
        for name, table in zip(names, tables):
            table.rename(columns={'Date': 'Release Date', 'Actual': name}, inplace=True)
            table['Release Date'] = pd.to_datetime(table['Release Date'])
            FinalDataFrames.append(table)
        return FinalDataFrames

    def update(self):
        dfs = self.rename()
        for df in dfs:
            last_read = df.iloc[0, 0]
            latest_release_date = self.driver.find_element_by_xpath('//*[@id="releaseInfo"]/span[1]/div').text
            latest_release_time = self.driver.find_elements_by_css_selector('td.left')[1].text
            latest = latest_release_date + ' ' + latest_release_time
            latest = pd.to_datetime(latest)
            latest_release = latest.tz_localize('US/Eastern').tz_convert(get_localzone())
            if last_read == latest_release:
                pass
            else:
                self.rename()

    def final_df(self):
        self.update()
        while True:
            dfs = self.rename()
            df = pd.concat(dfs, axis=1, join='outer')
            df = df.set_index('Release Date')
            df = df.sort_index(ascending=True)
            print('fin', time.time() - self.time)
            print(df)
            df.to_csv('FinalDF.csv')


if __name__ == "__main__":
    DataEngine().final_df()

1 Answer:

Answer 0 (score: 1)

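It looks like you are creating DataFrames with a numeric index running from 0. When you concatenate them along the columns (axis=1), pandas combines the records that share the same index value. You should set the date as the index before concatenating, which gives pandas the chance to combine records that have the same date.

Here is a simplified example. Let's create two DataFrames, each with a date and a value: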

>>> import pandas as pd
>>> df1 = pd.DataFrame([['2018-10-01', 3.1], ['2018-10-03', 5.5]],
...                    columns=['date', 'growth %'])
>>> df1
         date  growth %
0  2018-10-01       3.1
1  2018-10-03       5.5
>>> df2 = pd.DataFrame([['2018-10-01', 100], ['2018-10-02', 200]],
...                    columns=['date', 'items'])
>>> df2
         date  items
0  2018-10-01    100
1  2018-10-02    200

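If we concatenate them directly, pandas combines the records that share the same index value, so the result has two date columns and the records are not aligned along the time axis: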

>>> pd.concat([df1, df2], axis=1)
         date  growth %        date  items
0  2018-10-01       3.1  2018-10-01    100
1  2018-10-03       5.5  2018-10-02    200

This is not what you want.
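One way to confirm this diagnosis is to inspect the index of each frame and compare the two date columns after the naive concatenation. The sketch below reuses the two example frames from above; the names `combined`, `dates` and `mismatched` are only illustrative.

```python
import pandas as pd

df1 = pd.DataFrame([['2018-10-01', 3.1], ['2018-10-03', 5.5]],
                   columns=['date', 'growth %'])
df2 = pd.DataFrame([['2018-10-01', 100], ['2018-10-02', 200]],
                   columns=['date', 'items'])

# Both frames carry the default positional index (0, 1, ...),
# so that is what concat(axis=1) aligns on.
print(type(df1.index).__name__, type(df2.index).__name__)

combined = pd.concat([df1, df2], axis=1)

# 'date' now appears twice; selecting it returns a two-column frame.
dates = combined['date']

# True wherever a row pairs two different dates.
mismatched = dates.iloc[:, 0] != dates.iloc[:, 1]
print(mismatched.tolist())
```

Any `True` in `mismatched` marks a row where values from different dates were glued together by position.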

The first step is to convert each DataFrame's date column to datetime objects and set it as the index:

>>> df1['date'] = pd.to_datetime(df1['date'])
>>> df1 = df1.set_index('date')
>>> df1
            growth %
date                
2018-10-01       3.1
2018-10-03       5.5
>>> df2['date'] = pd.to_datetime(df2['date'])
>>> df2 = df2.set_index('date')
>>> df2
            items
date             
2018-10-01    100
2018-10-02    200

Concatenation now works as intended:

>>> pd.concat([df1, df2], axis=1)
            growth %  items
date                       
2018-10-01       3.1  100.0
2018-10-02       NaN  200.0
2018-10-03       5.5    NaN
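Applied to the frames in the question, the same idea might look like the sketch below. It is not the asker's scraping code: the frames `activity` and `gdp` are hypothetical stand-ins built by hand from the first three rows of each sample table in the question, and `indexed` and `merged` are illustrative names. The point is only that 'Release Date' becomes the index of every frame before `pd.concat` is called, instead of `set_index` being called on the concatenated result.

```python
import pandas as pd

# Hypothetical stand-ins for the scraped tables (values copied from the question).
activity = pd.DataFrame({
    'Release Date': ['2018-10-25 21:00:00+02:00',
                     '2018-09-26 21:00:00+02:00',
                     '2018-08-23 21:00:00+02:00'],
    'Argentina Economic Activity YoY': ['-1.6%', '-2.7%', '-6.7%'],
})
gdp = pd.DataFrame({
    'Release Date': ['2018-09-19 22:00:00+02:00',
                     '2018-06-19 21:00:00+02:00',
                     '2018-03-21 21:00:00+02:00'],
    'Argentina Gross Domestic Product (GDP) YoY': ['-4.2%', '3.6%', '3.9%'],
})
dfs = [activity, gdp]

# Index every frame by its release date first ...
indexed = []
for df in dfs:
    df = df.copy()
    df['Release Date'] = pd.to_datetime(df['Release Date'])
    indexed.append(df.set_index('Release Date'))

# ... then concatenate along the columns and sort by date.
merged = pd.concat(indexed, axis=1, join='outer').sort_index()
print(merged)
```

Note that rows are combined only when the timestamps match exactly. In this sample no two releases share a timestamp, so every row has a value in one column and NaN in the other.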

You don't actually need to convert the date column to datetime. It works with strings too:

>>> df1 = pd.DataFrame(...)
>>> df2 = pd.DataFrame(...)
>>> pd.concat([df1.set_index('date'), df2.set_index('date')], axis=1)
            growth %  items
2018-10-01       3.1  100.0
2018-10-02       NaN  200.0
2018-10-03       5.5    NaN

All that is required is that each DataFrame be indexed by date. A DatetimeIndex does, however, let you slice and resample the time series.
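As a rough illustration of that last point, the sketch below rebuilds the merged example frame with a DatetimeIndex and then slices it by date strings and resamples it. The weekly frequency and the use of `mean()` are arbitrary choices made for the example, and `first_two_days` and `weekly` are illustrative names.

```python
import pandas as pd

df1 = pd.DataFrame({'date': ['2018-10-01', '2018-10-03'], 'growth %': [3.1, 5.5]})
df2 = pd.DataFrame({'date': ['2018-10-01', '2018-10-02'], 'items': [100, 200]})

# Convert the date strings so the merged frame gets a DatetimeIndex.
for df in (df1, df2):
    df['date'] = pd.to_datetime(df['date'])

merged = pd.concat([df1.set_index('date'), df2.set_index('date')],
                   axis=1).sort_index()

# Label-based slicing with date strings; both endpoints are included.
first_two_days = merged.loc['2018-10-01':'2018-10-02']
print(first_two_days)

# Resample into weekly bins; NaN values are skipped by mean().
weekly = merged.resample('W').mean()
print(weekly)
```

With a plain string index the concatenation still lines up, but `resample` is not available, since it requires a datetime-like index.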