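The purpose of this code is to scrape a list of URLs and extract a data table from each one.
Each table is converted to a pandas DataFrame: the date is parsed, unneeded columns are dropped, and the remaining columns are renamed. The frames are then concatenated into a single DataFrame indexed by date, so that the data can be sorted by date and events that happen at the same time end up in the same row. Raw data before concatenation: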
                Release Date  Argentina Economic Activity YoY
0  2018-10-25 21:00:00+02:00                            -1.6%
1  2018-09-26 21:00:00+02:00                            -2.7%
2  2018-08-23 21:00:00+02:00                            -6.7%
3  2018-07-24 21:00:00+02:00                            -5.8%
4  2018-06-26 21:00:00+02:00                            -0.9%
                Release Date  Argentina Gross Domestic Product (GDP) YoY
0  2018-09-19 22:00:00+02:00                                       -4.2%
1  2018-06-19 21:00:00+02:00                                        3.6%
2  2018-03-21 21:00:00+02:00                                        3.9%
3  2017-12-20 22:00:00+02:00                                        4.2%
4  2017-09-21 21:00:00+02:00                                        2.7%
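After concatenation, however, different dates end up in the same row. With three tables, for example, I find three dates in the first row, three more in the second row, and so on.
Like this: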
2018-01-24 22:00:00+02:00, 2016-06-29 21:00:00... 3.9% 0.5%
(2018-02-28 22:00:00+02:00, 2016-09-22 21:00:00... 2.0% -3.4%
(2018-03-28 21:00:00+02:00, 2016-12-22 22:00:00... 4.1% -3.8%
(2018-04-24 21:00:00+02:00, 2017-03-21 21:00:00... 5.1% -2.1%
Here is the code:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as ec
import pandas as pd
from datetime import datetime
from tzlocal import get_localzone
import time


class DataEngine:
    def __init__(self):
        self.urls = open(r"C:\Users\Sayed\Desktop\sample.txt").readlines()
        self.driver = webdriver.Chrome(r"D:\Projects\Tutorial\Driver\chromedriver.exe")
        self.wait = WebDriverWait(self.driver, 10)
        self.time = time.time()

    def title(self):
        names = []
        for url in self.urls:
            self.driver.get(url)
            title = self.driver.title
            names.append(title)
        return names

    def table(self):
        DataFrames = []
        for url in self.urls:
            self.driver.get(url)
            while True:
                try:
                    item = self.wait.until(
                        ec.visibility_of_element_located((By.XPATH, '//*[contains(@id,"showMoreHistory")]/a')))
                    self.driver.execute_script("arguments[0].click();", item)
                except Exception:
                    break
            df = pd.DataFrame(columns=['Release Date', 'Time', 'Actual', 'Forecast', 'Previous'])
            pos = 0
            for table in self.wait.until(
                    ec.visibility_of_all_elements_located((By.XPATH, '//*[contains(@id,"eventHistoryTable")]//tr'))):
                data = [item.text for item in table.find_elements_by_xpath(".//*[self::td]")]
                if data:
                    df.loc[pos] = data[0:5]
                    pos += 1
            df = df.head(10)
            DataFrames.append(df)
        return DataFrames

    def date(self):
        dfs = []
        tables = self.table()
        for df in tables:
            Dates = []
            df["Date"] = df["Release Date"].apply(lambda x: x[:12]) + " " + df["Time"]
            for date in df["Date"]:
                date = datetime.strptime(date.strip(), '%b %d, %Y %H:%M')
                Dates.append(date)
            df["Date"] = Dates
            df['Date'] = df['Date'].dt.tz_localize('US/Eastern').dt.tz_convert(get_localzone())
            df = df[['Date', 'Actual', 'Forecast', 'Previous', 'Release Date', 'Time']]
            df = df.drop(df.columns[-4:], axis=1).reset_index(drop=True)
            dfs.append(df)
        return dfs

    def rename(self):
        FinalDataFrames = []
        tables = self.date()
        names = self.title()
        for name, table in zip(names, tables):
            table.rename(columns={'Date': 'Release Date', 'Actual': name}, inplace=True)
            table['Release Date'] = pd.to_datetime(table['Release Date'])
            FinalDataFrames.append(table)
        return FinalDataFrames

    def update(self):
        dfs = self.rename()
        for df in dfs:
            last_read = df.iloc[0, 0]
            latest_release_date = self.driver.find_element_by_xpath('//*[@id="releaseInfo"]/span[1]/div').text
            latest_release_time = self.driver.find_elements_by_css_selector('td.left')[1].text
            latest = latest_release_date + ' ' + latest_release_time
            latest = pd.to_datetime(latest)
            latest_release = latest.tz_localize('US/Eastern').tz_convert(get_localzone())
            if last_read == latest_release:
                pass
            else:
                self.rename()

    def final_df(self):
        self.update()
        while True:
            dfs = self.rename()
            df = pd.concat(dfs, axis=1, join='outer')
            df = df.set_index('Release Date')
            df = df.sort_index(ascending=True)
            print('fin', time.time() - self.time)
            print(df)
            df.to_csv('FinalDF.csv')


if __name__ == "__main__":
    DataEngine().final_df()
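Answer 0 (score: 1)
It looks like you are creating dataframes with a numeric index that runs from 0. When you concatenate them along the columns (axis=1), Pandas aligns records that have the same index value. You should set the date as the index before concatenating, which gives Pandas the chance to align records that have the same date.
Here is a simplified example. Let's create two dataframes with dates and values: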
>>> df1 = pd.DataFrame([['2018-10-01', 3.1], ['2018-10-03', 5.5]],
...                    columns=['date', 'growth %'])
>>> df1
         date  growth %
0  2018-10-01       3.1
1  2018-10-03       5.5
>>> df2 = pd.DataFrame([['2018-10-01', 100], ['2018-10-02', 200]],
...                    columns=['date', 'items'])
>>> df2
         date  items
0  2018-10-01    100
1  2018-10-02    200
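If we concatenate them directly, Pandas aligns records that have the same index value. The result has two date columns, and the records are not properly aligned on the time axis: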
>>> pd.concat([df1, df2], axis=1)
         date  growth %        date  items
0  2018-10-01       3.1  2018-10-01    100
1  2018-10-03       5.5  2018-10-02    200
This is not what you want.
The first step is to convert each dataframe's date column to datetime objects and set it as the index:
>>> df1['date'] = pd.to_datetime(df1['date'])
>>> df1 = df1.set_index('date')
>>> df1
            growth %
date
2018-10-01       3.1
2018-10-03       5.5
>>> df2['date'] = pd.to_datetime(df2['date'])
>>> df2 = df2.set_index('date')
>>> df2
            items
date
2018-10-01    100
2018-10-02    200
Concatenation now works correctly:
>>> pd.concat([df1, df2], axis=1)
            growth %  items
date
2018-10-01       3.1  100.0
2018-10-02       NaN  200.0
2018-10-03       5.5    NaN
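Applied to a list of frames like the one built in the question, the same idea might look like the sketch below. The helper name concat_by_date and the two small frames are hypothetical, and the values are illustrative only; the sketch also assumes that the dates within each frame are unique, since Pandas cannot align a duplicated index.

```python
import pandas as pd


def concat_by_date(frames, date_col="Release Date"):
    """Index every frame by its date column, then join them side by side."""
    indexed = [frame.set_index(date_col) for frame in frames]
    # With the date as index, concat aligns rows on matching dates.
    return pd.concat(indexed, axis=1, join="outer").sort_index()


# Illustrative values only, not real releases.
frame_a = pd.DataFrame({
    "Release Date": pd.to_datetime(["2018-10-03", "2018-10-01"]),
    "indicator_a": ["5.5%", "3.1%"],
})
frame_b = pd.DataFrame({
    "Release Date": pd.to_datetime(["2018-10-02", "2018-10-01"]),
    "indicator_b": ["2.0%", "1.0%"],
})

combined = concat_by_date([frame_a, frame_b])
print(combined)
```

In the question's final_df method, this would mean calling set_index('Release Date') on each frame before pd.concat rather than on the concatenated result.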
You don't actually need to convert the date column to datetime. It also works with strings:
>>> df1 = pd.DataFrame(...)
>>> df2 = pd.DataFrame(...)
>>> pd.concat([df1.set_index('date'), df2.set_index('date')], axis=1)
            growth %  items
2018-10-01       3.1  100.0
2018-10-02       NaN  200.0
2018-10-03       5.5    NaN
All that is required is that every dataframe is indexed by date. A datetime index does, however, let you slice and resample the time series.
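As a rough illustration of what a datetime index makes possible, the sketch below rebuilds df1 from the example above, then slices it by a date range and resamples it to daily frequency. The variable names first_two_days and daily are made up for this sketch.

```python
import pandas as pd

# Same data as df1 in the example above, indexed by datetime.
df1 = pd.DataFrame([['2018-10-01', 3.1], ['2018-10-03', 5.5]],
                   columns=['date', 'growth %'])
df1['date'] = pd.to_datetime(df1['date'])
df1 = df1.set_index('date')

# Label-based slicing by date range (both endpoints are inclusive).
first_two_days = df1.loc['2018-10-01':'2018-10-02']

# Resample to daily frequency; days without a record become NaN.
daily = df1.resample('D').mean()

print(first_two_days)
print(daily)
```

With a plain string index, the resample call would raise an error, because resampling needs a datetime-like index.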