Iterating over an xarray and a DataFrame faster

I am new to Python and don't know all of its ins and outs.

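I want to iterate over a DataFrame (2D) and assign some of its values to an xarray (3D). The coordinates of my xarray are company tickers (1), financial variables (2), and daily dates (3). Each company has its own DataFrame: its columns are some of the same financial variables as in the xarray, and its index consists of quarterly dates.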

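My goal is to take the already generated DataFrame for each company, look up the value in the column for a given variable and the row for a given date, and assign it to the corresponding position in the xarray.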

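Since some dates do not appear in the DataFrame's index (there are only 4 dates per calendar year), I want to assign to the xarray either 0 or the value from the previous date, if that value is not 0. I tried doing this with nested for loops, but it takes about 20 seconds just to go through all the dates for a single variable.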

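My list of dates has about 8000 dates, the list of variables has about 30 variables, and the list of companies has about 800 companies. If I looped over all of them, the nested for loops would take days to finish. Is there a faster way to assign these values to the xarray? My guess is something like iterrows() or iteritems(), but for xarray. Here is sample code from my program, with shorter lists of companies and variables: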

import pandas as pd
from datetime import datetime, date, timedelta
import numpy as np
import xarray as xr
import time

start_time = time.time()

# We create the df. This is an auxiliary made-up df. It's a shorter version of the real df.
# The real df I want to use is much larger and comes from an external method.
cols = ['cashAndCashEquivalents', 'shortTermInvestments', 'cashAndShortTermInvestments', 'totalAssets',
        'totalLiabilities', 'totalStockholdersEquity', 'netIncome', 'freeCashFlow']
rows = []
for year in range(1989, 2020):
    for month, day in zip([3, 6, 9, 12], [31, 30, 30, 31]):
        rows.append(date(year, month, day))
a = np.random.randint(100, size=(len(rows), len(cols)))
df = pd.DataFrame(data=a, columns=cols)
df.insert(column='date', value=rows, loc=0)
# Convert the date column to datetime so that the values can be looked up later
df['date'] = pd.to_datetime(df['date'])
df.set_index('date', inplace=True)

# Coordinates for the xarray:
companies = ['AAPL']  # This is actually longer (around 800 companies), but for the sake of the question, it is limited to just one company.
variables = ['totalAssets', 'totalLiabilities', 'totalStockholdersEquity']  # Same as with the companies (around 30 variables).
first_date = date(1998, 3, 25)
last_date = date.today() + timedelta(-300)
dates = pd.date_range(start=first_date, end=last_date).tolist()

# We create a zero xarray, so that we can later fill it up with values:
z = np.zeros((len(companies), len(variables), len(dates)))
ds = xr.DataArray(z, coords=[companies, variables, dates],
                  dims=['companies', 'variables', 'dates'])

# We assign values from the df to the ds
for company in companies:
    for variable in variables:
        first_value_found = False
        for date in dates:
            # Dates in the df are quarterly dates and dates in the ds are daily dates.
            # We start off by looking for a certain date in the df. If we don't find it, we give it the value 0 in the ds.
            # If we do find it, we assign it the value found in the df and record that the first value has been found.
            # Once the first value has been found, when we don't find a value in the df, instead of 0 we give it the value of the previous date.
            if not first_value_found:
                try:
                    ds.loc[company, variable, date] = df.loc[date, variable]
                    first_value_found = True
                except KeyError:
                    ds.loc[company, variable, date] = 0
            else:
                try:
                    ds.loc[company, variable, date] = df.loc[date, variable]
                except KeyError:
                    ds.loc[company, variable, date] = ds.loc[company, variable, date + timedelta(-1)]

print("My program took", time.time() - start_time, "to run")

The main problem is in the for loops: I have already tested these loops in a separate file, and they appear to be the most time-consuming part.

One possible strategy is to loop over the actual index of the DataFrame rather than over all possible indices:

avail_dates = df.index
for date in avail_dates:
    # Copy the data
    da.loc[company, variables, date:] = df.loc[date, variables]

That's right, you can index both the DataArray and the DataFrame with lists. (Also, I wouldn't use ds as the variable name for something from xarray that is not a Dataset.)
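A minimal sketch of this strategy follows, under some assumptions that are not in the original posts: the per-company DataFrames live in a hypothetical dict called `frames`, the data is a three-quarter toy example, and the values are written into a plain NumPy array that is wrapped in a DataArray at the end, so the broadcasting is explicit. Each quarterly value is written from its date to the end of the range, and later quarters overwrite earlier ones.

```python
import numpy as np
import pandas as pd
import xarray as xr

companies = ["AAPL"]
variables = ["totalAssets", "totalLiabilities"]
dates = pd.date_range("1998-03-25", "1998-10-05")  # daily dates

# Hypothetical container: one quarterly DataFrame per company (toy data).
frames = {
    "AAPL": pd.DataFrame(
        {"totalAssets": [10.0, 20.0, 30.0], "totalLiabilities": [1.0, 2.0, 3.0]},
        index=pd.to_datetime(["1998-03-31", "1998-06-30", "1998-09-30"]),
    )
}

# Start from zeros, so dates before the first quarterly value stay 0.
values = np.zeros((len(companies), len(variables), len(dates)))
for i, company in enumerate(companies):
    df = frames[company]
    # Only the quarterly dates that also exist in the daily range, in order.
    avail_dates = df.index.intersection(dates).sort_values()
    for day in avail_dates:
        start = dates.get_loc(day)
        # Shape (n_variables, 1) broadcasts across all remaining dates.
        values[i, :, start:] = df.loc[day, variables].to_numpy()[:, None]

filled = xr.DataArray(values, coords=[companies, variables, dates],
                      dims=["companies", "variables", "dates"])
```

The inner loop now runs once per quarterly date instead of once per daily date, and every variable is handled in a single assignment.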

However, what you probably want to use is pandas.DataFrame.reindex().
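As a rough sketch of how `reindex()` could replace the loops entirely: reindexing the quarterly DataFrame onto the daily dates with `method="ffill"` carries each value forward, and `fillna(0)` sets the dates before the first quarterly value to 0. The `frames` dict, the `to_daily` helper, and the toy data are assumptions made for illustration, and the quarterly index is assumed to be sorted.

```python
import numpy as np
import pandas as pd
import xarray as xr

companies = ["AAPL"]
variables = ["totalAssets", "totalLiabilities"]
dates = pd.date_range("1998-03-25", "1998-10-05")  # daily dates

# Hypothetical container: one quarterly DataFrame per company (toy data).
frames = {
    "AAPL": pd.DataFrame(
        {"totalAssets": [10.0, 20.0, 30.0], "totalLiabilities": [1.0, 2.0, 3.0]},
        index=pd.to_datetime(["1998-03-31", "1998-06-30", "1998-09-30"]),
    )
}


def to_daily(df, variables, dates):
    """Forward-fill quarterly rows onto daily dates; 0 before the first value."""
    return df[variables].reindex(dates, method="ffill").fillna(0)


# Each daily frame is (dates, variables); transpose to (variables, dates)
# and stack along a new leading axis for the companies.
data = np.stack([to_daily(frames[c], variables, dates).to_numpy().T
                 for c in companies])
da = xr.DataArray(data, coords=[companies, variables, dates],
                  dims=["companies", "variables", "dates"])
```

With this approach the only remaining Python-level loop is over companies; the work per date and per variable is done inside pandas.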

