How to extract specific data from a multi-index dataframe (Yahoo stock data)

Time: 2018-09-23 21:18:00

Tags: python pandas multi-index

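Can someone give me a quick, clear lesson on getting specific data points out of the multi-index dataframe below? I have been going through tutorials all day, but none of them helped. This should be simple for anyone who knows pandas.

How do I do the following:

  1. Extract the 'close' for 'AAPL' on the last date in the dataframe

  2. If 'close' > 'open' for 'AAPL' on a particular date, extract all of the data for 'AAPL' and add it to a new dataframe

  3. Add a new column for each symbol (AAPL, FB), labeled 'range', equal to 'high' - 'low' for each day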

symbol      AAPL                                                FB
ohlcv       open    high    low     close   adj     volume      open    high    low     close   adj     volume
      Date                                              
2018-09-17  222.15  222.95  217.27  217.88  217.88  37195100    161.92  162.06  159.77  160.58  160.58  21005300
2018-09-18  217.79  221.85  217.12  218.24  218.24  31571700    159.39  161.76  158.87  160.30  160.30  22465200
2018-09-19  218.50  219.62  215.30  218.37  218.37  27123800    160.08  163.44  159.48  163.06  163.06  19629000
2018-09-20  220.24  222.28  219.15  220.03  220.03  26460800    164.50  166.45  164.47  166.02  166.02  18824200
2018-09-21  220.78  221.36  217.29  217.66  217.66  96246748    166.64  167.25  162.81  162.93  162.93  25956794

Here is the dataframe as a dict, as requested in one of the comments below:

import pandas as pd

df = pd.DataFrame({('AAPL', 'adj_close'): {
  pd.Timestamp('2018-01-02 00:00:00'): 170.3,
  pd.Timestamp('2018-01-03 00:00:00'): 170.27,
  pd.Timestamp('2018-01-04 00:00:00'): 171.07,
  pd.Timestamp('2018-01-05 00:00:00'): 173.01,
  pd.Timestamp('2018-01-08 00:00:00'): 172.37},
 ('AAPL', 'close'): {
  pd.Timestamp('2018-01-02 00:00:00'): 172.26,
  pd.Timestamp('2018-01-03 00:00:00'): 172.23,
  pd.Timestamp('2018-01-04 00:00:00'): 173.03,
  pd.Timestamp('2018-01-05 00:00:00'): 175.0,
  pd.Timestamp('2018-01-08 00:00:00'): 174.35},
 ('AAPL', 'high'): {
  pd.Timestamp('2018-01-02 00:00:00'): 172.3,
  pd.Timestamp('2018-01-03 00:00:00'): 174.55,
  pd.Timestamp('2018-01-04 00:00:00'): 173.47,
  pd.Timestamp('2018-01-05 00:00:00'): 175.37,
  pd.Timestamp('2018-01-08 00:00:00'): 175.61},
 ('AAPL', 'low'): {
  pd.Timestamp('2018-01-02 00:00:00'): 169.26,
  pd.Timestamp('2018-01-03 00:00:00'): 171.96,
  pd.Timestamp('2018-01-04 00:00:00'): 172.08,
  pd.Timestamp('2018-01-05 00:00:00'): 173.05,
  pd.Timestamp('2018-01-08 00:00:00'): 173.93},
 ('AAPL', 'open'): {
  pd.Timestamp('2018-01-02 00:00:00'): 170.16,
  pd.Timestamp('2018-01-03 00:00:00'): 172.53,
  pd.Timestamp('2018-01-04 00:00:00'): 172.54,
  pd.Timestamp('2018-01-05 00:00:00'): 173.44,
  pd.Timestamp('2018-01-08 00:00:00'): 174.35},
 ('AAPL', 'volume'): {
  pd.Timestamp('2018-01-02 00:00:00'): 25555900,
  pd.Timestamp('2018-01-03 00:00:00'): 29517900,
  pd.Timestamp('2018-01-04 00:00:00'): 22434600,
  pd.Timestamp('2018-01-05 00:00:00'): 23660000,
  pd.Timestamp('2018-01-08 00:00:00'): 20567800},
 ('FB', 'adj_close'): {
  pd.Timestamp('2018-01-02 00:00:00'): 181.42,
  pd.Timestamp('2018-01-03 00:00:00'): 184.67,
  pd.Timestamp('2018-01-04 00:00:00'): 184.33,
  pd.Timestamp('2018-01-05 00:00:00'): 186.85,
  pd.Timestamp('2018-01-08 00:00:00'): 188.28},
 ('FB', 'close'): {
  pd.Timestamp('2018-01-02 00:00:00'): 181.42,
  pd.Timestamp('2018-01-03 00:00:00'): 184.67,
  pd.Timestamp('2018-01-04 00:00:00'): 184.33,
  pd.Timestamp('2018-01-05 00:00:00'): 186.85,
  pd.Timestamp('2018-01-08 00:00:00'): 188.28},
 ('FB', 'high'): {
  pd.Timestamp('2018-01-02 00:00:00'): 181.58,
  pd.Timestamp('2018-01-03 00:00:00'): 184.78,
  pd.Timestamp('2018-01-04 00:00:00'): 186.21,
  pd.Timestamp('2018-01-05 00:00:00'): 186.9,
  pd.Timestamp('2018-01-08 00:00:00'): 188.9},
 ('FB', 'low'): {
  pd.Timestamp('2018-01-02 00:00:00'): 177.55,
  pd.Timestamp('2018-01-03 00:00:00'): 181.33,
  pd.Timestamp('2018-01-04 00:00:00'): 184.1,
  pd.Timestamp('2018-01-05 00:00:00'): 184.93,
  pd.Timestamp('2018-01-08 00:00:00'): 186.33},
 ('FB', 'open'): {
  pd.Timestamp('2018-01-02 00:00:00'): 177.68,
  pd.Timestamp('2018-01-03 00:00:00'): 181.88,
  pd.Timestamp('2018-01-04 00:00:00'): 184.9,
  pd.Timestamp('2018-01-05 00:00:00'): 185.59,
  pd.Timestamp('2018-01-08 00:00:00'): 187.2},
 ('FB', 'volume'): {
  pd.Timestamp('2018-01-02 00:00:00'): 18151900,
  pd.Timestamp('2018-01-03 00:00:00'): 16886600,
  pd.Timestamp('2018-01-04 00:00:00'): 13880900,
  pd.Timestamp('2018-01-05 00:00:00'): 13574500,
  pd.Timestamp('2018-01-08 00:00:00'): 17994700}})

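2 Answers:

Answer 0 (score: 0)

You can access columns of a multi-index directly by indexing with a tuple. Since you have not posted the code for your dataframe, try the following snippets and see whether they work: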

  1. df[('AAPL', 'close')] will give you the 'close' column of 'AAPL'. You can sort by date to extract the last date (this assumes the index is named Date, as in the printed frame; df.sort_index(ascending=False) works regardless of the index name):

    df.sort_values('Date', ascending=False).head(1)[('AAPL', 'close')]

  2. To compare and extract all of the 'AAPL' data, you can do the following:

    df[df[('AAPL', 'close')] > df[('AAPL', 'open')]]['AAPL']

    Add the date to the filter condition as well.

  3. There may be a better way, but assigning 'high' minus 'low' to a new ('AAPL', 'range') column, and likewise for 'FB', may still work.

You can add date conditions just as you would with a regular dataframe.
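
For the 'range' column in item 3, plain tuple-key assignment is one option. The following is a minimal sketch, assuming a small hypothetical two-day frame with the same (symbol, field) column layout as the question; the values are copied from the posted dict, and the loop variable name is illustrative only.

```python
import pandas as pd

# Hypothetical two-day sample with (symbol, field) columns; values come
# from the dict posted in the question.
dates = pd.to_datetime(['2018-01-02', '2018-01-03'])
df = pd.DataFrame(
    {
        ('AAPL', 'high'): [172.30, 174.55],
        ('AAPL', 'low'): [169.26, 171.96],
        ('FB', 'high'): [181.58, 184.78],
        ('FB', 'low'): [177.55, 181.33],
    },
    index=dates,
)

# Level 0 of the columns holds the symbols; add one 'range' column per symbol.
for symbol in df.columns.get_level_values(0).unique():
    df[(symbol, 'range')] = df[(symbol, 'high')] - df[(symbol, 'low')]

# New columns are appended at the end; sorting groups them by symbol again.
df = df.sort_index(axis=1)
print(df)
```

The loop avoids hard-coding the symbol list, so the same lines should apply if more tickers are present.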

Answer 1 (score: 0)

IIUC,

  1. Extract the 'close' for 'AAPL' on the last date in the dataframe

Just call df.index.max() to get the latest date, then select AAPL/close:

df.loc[df.index.max(), ('AAPL', 'close')]
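
  2. If 'close' > 'open' for 'AAPL' on a particular date, extract all of the data for 'AAPL' and add it to a new dataframe

Basically, filtering with a mask already returns a data frame, so there is no need to "append to another dataframe".

mask = df.loc[:, ('AAPL', 'close')] > df.loc[:, ('AAPL', 'open')]
df.loc[mask, 'AAPL']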
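
To make the mask step concrete, here is a hedged sketch on a hypothetical three-day sample (values copied from the posted dict). The names up_day, aapl_up and aapl_up_recent, and the cutoff date, are illustrative assumptions rather than anything from the question.

```python
import pandas as pd

# Hypothetical three-day sample with (symbol, field) columns.
dates = pd.to_datetime(['2018-01-02', '2018-01-03', '2018-01-04'])
df = pd.DataFrame(
    {
        ('AAPL', 'open'): [170.16, 172.53, 172.54],
        ('AAPL', 'close'): [172.26, 172.23, 173.03],
        ('FB', 'open'): [177.68, 181.88, 184.90],
        ('FB', 'close'): [181.42, 184.67, 184.33],
    },
    index=dates,
)

# Boolean Series: True on days where AAPL closed above its open.
up_day = df[('AAPL', 'close')] > df[('AAPL', 'open')]

# .loc accepts the boolean Series directly; 'AAPL' selects that symbol's
# sub-frame and drops the symbol level from the columns.
aapl_up = df.loc[up_day, 'AAPL']

# An optional date condition can be combined with & (arbitrary cutoff).
aapl_up_recent = df.loc[up_day & (df.index >= '2018-01-04'), 'AAPL']

print(aapl_up)
print(aapl_up_recent)
```

The result of the filter is already a new dataframe; call .copy() on it if it will be modified independently of df.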

  3. Add a new column for each symbol (AAPL, FB), labeled 'range', equal to 'high' - 'low' for each day

Simply select the (ticker, info) columns, where ticker is AAPL, FB, ... and info is high, close, ..., then subtract and join the result back.

r = df.loc[:, [('AAPL', 'high'), ('FB', 'high')]].sub(df.loc[:, [('AAPL', 'low'), ('FB', 'low')]].values).rename(columns={"high": "range"})
df = df.join(r).sort_index(axis=1)
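
A variant that does not list the tickers by hand is sketched below. It takes cross-sections with xs and attaches the result with pd.concat, which is swapped in here for join; the sample frame is a hypothetical two-day excerpt of the posted data, and the variable names are assumptions.

```python
import pandas as pd

# Hypothetical two-day sample with (symbol, field) columns.
dates = pd.to_datetime(['2018-01-02', '2018-01-03'])
df = pd.DataFrame(
    {
        ('AAPL', 'high'): [172.30, 174.55],
        ('AAPL', 'low'): [169.26, 171.96],
        ('FB', 'high'): [181.58, 184.78],
        ('FB', 'low'): [177.55, 181.33],
    },
    index=dates,
)

# Cross-sections on the second column level: one column per symbol.
high = df.xs('high', axis=1, level=1)
low = df.xs('low', axis=1, level=1)

# Subtraction aligns on the symbol labels, so .values is not needed.
rng = high - low

# Restore a second level so the new columns fit the (symbol, field) layout.
rng.columns = pd.MultiIndex.from_product([rng.columns, ['range']])

# Attach the new columns and regroup them by symbol.
df = pd.concat([df, rng], axis=1).sort_index(axis=1)
print(df)
```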

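Note that you are working with MultiIndex columns, which makes every operation harder to code. You might consider switching to single-level columns by adding a new column named ticker with values such as AAPL, FB, etc.

For example, using stack + reset_index, you would get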

df2 = df.stack(level=0).reset_index().rename(columns={'level_0': 'date', 'level_1': 'ticker'}).sort_values('ticker')

    date    ticker  adj_close   close   high    low     open    range   volume
0   2018-01-02  AAPL    170.30  172.26  172.30  169.26  170.16  3.04    25555900
2   2018-01-03  AAPL    170.27  172.23  174.55  171.96  172.53  2.59    29517900
4   2018-01-04  AAPL    171.07  173.03  173.47  172.08  172.54  1.39    22434600
6   2018-01-05  AAPL    173.01  175.00  175.37  173.05  173.44  2.32    23660000
8   2018-01-08  AAPL    172.37  174.35  175.61  173.93  174.35  1.68    20567800
1   2018-01-02  FB      181.42  181.42  181.58  177.55  177.68  4.03    18151900
3   2018-01-03  FB      184.67  184.67  184.78  181.33  181.88  3.45    16886600
5   2018-01-04  FB      184.33  184.33  186.21  184.10  184.90  2.11    13880900
7   2018-01-05  FB      186.85  186.85  186.90  184.93  185.59  1.97    13574500
9   2018-01-08  FB      188.28  188.28  188.90  186.33  187.20  2.57    17994700

Then computing range, for example, is much simpler:

df2['range2'] = df2['high'] - df2['low']
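
The other two tasks from the question also become ordinary column filters in the long layout. This is a rough sketch on a hypothetical two-day sample; naming the index date and the column levels ticker and field is an assumption made here to skip the rename step, since the posted frame leaves them unnamed.

```python
import pandas as pd

# Hypothetical two-day wide frame with (ticker, field) columns.
dates = pd.to_datetime(['2018-01-02', '2018-01-03'])
wide = pd.DataFrame(
    {
        ('AAPL', 'open'): [170.16, 172.53],
        ('AAPL', 'high'): [172.30, 174.55],
        ('AAPL', 'low'): [169.26, 171.96],
        ('AAPL', 'close'): [172.26, 172.23],
        ('FB', 'open'): [177.68, 181.88],
        ('FB', 'high'): [181.58, 184.78],
        ('FB', 'low'): [177.55, 181.33],
        ('FB', 'close'): [181.42, 184.67],
    },
    index=dates,
)
# Assumed names, so reset_index produces 'date' and 'ticker' columns directly.
wide.index.name = 'date'
wide.columns.names = ['ticker', 'field']

# Move the ticker level into the rows: one row per (date, ticker).
long = wide.stack(level='ticker').reset_index()
long['range'] = long['high'] - long['low']

aapl = long[long['ticker'] == 'AAPL']

# Task 1: AAPL close on the last date.
last_close = aapl.loc[aapl['date'].idxmax(), 'close']

# Task 2: AAPL rows where close is above open.
aapl_up = aapl[aapl['close'] > aapl['open']]

print(last_close)
print(aapl_up)
```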