Collapsing multiple rows of a pandas DataFrame into a single array

Time: 2014-06-18 03:13:52

Tags: python pandas

Let's say I have a DataFrame that looks like this:

In [41]: df.columns
Out[41]: Index([u'Date Time', u'Open', u'High', u'Low', u'Last'], dtype='object')

In [42]: df
Out[42]: 
              Date Time     Open     High      Low     Last
0   12/02/2007 23:23:00  1443.75  1444.00  1443.75  1444.00
1   12/02/2007 23:25:00  1444.00  1444.00  1444.00  1444.00
2   12/02/2007 23:26:00  1444.25  1444.25  1444.25  1444.25
3   12/02/2007 23:27:00  1444.25  1444.25  1444.25  1444.25
4   12/02/2007 23:28:00  1444.25  1444.25  1444.25  1444.25
5   12/02/2007 23:29:00  1444.25  1444.25  1444.00  1444.00
6   12/02/2007 23:30:00  1444.25  1444.25  1444.00  1444.00
7   12/02/2007 23:31:00  1444.25  1444.25  1443.75  1444.00
8   12/02/2007 23:32:00  1444.00  1444.00  1443.75  1443.75
9   12/02/2007 23:33:00  1444.00  1444.00  1443.50  1443.50

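I want to create a column, keyed by the 'Date Time' of the current index, that contains the remaining columns of that row and of the n rows before it. For example, with index = 9 and n = 2, the target result would turn these rows: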

7   12/02/2007 23:31:00  1444.25  1444.25  1443.75  1444.00
8   12/02/2007 23:32:00  1444.00  1444.00  1443.75  1443.75
9   12/02/2007 23:33:00  1444.00  1444.00  1443.50  1443.50

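into a list with the following values, where item 0 is the date of row 9, items 1-4 come from row 9, items 5-8 from row 8, and items 9-12 from row 7: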

['12/02/2007 23:33:00', 1444.00, 1444.00, 1443.50, 1443.50, 1444.00, 1444.00, 1443.75, 1443.75, 1444.25, 1444.25, 1443.75, 1444.00]

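I'm sure I could easily iterate over slices of the DataFrame and build the arrays, but I was hoping there is a more efficient way.

Edit:

Here is some code that produces the result I'm looking for. Some responses suggested I look at the rolling_apply or rolling_window functions, but I couldn't figure out how they would work here.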

import pandas as pd
import numpy as np

data = pd.DataFrame([
    ['12/02/2007 23:23:00', 1443.75,  1444.00, 1443.75, 1444.00],
    ['12/02/2007 23:25:00', 1444.00,  1444.00, 1444.00, 1444.00],
    ['12/02/2007 23:26:00', 1444.25,  1444.25, 1444.25, 1444.25],
    ['12/02/2007 23:27:00', 1444.25,  1444.25, 1444.25, 1444.25],
    ['12/02/2007 23:28:00', 1444.25,  1444.25, 1444.25, 1444.25],
    ['12/02/2007 23:29:00', 1444.25,  1444.25, 1444.00, 1444.00],
    ['12/02/2007 23:30:00', 1444.25,  1444.25, 1444.00, 1444.00],
    ['12/02/2007 23:31:00', 1444.25,  1444.25, 1443.75, 1444.00],
    ['12/02/2007 23:32:00', 1444.00,  1444.00, 1443.75, 1443.75],
    ['12/02/2007 23:33:00', 1444.00,  1444.00, 1443.50, 1443.50]
])

window_size = 6

# Prime the DataFrame using the date as the index
result = pd.DataFrame(
    [data.iloc[0:window_size, 1:].values.flatten()],
    [data.iloc[window_size - 1, 0]])

for t in data.iloc[window_size:, 1:].itertuples(index=True):
    # drop the oldest row's four values and append the new row's values
    new_features = result.tail(1).iloc[:, 4:].values.flatten()
    new_features = np.append(new_features, list(t[1:]), 0)
    # label the new row with its date (t[0] is only the integer row label)
    new_df = pd.DataFrame([new_features], [data.loc[t[0], 0]])
    # DataFrame.append was removed in pandas 2.0; pd.concat replaces it
    result = pd.concat([result, new_df])

This approach isn't very fast, so I'm still interested in how to improve it.
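One way to avoid the row-by-row loop is to build all windows at once with NumPy. The sketch below is not from the original post; it assumes NumPy 1.20 or later (for sliding_window_view) and reproduces the same layout as the loop above: each output row holds window_size consecutive price rows, oldest first, labelled with the date of the newest row.

```python
import pandas as pd
from numpy.lib.stride_tricks import sliding_window_view  # assumes NumPy >= 1.20

data = pd.DataFrame([
    ['12/02/2007 23:23:00', 1443.75, 1444.00, 1443.75, 1444.00],
    ['12/02/2007 23:25:00', 1444.00, 1444.00, 1444.00, 1444.00],
    ['12/02/2007 23:26:00', 1444.25, 1444.25, 1444.25, 1444.25],
    ['12/02/2007 23:27:00', 1444.25, 1444.25, 1444.25, 1444.25],
    ['12/02/2007 23:28:00', 1444.25, 1444.25, 1444.25, 1444.25],
    ['12/02/2007 23:29:00', 1444.25, 1444.25, 1444.00, 1444.00],
    ['12/02/2007 23:30:00', 1444.25, 1444.25, 1444.00, 1444.00],
    ['12/02/2007 23:31:00', 1444.25, 1444.25, 1443.75, 1444.00],
    ['12/02/2007 23:32:00', 1444.00, 1444.00, 1443.75, 1443.75],
    ['12/02/2007 23:33:00', 1444.00, 1444.00, 1443.50, 1443.50],
])

window_size = 6

# Price columns only, as a (rows, 4) float array
prices = data.iloc[:, 1:].to_numpy(dtype=float)

# Windows along the row axis; the window dimension is appended last,
# so the shape is (n_windows, 4, window_size)
windows = sliding_window_view(prices, window_size, axis=0)

# Put the window axis before the column axis, then flatten each window row by row
flat = windows.transpose(0, 2, 1).reshape(len(windows), -1)

# Label each flattened window with the date of its newest row
result = pd.DataFrame(flat, index=data.iloc[window_size - 1:, 0].to_numpy())
```

The reshape copies the data, so memory use grows with window_size times the number of rows; for very long series that trade-off may matter.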

2 answers:

Answer 0: (score: 0)

This simple function works for me:

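import itertools

def collapse(df, index_loc, number):
    # Chain the values of rows index_loc - number .. index_loc into one flat list
    # (range replaces Python 2's xrange)
    rows = (list(df.loc[x].values) for x in range(index_loc - number, index_loc + 1))
    return list(itertools.chain.from_iterable(rows))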

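where df is your DataFrame, index_loc is the index to start from (assuming an integer index, as in your example), and number is your n. It simply uses .values to get the DataFrame's values at each index and then chains the lists together.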
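Note that collapse as written returns the rows oldest first and repeats the date for every row, whereas the question asks for the date once, followed by the prices newest first. A minimal variant that matches the question's target list could look like the sketch below; collapse_rows is a hypothetical name, and it assumes the window never reaches past the first row of the frame.

```python
import pandas as pd

# Rows 7-9 of the example frame from the question
df = pd.DataFrame(
    [
        ['12/02/2007 23:31:00', 1444.25, 1444.25, 1443.75, 1444.00],
        ['12/02/2007 23:32:00', 1444.00, 1444.00, 1443.75, 1443.75],
        ['12/02/2007 23:33:00', 1444.00, 1444.00, 1443.50, 1443.50],
    ],
    columns=['Date Time', 'Open', 'High', 'Low', 'Last'],
    index=[7, 8, 9],
)


def collapse_rows(df, idx, n):
    """Hypothetical helper: date of row idx, then prices of rows idx, idx-1, ..., idx-n."""
    pos = df.index.get_loc(idx)             # positional location of the current row
    prices = df.drop(columns='Date Time')   # every column except the date
    window = prices.iloc[pos - n:pos + 1]   # rows idx-n .. idx, oldest first
    newest_first = window.iloc[::-1].to_numpy().ravel().tolist()
    return [df.loc[idx, 'Date Time']] + newest_first


row = collapse_rows(df, idx=9, n=2)
```

Here row starts with the date of row 9, followed by the four prices of rows 9, 8, and 7 in that order.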

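Answer 1: (score: 0)

Here is some code that produces the result I'm looking for. Some responses suggested I look at the rolling_apply or rolling_window functions, but I couldn't figure out how they would work here.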

import pandas as pd
import numpy as np

data = pd.DataFrame([
    ['12/02/2007 23:23:00', 1443.75,  1444.00, 1443.75, 1444.00],
    ['12/02/2007 23:25:00', 1444.00,  1444.00, 1444.00, 1444.00],
    ['12/02/2007 23:26:00', 1444.25,  1444.25, 1444.25, 1444.25],
    ['12/02/2007 23:27:00', 1444.25,  1444.25, 1444.25, 1444.25],
    ['12/02/2007 23:28:00', 1444.25,  1444.25, 1444.25, 1444.25],
    ['12/02/2007 23:29:00', 1444.25,  1444.25, 1444.00, 1444.00],
    ['12/02/2007 23:30:00', 1444.25,  1444.25, 1444.00, 1444.00],
    ['12/02/2007 23:31:00', 1444.25,  1444.25, 1443.75, 1444.00],
    ['12/02/2007 23:32:00', 1444.00,  1444.00, 1443.75, 1443.75],
    ['12/02/2007 23:33:00', 1444.00,  1444.00, 1443.50, 1443.50]
])

window_size = 6

# Prime the DataFrame using the date as the index
result = pd.DataFrame(
    [data.iloc[0:window_size, 1:].values.flatten()],
    [data.iloc[window_size - 1, 0]])

for t in data.iloc[window_size:, 1:].itertuples(index=True):
    # drop the oldest row's four values and append the new row's values
    new_features = result.tail(1).iloc[:, 4:].values.flatten()
    new_features = np.append(new_features, list(t[1:]), 0)
    # label the new row with its date (t[0] is only the integer row label)
    new_df = pd.DataFrame([new_features], [data.loc[t[0], 0]])
    # DataFrame.append was removed in pandas 2.0; pd.concat replaces it
    result = pd.concat([result, new_df])

This is probably not very efficient, but it solves the problem.
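If the newest-first layout from the original question is wanted for every row at once, a pandas-only alternative is to concatenate shifted copies of the price columns. This is a sketch under stated assumptions, not part of the answer above: the _lagK column suffixes are made up for illustration, and the dates are assumed to be unique so they can serve as the index.

```python
import pandas as pd

# Rows 6-9 of the example frame, with named columns
df = pd.DataFrame(
    [
        ['12/02/2007 23:30:00', 1444.25, 1444.25, 1444.00, 1444.00],
        ['12/02/2007 23:31:00', 1444.25, 1444.25, 1443.75, 1444.00],
        ['12/02/2007 23:32:00', 1444.00, 1444.00, 1443.75, 1443.75],
        ['12/02/2007 23:33:00', 1444.00, 1444.00, 1443.50, 1443.50],
    ],
    columns=['Date Time', 'Open', 'High', 'Low', 'Last'],
)

n = 2  # number of previous rows to include

prices = df.set_index('Date Time')

# shift(k) moves every row down k positions, so row i holds the values of row i - k
lagged = [prices.shift(k).add_suffix(f'_lag{k}') for k in range(n + 1)]

# Side by side: current row first, then 1 row back, then 2 rows back;
# the first n rows lack a full history and are dropped
wide = pd.concat(lagged, axis=1).dropna()
```

Each row of wide is then indexed by its date and holds the current prices followed by those of the previous n rows, which is the list from the question without the leading date.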