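I am designing a function similar to GatherBy in Mathematica. I think it can be done easily by wrapping the groupby functionality in pandas. The function should group a list by a given characteristic function.
Setup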
import datetime
import numpy as np
import pandas as pd
time1 = pd.date_range(start=datetime.datetime(2015, 1, 30), end=datetime.datetime(2015, 2, 5))
datedat = np.array([time1, 0.1 * np.arange(7), 0.2 * np.arange(7)]).T
print(datedat)
array([[Timestamp('2015-01-30 00:00:00', freq='D'), 0.0, 0.0],
[Timestamp('2015-01-31 00:00:00', freq='D'), 0.1, 0.2],
[Timestamp('2015-02-01 00:00:00', freq='D'), 0.2, 0.4],
[Timestamp('2015-02-02 00:00:00', freq='D'), 0.3, 0.6],
[Timestamp('2015-02-03 00:00:00', freq='D'), 0.4, 0.8],
[Timestamp('2015-02-04 00:00:00', freq='D'), 0.5, 1.0],
[Timestamp('2015-02-05 00:00:00', freq='D'), 0.6, 1.2]], dtype=object)
Suppose I want to group it by year and month - you can see there is data for both January and February. So I designed a characteristic function:
gatherf = lambda x: ((x[0].year)*1000+x[0].month)
For each time record, gatherf computes the value that groupby should use to tell the periods apart.
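To make the key concrete, here is a minimal sketch of what gatherf returns for one January row and one February row. The two rows are hypothetical stand-ins shaped like the rows of datedat, not taken from the array itself.

```python
import pandas as pd

# Same key function as above: year * 1000 + month.
gatherf = lambda x: x[0].year * 1000 + x[0].month

# Two hypothetical rows shaped like the rows of datedat.
jan_row = [pd.Timestamp("2015-01-30"), 0.0, 0.0]
feb_row = [pd.Timestamp("2015-02-01"), 0.2, 0.4]

jan_key = gatherf(jan_row)  # 2015 * 1000 + 1
feb_key = gatherf(feb_row)  # 2015 * 1000 + 2
print(jan_key, feb_key)     # 2015001 2015002
```

Rows from the same year and month get the same key, so they end up in the same group.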
Goal
My ultimate goal is to develop a function gather_by such that
gather_by(datedat, gatherf)
produces this:
array([[[Timestamp('2015-01-30 00:00:00', freq='D'), 0.0, 0.0],
[Timestamp('2015-01-31 00:00:00', freq='D'), 0.1, 0.2]],
[[Timestamp('2015-02-01 00:00:00', freq='D'), 0.2, 0.4],
[Timestamp('2015-02-02 00:00:00', freq='D'), 0.3, 0.6],
[Timestamp('2015-02-03 00:00:00', freq='D'), 0.4, 0.8],
[Timestamp('2015-02-04 00:00:00', freq='D'), 0.5, 1.0],
[Timestamp('2015-02-05 00:00:00', freq='D'), 0.6, 1.2]]], dtype=object)
My attempt
In the general case, datedat may have more than 3 columns, so I cannot group them one by one. So I tried:
datedatF2 = pd.DataFrame({'dat': datedat, 'gather_key': np.array(list(map(gatherf, datedat)))})
and
groupedall = datedatF2['dat'].groupby(datedatF2['gather_key'])
But this fails with the error "Data must be 1-dimensional". What should I do?
Answer 0 (score: 2)
Input -
datedat
array([[Timestamp('2015-01-30 00:00:00', freq='D'), 0.0, 0.0],
[Timestamp('2015-01-31 00:00:00', freq='D'), 0.1, 0.2],
[Timestamp('2015-02-01 00:00:00', freq='D'), 0.2, 0.4],
[Timestamp('2015-02-02 00:00:00', freq='D'), 0.3, 0.6],
[Timestamp('2015-02-03 00:00:00', freq='D'), 0.4, 0.8],
[Timestamp('2015-02-04 00:00:00', freq='D'), 0.5, 1.0],
[Timestamp('2015-02-05 00:00:00', freq='D'), 0.6, 1.2]], dtype=object)
gatherf
lambda x: ((x[0].year) * 1000 + x[0].month)
A fairly robust way to group, building on your current approach, is to pass a custom list of keys to groupby (the grouping keys need not belong to the DataFrame!) -
key = list(map(gatherf, datedat))
r = []
for _, g in pd.DataFrame(datedat).groupby(key):
    r.append(g.values.tolist())
Or, as a list comprehension -
r = [g.values.tolist() for _, g in pd.DataFrame(datedat).groupby(key)]
np.array(r)
[[[Timestamp('2015-01-30 00:00:00', freq='D'), 0.0, 0.0],
[Timestamp('2015-01-31 00:00:00', freq='D'), 0.1, 0.2]],
[[Timestamp('2015-02-01 00:00:00', freq='D'), 0.2, 0.4],
[Timestamp('2015-02-02 00:00:00', freq='D'), 0.3, 0.6],
[Timestamp('2015-02-03 00:00:00', freq='D'), 0.4, 0.8],
[Timestamp('2015-02-04 00:00:00', freq='D'), 0.5, 1.0],
[Timestamp('2015-02-05 00:00:00', freq='D'), 0.6, 1.2]]]
This should work for any number of columns, as long as gatherf is written to match.
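The approach above can be wrapped into the gather_by function the question asks for. This is only a sketch under a few assumptions: the docstring, the sample rows, and wrapping the keys in a Series (instead of passing a plain list) are choices made here for illustration, not part of the original answer.

```python
import pandas as pd


def gather_by(data, key_func):
    """Group the rows of a 2-D array (or list of rows) by key_func(row).

    Returns a list of groups, each a list of rows. Groups come back in
    sorted key order, which is the default behaviour of groupby.
    """
    frame = pd.DataFrame(data)
    # Wrapping the keys in a Series makes it unambiguous that these are
    # per-row keys rather than column labels.
    keys = pd.Series([key_func(row) for row in data], index=frame.index)
    return [group.values.tolist() for _, group in frame.groupby(keys)]


gatherf = lambda x: x[0].year * 1000 + x[0].month

# Hypothetical sample rows shaped like datedat: a Timestamp plus two floats.
dates = pd.date_range("2015-01-30", "2015-02-05")
rows = [[d, 0.1 * i, 0.2 * i] for i, d in enumerate(dates)]

groups = gather_by(rows, gatherf)
print(len(groups), [len(g) for g in groups])
```

With the seven sample days this gives one group for January and one for February. The result is a plain nested list; wrap it in np.array only if a ragged object array is really what you need.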
Answer 1 (score: 1)
I think you can use groupby with gatherf mapped over the rows:
datedatF2 = pd.DataFrame(datedat)
gatherf = lambda x: x[0].year*1000 + x[0].month
out = [x.values.tolist() for i, x in datedatF2.groupby(list(map(gatherf, datedat)))]
print(out)
[[[Timestamp('2015-01-30 00:00:00', freq='D'), 0.0, 0.0],
[Timestamp('2015-01-31 00:00:00', freq='D'), 0.1, 0.2]],
[[Timestamp('2015-02-01 00:00:00', freq='D'), 0.2, 0.4],
[Timestamp('2015-02-02 00:00:00', freq='D'), 0.3, 0.6],
[Timestamp('2015-02-03 00:00:00', freq='D'), 0.4, 0.8],
[Timestamp('2015-02-04 00:00:00', freq='D'), 0.5, 1.0],
[Timestamp('2015-02-05 00:00:00', freq='D'), 0.6, 1.2]]]
An alternative solution that builds the grouping key as a Series:
datedatF2 = pd.DataFrame(datedat)
dates = pd.to_datetime(datedatF2.iloc[:, 0])
s = dates.dt.year*1000 + dates.dt.month
print(s)
0 2015001
1 2015001
2 2015002
3 2015002
4 2015002
5 2015002
6 2015002
Name: dat0, dtype: int64
out = [x.values.tolist() for i, x in datedatF2.groupby(s)]
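As a quick sanity check, the vectorized key should match what gatherf computes row by row. The sketch below compares the two on a standalone date column; the names dates, s and rowwise are illustrative and independent of datedatF2.

```python
import pandas as pd

dates = pd.Series(pd.date_range("2015-01-30", "2015-02-05"))

# Vectorized key, computed on the whole column at once via the .dt accessor.
s = dates.dt.year * 1000 + dates.dt.month

# Row-wise key, computed one Timestamp at a time, as gatherf does.
rowwise = [d.year * 1000 + d.month for d in dates]

same = s.tolist() == rowwise
print(same)
```

Both variants yield the same keys, so they produce the same groups; the difference is only in how the keys are computed.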
Edit:
The Series-based approach is faster:
N = 100000
df = pd.DataFrame({1:pd.date_range('2015-01-01', periods=N, freq='15H'),
2:np.random.randint(100, size=N),
3:np.random.randint(100, size=N)})
datedat = df.values
In [75]: %%timeit
...: datedatF2 = pd.DataFrame(datedat)
...: dates = pd.to_datetime(datedatF2.iloc[:, 0])
...: s = dates.dt.year*1000 + dates.dt.month
...: out = [x.values.tolist() for i, x in datedatF2.groupby(s)]
...:
1 loop, best of 3: 249 ms per loop
In [76]: %%timeit
...: datedatF2 = pd.DataFrame(datedat)
...: gatherf = lambda x: x[0].year*1000 + x[0].month
...: out = [x.values.tolist() for i, x in datedatF2.groupby(list(map(gatherf, datedat)))]
...:
1 loop, best of 3: 359 ms per loop
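Caveat:
Performance depends on the data: the size of the DataFrame and the number of groups. In general, though, the Series-based solution is faster than the map-based one.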