I have a pandas DataFrame like this:
import numpy as np
import pandas as pd
np.random.seed(1234)
midx = pd.MultiIndex.from_product([['a', 'b', 'c'], pd.date_range('20130101', periods=6)], names=['letter', 'date'])
df = pd.DataFrame(np.random.randn(len(midx), 1), index=midx)
The DataFrame looks like this:
                          0
letter date
a      2013-01-01  0.471435
       2013-01-02 -1.190976
       2013-01-03  1.432707
       2013-01-04 -0.312652
       2013-01-05 -0.720589
       2013-01-06  0.887163
b      2013-01-01  0.859588
       2013-01-02 -0.636524
       2013-01-03  0.015696
       2013-01-04 -2.242685
       2013-01-05  1.150036
       2013-01-06  0.991946
c      2013-01-01  0.953324
       2013-01-02 -2.021255
       2013-01-03 -0.334077
       2013-01-04  0.002118
       2013-01-05  0.405453
       2013-01-06  0.289092
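What I want to do is keep only the rows that satisfy a condition on the date, where the condition depends on the letter. For example, all of this information could be stored in a dictionary:
dictionary = {"a": slice("20130102", "20130105"),
              "b": "20130103",
              "c": slice("20130103", "20130105")}
Is there a simple way to do this with pandas? I could not find any information on this kind of filtering.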
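Answer 0 (score: 5)
You can use query, which is designed for this kind of selection criteria.
If you modify dictionary slightly so that every value is a (start, stop) tuple, you can generate the required query string with a generator expression: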
In : dictionary
Out:
{'a': ('20130102', '20130105'),
 'b': ('20130103', '20130103'),
 'c': ('20130103', '20130105')}
In : df.query(
         ' or '.join("('{}' <= date <= '{}' and letter == '{}')".format(*(v + (k,)))
                     for k, v in dictionary.items())
     )
Out:
                          0
letter date
a      2013-01-02 -1.190976
       2013-01-03  1.432707
       2013-01-04 -0.312652
       2013-01-05 -0.720589
b      2013-01-03  0.015696
c      2013-01-03 -0.334077
       2013-01-04  0.002118
       2013-01-05  0.405453
To see what the query actually evaluates, look at the string that the generator expression produces:
In : (' or '.join("('{}' <= date <= '{}' and letter == '{}')".format(*(v + (k,)))
for k, v in dictionary.items()))
Out: "('20130102' <= date <= '20130105' and letter == 'a') or
('20130103' <= date <= '20130105' and letter == 'c') or
('20130103' <= date <= '20130103' and letter == 'b')"
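As a rough illustration of what that query string expresses, the same selection can be sketched as an explicit boolean mask over the index levels. This is not part of the answer above; the names bounds, letters, dates, mask and filtered are made up for the example, and the sketch assumes every entry is a (start, stop) tuple.

```python
import numpy as np
import pandas as pd

# Rebuild the DataFrame from the question.
np.random.seed(1234)
midx = pd.MultiIndex.from_product(
    [["a", "b", "c"], pd.date_range("20130101", periods=6)],
    names=["letter", "date"],
)
df = pd.DataFrame(np.random.randn(len(midx), 1), index=midx)

# Hypothetical name: per-letter (start, stop) bounds, both inclusive.
bounds = {
    "a": ("20130102", "20130105"),
    "b": ("20130103", "20130103"),
    "c": ("20130103", "20130105"),
}

letters = df.index.get_level_values("letter")
dates = df.index.get_level_values("date")

# OR together one "letter matches and date is in range" condition per entry,
# mirroring the ' or '.join(...) in the query string.
mask = np.zeros(len(df), dtype=bool)
for letter, (start, stop) in bounds.items():
    in_range = (dates >= pd.Timestamp(start)) & (dates <= pd.Timestamp(stop))
    mask |= (letters == letter) & in_range

filtered = df[mask]
print(filtered)
```

The mask version is more verbose than query, but it makes the per-letter condition explicit and avoids building a string.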
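Answer 1 (score: 2)
This is a somewhat roundabout way of doing it, but you can rely on the fact that passing a list of labels or tuples to .loc works much like reindexing [source], and use pd.Index.slice_indexer(start, stop), which lets you restrict each index to the range between the specified dates.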
>>> dictionary = {"a": ("20130102", "20130105"),
...               "b": "20130103",
...               "c": ("20130103", "20130105")}
...
...
... def get_idx_pairs():
...     for lvl0, lvl1 in df.index.groupby(df.index.get_level_values(0)).items():
...         dates = lvl1.levels[1]
...         dt = dictionary[lvl0]
...         if isinstance(dt, (tuple, list)):
...             slices = dates[dates.slice_indexer(dt[0], dt[1])]
...             for s in slices:
...                 yield (lvl0, s)
...         else:
...             # convert to a Timestamp so the label matches the datetime level
...             yield (lvl0, pd.Timestamp(dt))
...
...
... df.loc[list(get_idx_pairs())]
...
                        0
letter date
a      2013-01-02 -1.1910
       2013-01-03  1.4327
       2013-01-04 -0.3127
       2013-01-05 -0.7206
b      2013-01-03  0.0157
c      2013-01-03 -0.3341
       2013-01-04  0.0021
       2013-01-05  0.4055
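For each "smaller" DatetimeIndex within date, you restrict it to the specified slice and then build the (letter, date) tuples that are indexed explicitly.
Alternatively, if you can specify every date range as a tuple (for a single date, just repeat it), you can condense the helper function a bit: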
>>> dates = (("20130102", "20130105"),
...          ("20130103", "20130103"),
...          ("20130103", "20130105"))
...
... def get_idx_pairs(df, dates):
...     letters = df.index.get_level_values(0)
...     for (k, v), (start, stop) in zip(df.index.groupby(letters).items(), dates):
...         level_dates = v.levels[1]  # separate name, so the `dates` argument is not shadowed
...         sliced = level_dates[level_dates.slice_indexer(start, stop)]
...         for s in sliced:
...             yield k, s
...
... df.loc[list(get_idx_pairs(df, dates))]
...
                        0
letter date
a      2013-01-02 -1.1910
       2013-01-03  1.4327
       2013-01-04 -0.3127
       2013-01-05 -0.7206
b      2013-01-03  0.0157
c      2013-01-03 -0.3341
       2013-01-04  0.0021
       2013-01-05  0.4055
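To make the role of slice_indexer more concrete, here is a minimal sketch on a standalone DatetimeIndex. The variable names are illustrative only; the point is that the method turns a pair of labels into a positional slice whose end label is included.

```python
import pandas as pd

# The same six daily dates as the `date` level in the question.
dates = pd.date_range("20130101", periods=6)

# slice_indexer maps the start/stop labels to a positional slice;
# label-based slicing includes both endpoints.
indexer = dates.slice_indexer("20130102", "20130105")
selected = dates[indexer]

print(indexer)
print(selected)
```

Indexing the DatetimeIndex with the returned slice yields the dates from 2013-01-02 through 2013-01-05, which is what the helper functions above pair with each letter.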
Answer 2 (score: 1)
With a small change to the original dictionary we can do this more concisely: use pd.IndexSlice in a list comprehension, then combine the pieces with pd.concat:
from collections import OrderedDict

# add `-` to separate dates
dictionary = {"a": slice("2013-01-02", "2013-01-05"),
              "b": "2013-01-03",
              "c": slice("2013-01-03", "2013-01-05")}
# sort by letter so the concatenated result comes out in a, b, c order
dictionary = OrderedDict(sorted(dictionary.items()))
idx_slices = [pd.IndexSlice[k, v] for k, v in dictionary.items()]
pd.concat([df.loc[idx, :] for idx in idx_slices])
Out[1]:
                          0
letter date
a      2013-01-02 -1.190976
       2013-01-03  1.432707
       2013-01-04 -0.312652
       2013-01-05 -0.720589
b      2013-01-03  0.015696
c      2013-01-03 -0.334077
       2013-01-04  0.002118
       2013-01-05  0.405453
If you want the - separators added automatically, you can use datetime, like this:
import datetime as dt
dt.datetime.strptime('20170121', '%Y%m%d').strftime('%Y-%m-%d')
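One possible way to apply that conversion to the whole dictionary from the question is sketched below. The helpers add_dashes and normalize and the variable raw are hypothetical names introduced for this example, and the sketch assumes each value is either a compact date string or a slice of two such strings.

```python
import datetime as dt


def add_dashes(value):
    """Hypothetical helper: turn '20130102' into '2013-01-02'."""
    return dt.datetime.strptime(value, "%Y%m%d").strftime("%Y-%m-%d")


def normalize(spec):
    """Hypothetical helper: convert a single date string or a slice of two."""
    if isinstance(spec, slice):
        return slice(add_dashes(spec.start), add_dashes(spec.stop))
    return add_dashes(spec)


# The dictionary as written in the question, with compact date strings.
raw = {
    "a": slice("20130102", "20130105"),
    "b": "20130103",
    "c": slice("20130103", "20130105"),
}

dictionary = {letter: normalize(spec) for letter, spec in raw.items()}
print(dictionary)
```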
Answer 3 (score: 1)
A simple way is to apply a function to the pandas DataFrameGroupBy object. Here is an example:
dictionary = {"a": slice("20130102", "20130105"),
              "b": slice("20130103", "20130103"),
              "c": slice("20130103", "20130105")}
def date_condition(group, dictionary):
    return group.xs(group.name).loc[dictionary[group.name]]
df.groupby(level=0).apply(date_condition, dictionary)
Output[0]:
                          0
letter date
a      2013-01-02 -1.190976
       2013-01-03  1.432707
       2013-01-04 -0.312652
       2013-01-05 -0.720589
b      2013-01-03  0.015696
c      2013-01-03 -0.334077
       2013-01-04  0.002118
       2013-01-05  0.405453
Note that the date for "b" is repeated (as a slice from "20130103" to "20130103") to force .loc to return a DataFrame rather than a Series.
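The difference between the two forms can be seen on a small frame with a plain DatetimeIndex, similar to what group.xs(group.name) produces. This is only a sketch: frame, single and ranged are illustrative names, and the scalar lookup uses pd.Timestamp rather than a string so the example does not depend on how string labels are resolved.

```python
import numpy as np
import pandas as pd

# A small frame indexed by date only.
frame = pd.DataFrame(
    {0: np.arange(6.0)},
    index=pd.date_range("20130101", periods=6, name="date"),
)

# A scalar label selects one row and returns it as a Series.
single = frame.loc[pd.Timestamp("2013-01-03")]

# A slice with identical endpoints still returns a DataFrame with one row.
ranged = frame.loc[slice("20130103", "20130103")]

print(type(single).__name__)
print(type(ranged).__name__)
```

Keeping every group's result a DataFrame is what lets groupby(...).apply(...) stitch the pieces back together with a consistent (letter, date) index.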