Split data in half based on date

Asked: 2017-03-06 15:03:23

Tags: python pandas numpy

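I want to split my data into two halves. For the sample data below, the result should be two separate DataFrames: one containing the first 50% of each year, the other containing the remaining half. As an additional condition, the 50% split must be computed per value of the column 'LG'.

Can anyone help me?

Sample data: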

import pandas as pd
import numpy as np

df = pd.DataFrame(
    {'LG' : ('AR1', 'AR1', 'AR1', 'AR1', 'AR1', 'AR1', 'PO1',  'PO1', 'AR1', 'AR1', 'PO1', 'PO1'),
     'Date': ('2011-1-1', '2011-3-1',  '2011-4-1', '2011-2-1', '2012-1-1', '2012-2-1', '2012-1-1', '2012-2-1', '2013-1-1', '2013-2-1', '2013-1-1', '2013-2-1'),
     'Year': (2011, 2011, 2011, 2011, 2012, 2012, 2012, 2012, 2013, 2013, 2013, 2013)})

# to_datetime returns a new Series, so assign it back to the column
df['Date'] = pd.to_datetime(df['Date'])

DF:

         Date   LG  Year
0  2011-01-01  AR1  2011
1  2011-03-01  AR1  2011
2  2011-04-01  AR1  2011
3  2011-02-01  AR1  2011
4  2012-01-01  AR1  2012
5  2012-02-01  AR1  2012
6  2012-01-01  PO1  2012
7  2012-02-01  PO1  2012
8  2013-01-01  AR1  2013
9  2013-02-01  AR1  2013
10 2013-01-01  PO1  2013
11 2013-02-01  PO1  2013

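1 answer:

Answer 0: (score: 1)

After grouping on Year and LG, split the frame in half. The basic idea is to find, within each group, the rows whose position is below 50% of the group size.

Code: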
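To make the rule concrete, here is a small sketch (not part of the original answer) that prints the ratio of within-group position to group size for groups of different sizes. The toy frame and the column name `g` are made up for illustration. One thing it makes visible: with an odd-sized group, the extra row lands in the top half.

```python
import pandas as pd

# toy frame (hypothetical data): group 'a' has 4 rows, 'b' has 3, 'c' has 2
toy = pd.DataFrame({'g': list('aaaabbbcc')})

grouped = toy.groupby('g')
pos = grouped.cumcount()                # 0-based position within each group
size = grouped['g'].transform('size')   # group size repeated on every row

toy['ratio'] = pos / size
toy['top_half'] = toy['ratio'] < 0.5    # True for the first half of each group

print(toy)
```

For the group of three, the ratios are 0, 1/3 and 2/3, so two rows count as the top half and one as the bottom half.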

# group by 'Year' and 'LG'
idx = ['Year', 'LG']

# build a grouper
group_by = df.groupby(idx)

# size of each row's group, aligned to the rows of df
# (in recent pandas, groupby(..., as_index=False).size() returns a
# DataFrame and can no longer be assigned to a single column)
g_size = group_by['Date'].transform('size')

# find the rows in the top half of their respective group
top_half = (group_by.cumcount() / g_size).values < 0.5

# build new data frames
top = df.loc[top_half]
bot = df.loc[~top_half]

Code with date sorting:

If the frame needs to be sorted by date before splitting, but the sort should not be applied to the original DataFrame...

# group by 'Year' and 'LG'
idx = ['Year', 'LG']

# sort by date into a new frame; the original df is left untouched
# (DataFrame.sort was removed in pandas 0.20, use sort_values instead;
# mergesort is stable, so rows with equal dates keep their order)
df1 = df.sort_values('Date', kind='mergesort')

# build a grouper
group_by = df1.groupby(idx)

# size of each row's group, aligned to the rows of df1
g_size = group_by['Date'].transform('size')

# find the rows in the top half of their respective group
top_half = (group_by.cumcount() / g_size).values < 0.5

# build new data frames with columns Year, LG, Date and a fresh index
top = df1.loc[top_half, idx + ['Date']].reset_index(drop=True)
bot = df1.loc[~top_half, idx + ['Date']].reset_index(drop=True)
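The two variants above can be folded into one helper. The following is only a sketch under assumptions: the function name `split_group_halves` and its parameters are hypothetical and not from the original answer, and it keeps the original row index instead of resetting it so that the rows can be traced back to `df`.

```python
import pandas as pd


def split_group_halves(df, keys, sort_by=None):
    """Return (top, bot): the first and second half of every group in `keys`."""
    # optionally sort a copy; mergesort is stable, so ties keep their order
    work = df.sort_values(sort_by, kind='mergesort') if sort_by else df
    grouped = work.groupby(keys)
    pos = grouped.cumcount()                     # position inside the group
    size = grouped[keys[0]].transform('size')    # group size on every row
    in_top = (pos / size) < 0.5
    return work[in_top], work[~in_top]


df = pd.DataFrame({
    'LG': ('AR1', 'AR1', 'AR1', 'AR1', 'AR1', 'AR1',
           'PO1', 'PO1', 'AR1', 'AR1', 'PO1', 'PO1'),
    'Date': ('2011-1-1', '2011-3-1', '2011-4-1', '2011-2-1', '2012-1-1',
             '2012-2-1', '2012-1-1', '2012-2-1', '2013-1-1', '2013-2-1',
             '2013-1-1', '2013-2-1'),
    'Year': (2011, 2011, 2011, 2011, 2012, 2012, 2012, 2012, 2013,
             2013, 2013, 2013)
})
df['Date'] = pd.to_datetime(df['Date'])  # real dates, so sorting is chronological

top, bot = split_group_halves(df, ['Year', 'LG'], sort_by='Date')
print(top)
print(bot)
```

Calling it with `sort_by=None` gives the unsorted variant, which splits each group by the order in which the rows appear in `df`.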

Test code:

print(df)
print('-- top')
print(top)
print('-- bot')
print(bot)
print('--')

Sorted results:

        Date   LG  Year
0   2011-1-1  AR1  2011
1   2011-3-1  AR1  2011
2   2011-4-1  AR1  2011
3   2011-2-1  AR1  2011
4   2012-1-1  AR1  2012
5   2012-2-1  AR1  2012
6   2012-1-1  PO1  2012
7   2012-2-1  PO1  2012
8   2013-1-1  AR1  2013
9   2013-2-1  AR1  2013
10  2013-1-1  PO1  2013
11  2013-2-1  PO1  2013
-- top
   Year   LG      Date
0  2011  AR1  2011-1-1
1  2011  AR1  2011-2-1
2  2012  AR1  2012-1-1
3  2012  PO1  2012-1-1
4  2013  AR1  2013-1-1
5  2013  PO1  2013-1-1
-- bot
   Year   LG      Date
0  2011  AR1  2011-3-1
1  2011  AR1  2011-4-1
2  2012  AR1  2012-2-1
3  2012  PO1  2012-2-1
4  2013  AR1  2013-2-1
5  2013  PO1  2013-2-1

Test data:

df = pd.DataFrame({
    'LG': ('AR1', 'AR1', 'AR1', 'AR1', 'AR1', 'AR1',
           'PO1', 'PO1', 'AR1', 'AR1', 'PO1', 'PO1'),
    'Date': ('2011-1-1', '2011-3-1', '2011-4-1', '2011-2-1', '2012-1-1',
             '2012-2-1', '2012-1-1', '2012-2-1', '2013-1-1', '2013-2-1',
             '2013-1-1', '2013-2-1'),
    'Year': (2011, 2011, 2011, 2011, 2012, 2012, 2012, 2012, 2013,
             2013, 2013, 2013)
})
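# Note: the result is not assigned back, so 'Date' stays a string column
# (which is why the output above prints '2011-1-1'). To sort chronologically,
# use df['Date'] = pd.to_datetime(df['Date']) instead.
pd.to_datetime(df['Date'])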