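Vectorized stacking of data given alternating date and value columns, with each value lying between two date columns

Asked: 2018-03-28 15:08:33

Tags: python pandas numpy dataframe

My records look like this. Each consumption value sits between two date columns: for example, the consumption from 1-Aug-17 (column 13) to 1-Sep-17 (column 11) is 6.1 (column 12), and so on.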

INSTALL METER_NO    Field3  Field4  Field5  Field6  Field7  Field8  Field9  Field10 Field11 Field12 Field13
80000000    19151882    1-Jan-18    5.6 1-Dec-17    7.9 1-Nov-17    5.5 1-Oct-17    4.4 1-Sep-17    6.1 1-Aug-17
80000001    31692087    1-Jan-18    55.5    1-Dec-17    62.7    1-Nov-17    2.2 1-Oct-17    2   1-Sep-17    9.3 1-Aug-17
80000003    MISSING 1-Jan-18    0   1-Dec-17    0   1-Nov-17    0   1-Oct-17    0   1-Sep-17    0   1-Aug-17
80000004    98914998    1-Jan-18    8.6 1-Dec-17    19.4    1-Nov-17    7.5 1-Oct-17    5.4 1-Sep-17    6.8 1-Aug-17
80000005    48962501    1-Jan-18    1   1-Dec-17    1.3 1-Nov-17    1.8 1-Oct-17    1.7 1-Sep-17    2.7 1-Aug-17
80000006    14954563    1-Jan-18    0   1-Dec-17    0   1-Nov-17    0   1-Oct-17    0   1-Sep-17    0   1-Aug-17

I am trying to get them into this format:

Install Meter_NO    From    To  Consumption
80000000    19151882    8/1/2017    9/1/2017    6.1
80000000    19151882    9/1/2017    10/1/2017   4.4
80000000    19151882    10/1/2017   11/1/2017   5.5
80000000    19151882    11/1/2017   12/1/2017   7.9
80000000    19151882    12/1/2017   1/1/2018    5.6
....

Is there a way to do this without iterating over the DataFrame?

1 Answer:

Answer 0: (score: 0)

Consider concatenating column subsets, one for each pair of date and value fields. The FROM and TO dates need some wrangling.

Data

import pandas as pd
from io import StringIO

txt = """
INSTALL METER_NO    Field3  Field4  Field5  Field6  Field7  Field8  Field9  Field10 Field11 Field12 Field13
80000000    19151882    "1-Jan-18"    5.6 "1-Dec-17"    7.9 "1-Nov-17"    5.5 "1-Oct-17"    4.4 "1-Sep-17"    6.1 "1-Aug-17"
80000001    31692087    "1-Jan-18"    55.5    "1-Dec-17"    62.7    "1-Nov-17"    2.2 "1-Oct-17"    2   "1-Sep-17"    9.3 "1-Aug-17"
80000003    MISSING "1-Jan-18"    0   "1-Dec-17"    0   "1-Nov-17"    0   "1-Oct-17"    0   "1-Sep-17"    0  "1-Aug-17"
80000004    98914998   "1-Jan-18"    8.6 "1-Dec-17"    19.4    "1-Nov-17"    7.5 "1-Oct-17"    5.4 "1-Sep-17"    6.8 "1-Aug-17"
80000005    48962501   "1-Jan-18"    1   "1-Dec-17"    1.3 "1-Nov-17"    1.8 "1-Oct-17"    1.7 "1-Sep-17"    2.7 "1-Aug-17"
80000006    14954563   "1-Jan-18"    0   "1-Dec-17"    0   "1-Nov-17"    0   "1-Oct-17"    0   "1-Sep-17"    0   "1-Aug-17"
"""

# Raw string avoids the invalid "\s" escape warning in newer Python versions
df = pd.read_table(StringIO(txt), sep=r"\s+")

Code

from datetime import timedelta, datetime

needed_cols = list(range(3, len(df.columns), 2))

df_list = []

# BUILD DF SUBSETS
for n in needed_cols:
    # Two ID columns plus one (date, value) pair; copy to avoid SettingWithCopyWarning
    tmp = df[df.columns[[0,1]+[n-1, n]]].copy()
    tmp.columns = ['INSTALL', 'METER_NO', 'FROM', 'CONSUMPTION']

    tmp['FROM'] = pd.to_datetime(tmp['FROM'], format='%d-%b-%y')
    # TO = last day of the FROM month (first day of the next month minus one day)
    tmp['TO'] = tmp['FROM'].apply(lambda x: datetime(x.year, x.month + 1, 1)
                                  if x.month < 12 else datetime(x.year+1, 1, 1)) - timedelta(days=1)

    df_list.append(tmp[['INSTALL', 'METER_NO', 'FROM', 'TO', 'CONSUMPTION']])

# CONCATENATE ALL DFs    
final_df = pd.concat(df_list).sort_values('METER_NO').reset_index(drop=True)

print(final_df.head(10))
#     INSTALL  METER_NO       FROM         TO  CONSUMPTION
# 0  80000006  14954563 2017-09-01 2017-09-30          0.0
# 1  80000006  14954563 2017-10-01 2017-10-31          0.0
# 2  80000006  14954563 2017-11-01 2017-11-30          0.0
# 3  80000006  14954563 2018-01-01 2018-01-31          0.0
# 4  80000006  14954563 2017-12-01 2017-12-31          0.0
# 5  80000000  19151882 2017-09-01 2017-09-30          6.1
# 6  80000000  19151882 2017-10-01 2017-10-31          4.4
# 7  80000000  19151882 2017-11-01 2017-11-30          5.5
# 8  80000000  19151882 2018-01-01 2018-01-31          5.6
# 9  80000000  19151882 2017-12-01 2017-12-31          7.9
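
Note that the output above treats each date as the start of a calendar month, whereas the question asks for From/To taken from adjacent date columns. Below is a minimal sketch of that variant. It assumes the columns strictly alternate date, value, date, ... after the two ID columns, and that the dates run from newest to oldest, as in the sample. The names `date_cols`, `value_cols`, and `pieces` are illustrative only, and the sample is trimmed to two rows and five fields.

```python
import pandas as pd
from io import StringIO

# Trimmed sample in the same layout: ID, ID, date, value, date, value, date
txt = """INSTALL METER_NO Field3 Field4 Field5 Field6 Field7
80000000 19151882 1-Jan-18 5.6 1-Dec-17 7.9 1-Nov-17
80000003 MISSING 1-Jan-18 0 1-Dec-17 0 1-Nov-17
"""
df = pd.read_csv(StringIO(txt), sep=r"\s+")

id_cols = list(df.columns[:2])
date_cols = df.columns[2::2]    # Field3, Field5, Field7
value_cols = df.columns[3::2]   # Field4, Field6

pieces = []
# Each value sits between a later date (left) and an earlier date (right)
for to_col, val_col, from_col in zip(date_cols[:-1], value_cols, date_cols[1:]):
    piece = df[id_cols].copy()
    piece["From"] = pd.to_datetime(df[from_col], format="%d-%b-%y")
    piece["To"] = pd.to_datetime(df[to_col], format="%d-%b-%y")
    piece["Consumption"] = df[val_col]
    pieces.append(piece)

# The loop runs once per column pair, not per row
result = (pd.concat(pieces)
            .sort_values(["INSTALL", "From"])
            .reset_index(drop=True))
print(result)
```

Sorting on `INSTALL` and `From` gives each meter's rows in chronological order, which matches the layout requested in the question.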