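My records look like this. Each consumption value sits between two date columns; for example, the consumption from 1-Aug-17 (Field13) to 1-Sep-17 (Field11) is 6.1 (Field12), and so on.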
INSTALL METER_NO Field3 Field4 Field5 Field6 Field7 Field8 Field9 Field10 Field11 Field12 Field13
80000000 19151882 1-Jan-18 5.6 1-Dec-17 7.9 1-Nov-17 5.5 1-Oct-17 4.4 1-Sep-17 6.1 1-Aug-17
80000001 31692087 1-Jan-18 55.5 1-Dec-17 62.7 1-Nov-17 2.2 1-Oct-17 2 1-Sep-17 9.3 1-Aug-17
80000003 MISSING 1-Jan-18 0 1-Dec-17 0 1-Nov-17 0 1-Oct-17 0 1-Sep-17 0 1-Aug-17
80000004 98914998 1-Jan-18 8.6 1-Dec-17 19.4 1-Nov-17 7.5 1-Oct-17 5.4 1-Sep-17 6.8 1-Aug-17
80000005 48962501 1-Jan-18 1 1-Dec-17 1.3 1-Nov-17 1.8 1-Oct-17 1.7 1-Sep-17 2.7 1-Aug-17
80000006 14954563 1-Jan-18 0 1-Dec-17 0 1-Nov-17 0 1-Oct-17 0 1-Sep-17 0 1-Aug-17
I am trying to get them into this format:
Install Meter_NO From To Consumption
80000000 19151882 8/1/2017 9/1/2017 6.1
80000000 19151882 9/1/2017 10/1/2017 4.4
80000000 19151882 10/1/2017 11/1/2017 5.5
80000000 19151882 11/1/2017 12/1/2017 7.9
80000000 19151882 12/1/2017 1/1/2018 5.6
....
Is there a way to do this without iterating over the DataFrame?
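Answer 0 (score: 0)
Consider concatenating column subsets, one for each pair of date and value fields. The FROM and TO dates need some wrangling: here FROM is the date column paired with each value, and TO is derived as the last day of that month.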
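The date wrangling on its own can be sketched as follows. This is a minimal illustration, not part of the original answer: the names `dates`, `start`, and `end` are made up for the example, and it assumes the dates always use the `%d-%b-%y` layout seen in the sample data. It uses `pd.offsets.MonthEnd(0)` to roll each date forward to the end of its month.

```python
import pandas as pd

# Hypothetical sample of the date strings found in the Field columns
dates = pd.Series(["1-Jan-18", "1-Dec-17", "1-Sep-17"])

# Parse day-abbreviated month-two digit year, e.g. "1-Jan-18"
start = pd.to_datetime(dates, format="%d-%b-%y")

# MonthEnd(0) rolls a date forward to the last day of its own month
end = start + pd.offsets.MonthEnd(0)

print(pd.DataFrame({"FROM": start, "TO": end}))
```

The offset handles the December to January rollover itself, so no special case for month 12 is needed.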
Data
import pandas as pd
from io import StringIO
txt = """
INSTALL METER_NO Field3 Field4 Field5 Field6 Field7 Field8 Field9 Field10 Field11 Field12 Field13
80000000 19151882 "1-Jan-18" 5.6 "1-Dec-17" 7.9 "1-Nov-17" 5.5 "1-Oct-17" 4.4 "1-Sep-17" 6.1 "1-Aug-17"
80000001 31692087 "1-Jan-18" 55.5 "1-Dec-17" 62.7 "1-Nov-17" 2.2 "1-Oct-17" 2 "1-Sep-17" 9.3 "1-Aug-17"
80000003 MISSING "1-Jan-18" 0 "1-Dec-17" 0 "1-Nov-17" 0 "1-Oct-17" 0 "1-Sep-17" 0 "1-Aug-17"
80000004 98914998 "1-Jan-18" 8.6 "1-Dec-17" 19.4 "1-Nov-17" 7.5 "1-Oct-17" 5.4 "1-Sep-17" 6.8 "1-Aug-17"
80000005 48962501 "1-Jan-18" 1 "1-Dec-17" 1.3 "1-Nov-17" 1.8 "1-Oct-17" 1.7 "1-Sep-17" 2.7 "1-Aug-17"
80000006 14954563 "1-Jan-18" 0 "1-Dec-17" 0 "1-Nov-17" 0 "1-Oct-17" 0 "1-Sep-17" 0 "1-Aug-17"
"""
df = pd.read_table(StringIO(txt), sep=r"\s+")
Code
from datetime import timedelta, datetime

# Value columns sit at positions 3, 5, ..., each preceded by its date column
needed_cols = list(range(3, len(df.columns), 2))
df_list = []

# BUILD DF SUBSETS
for n in needed_cols:
    # .copy() avoids SettingWithCopyWarning on the assignments below
    tmp = df[df.columns[[0, 1, n-1, n]]].copy()
    tmp.columns = ['INSTALL', 'METER_NO', 'FROM', 'CONSUMPTION']
    tmp['FROM'] = pd.to_datetime(tmp['FROM'], format='%d-%b-%y')
    # TO is the last day of FROM's month: first day of the next month minus one day
    tmp['TO'] = tmp['FROM'].apply(lambda x: datetime(x.year, x.month + 1, 1)
                                  if x.month < 12 else datetime(x.year+1, 1, 1)) - timedelta(days=1)
    df_list.append(tmp[['INSTALL', 'METER_NO', 'FROM', 'TO', 'CONSUMPTION']])

# CONCATENATE ALL DFs
final_df = pd.concat(df_list).sort_values('METER_NO').reset_index(drop=True)
print(final_df.head(10))
# INSTALL METER_NO FROM TO CONSUMPTION
# 0 80000006 14954563 2017-09-01 2017-09-30 0.0
# 1 80000006 14954563 2017-10-01 2017-10-31 0.0
# 2 80000006 14954563 2017-11-01 2017-11-30 0.0
# 3 80000006 14954563 2018-01-01 2018-01-31 0.0
# 4 80000006 14954563 2017-12-01 2017-12-31 0.0
# 5 80000000 19151882 2017-09-01 2017-09-30 6.1
# 6 80000000 19151882 2017-10-01 2017-10-31 4.4
# 7 80000000 19151882 2017-11-01 2017-11-30 5.5
# 8 80000000 19151882 2018-01-01 2018-01-31 5.6
# 9 80000000 19151882 2017-12-01 2017-12-31 7.9
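The output above ends each period at the calendar month end, whereas the question asks for From and To taken from the two date columns on either side of each value. A minimal sketch of that variant follows, under stated assumptions: the names `pieces` and `long_df` are illustrative, only the first two sample rows are used, and the layout is assumed to alternate date and value columns exactly as in the sample, with the older date to the right of each value. It loops over the five column triples, not over rows.

```python
import pandas as pd
from io import StringIO

# First two rows of the sample data from the question
txt = """INSTALL METER_NO Field3 Field4 Field5 Field6 Field7 Field8 Field9 Field10 Field11 Field12 Field13
80000000 19151882 1-Jan-18 5.6 1-Dec-17 7.9 1-Nov-17 5.5 1-Oct-17 4.4 1-Sep-17 6.1 1-Aug-17
80000001 31692087 1-Jan-18 55.5 1-Dec-17 62.7 1-Nov-17 2.2 1-Oct-17 2 1-Sep-17 9.3 1-Aug-17
"""
# Keep METER_NO as text because the full data contains the value MISSING
df = pd.read_csv(StringIO(txt), sep=r"\s+", dtype={"METER_NO": str})

pieces = []
# Value columns are at positions 3, 5, ..., 11.
# The column to the right (k + 1) is the older date, the one to the left (k - 1) the newer.
for k in range(3, len(df.columns) - 1, 2):
    piece = df.iloc[:, [0, 1, k + 1, k - 1, k]].copy()
    piece.columns = ["Install", "Meter_NO", "From", "To", "Consumption"]
    pieces.append(piece)

long_df = pd.concat(pieces, ignore_index=True)
for col in ("From", "To"):
    long_df[col] = pd.to_datetime(long_df[col], format="%d-%b-%y")

# Stable sort so each install's periods appear in chronological order
long_df = long_df.sort_values(["Install", "From"], kind="stable").reset_index(drop=True)
print(long_df.head())
```

For install 80000000 this gives the periods 2017-08-01 to 2017-09-01 (6.1) through 2017-12-01 to 2018-01-01 (5.6), matching the rows requested in the question apart from the date display format.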