I have a DataFrame like the following:
import pandas as pd

data = [
    [101, '1987-09-01', 1, 1, '1987-09-01', 2, 2],
    [102, '1987-09-01', 1, 1, '1999-09-01', 2, 2],
    [103, 'nan', 0, 0, '1999-09-01', 2, 2]
]
df = pd.DataFrame(data, columns=['ID', 'Date1', 'x1', 'y1', 'Date2', 'x2', 'y2'])
df['Date1'] = pd.to_datetime(df['Date1'])
df['Date2'] = pd.to_datetime(df['Date2'])
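If the values of the date columns in a row are the same, the x and y values should be summed. If they are not the same, the row should be split into two rows, each keeping its own values unchanged.
Explained in (pseudo)code: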
# Collect every column whose name contains 'Date'
date_cols = [name for name in df.columns if 'Date' in name]
for index, row in df.iterrows():
    date1, date2 = row[date_cols[0]], row[date_cols[1]]
    # Compare the values of the dates. See if they are equal
    if date1 == date2:
        # Sum the values of x1, x2. And sum the values of y1, y2
        pass
    else:
        # Group by date. Create two separate rows and do not sum the values of x and y.
        pass
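An additional challenge is that there may be fewer or more than 2 date columns. The column names will, however, always contain the string "Date". For example, if there are three date columns holding three different values, the goal is to create three rows. If there is only 1 date column, no modification is needed.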
desired_outcome = [[101, '1987-09-01', 3, 3], [102, '1987-09-01', 1, 1], [102, '1999-09-01', 2, 2], [103, '1999-09-01', 2, 2]]
df_desired_outcome = pd.DataFrame(desired_outcome, columns=['ID', 'Date', 'x', 'y'])
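For reference, the rule described above can be spelled out as a plain loop that works for any number of Date columns. This is only a baseline sketch under a few assumptions that are not stated in the question: every DateN column has matching xN and yN columns, the dates are still plain strings, and missing dates (the 'nan' entry) are skipped. The names `date_cols`, `totals` and `result` are made up for illustration.

```python
import pandas as pd

data = [
    [101, '1987-09-01', 1, 1, '1987-09-01', 2, 2],
    [102, '1987-09-01', 1, 1, '1999-09-01', 2, 2],
    [103, 'nan', 0, 0, '1999-09-01', 2, 2],
]
df = pd.DataFrame(data, columns=['ID', 'Date1', 'x1', 'y1', 'Date2', 'x2', 'y2'])

# Every column whose name contains 'Date' starts a (Date, x, y) group.
date_cols = [c for c in df.columns if 'Date' in c]

records = []
for _, row in df.iterrows():
    totals = {}  # date -> [sum of x, sum of y] within this row
    for col in date_cols:
        suffix = col[len('Date'):]  # e.g. '1' for 'Date1' (assumed naming)
        date = row[col]
        if pd.isna(date) or date == 'nan':
            continue  # assumption: rows without a date are dropped
        acc = totals.setdefault(date, [0, 0])
        acc[0] += row['x' + suffix]
        acc[1] += row['y' + suffix]
    for date, (x, y) in totals.items():
        records.append([row['ID'], date, x, y])

result = pd.DataFrame(records, columns=['ID', 'Date', 'x', 'y'])
print(result)
```

A loop like this is easy to follow but slow on large frames, which is why a vectorized reshape is preferable.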
Answer 0 (score: 2)
First reshape with wide_to_long, then aggregate with sum:
df1 = pd.wide_to_long(df.reset_index(),
                      stubnames=['Date','x','y'],
                      i=['index','ID'],
                      j='tmp')
df1 = df1.groupby(['index','ID','Date']).sum().reset_index(level=0, drop=True).reset_index()
print(df1)
    ID        Date  x  y
0  101  1987-09-01  3  3
1  102  1987-09-01  1  1
2  102  1999-09-01  2  2
3  103  1999-09-01  2  2
4  103         nan  0  0
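The last row of that output (103, nan) shows up because the dates were still plain strings there, so 'nan' is treated as an ordinary group key. A minimal sketch of one way to get rid of it, assuming the Date columns can be converted with pd.to_datetime first: the missing date then becomes NaT, and groupby drops NaT keys by default. The `errors='coerce'` argument and the `date_cols` helper are my additions, not part of the original code.

```python
import pandas as pd

data = [
    [101, '1987-09-01', 1, 1, '1987-09-01', 2, 2],
    [102, '1987-09-01', 1, 1, '1999-09-01', 2, 2],
    [103, 'nan', 0, 0, '1999-09-01', 2, 2],
]
df = pd.DataFrame(data, columns=['ID', 'Date1', 'x1', 'y1', 'Date2', 'x2', 'y2'])

# Convert every Date column; unparseable or missing values become NaT.
date_cols = [c for c in df.columns if 'Date' in c]
for col in date_cols:
    df[col] = pd.to_datetime(df[col], errors='coerce')

df1 = pd.wide_to_long(df.reset_index(),
                      stubnames=['Date', 'x', 'y'],
                      i=['index', 'ID'],
                      j='tmp')

# groupby skips NaT keys by default, so the row without a date disappears.
df1 = (df1.groupby(['index', 'ID', 'Date'])
          .sum()
          .reset_index(level=0, drop=True)
          .reset_index())
print(df1)
```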
If the ID values are unique, the solution can be simplified:
df1 = pd.wide_to_long(df,
                      stubnames=['Date','x','y'],
                      i='ID',
                      j='tmp')
df1 = df1.groupby(['ID','Date']).sum().reset_index()
print(df1)
    ID        Date  x  y
0  101  1987-09-01  3  3
1  102  1987-09-01  1  1
2  102  1999-09-01  2  2
3  103  1999-09-01  2  2
4  103         nan  0  0
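The reason uniqueness matters: wide_to_long requires the columns passed as i to identify each row uniquely and raises a ValueError otherwise. The sketch below illustrates this with a made-up frame `dup` in which the ID 101 appears twice; adding the positional index via reset_index, as in the first variant, avoids the error.

```python
import pandas as pd

# Hypothetical frame with a duplicated ID, for illustration only.
dup = pd.DataFrame(
    [[101, '1987-09-01', 1, 1, '1987-09-01', 2, 2],
     [101, '1987-09-01', 1, 1, '1999-09-01', 2, 2]],
    columns=['ID', 'Date1', 'x1', 'y1', 'Date2', 'x2', 'y2'],
)

try:
    pd.wide_to_long(dup, stubnames=['Date', 'x', 'y'], i='ID', j='tmp')
    raised = False
except ValueError as err:
    raised = True
    print('wide_to_long failed:', err)

# Adding the row position to the id columns makes each row identifiable again.
long_df = pd.wide_to_long(dup.reset_index(),
                          stubnames=['Date', 'x', 'y'],
                          i=['index', 'ID'],
                          j='tmp')
print(long_df)
```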
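EDIT:
If the value columns do not end with 1, 2 the way the Date columns do, you can normalize their names by the first 2 letters and then apply the solution above (with the stub names changed):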
data = [
    [101, '1987-09-01', 1, 1, '1987-09-01', 2, 2],
    [102, '1987-09-01', 1, 1, '1999-09-01', 2, 2],
    [103, 'nan', 0, 0, '1999-09-01', 2, 2]
]
df = pd.DataFrame(data, columns=['ID', 'Date1', 'OPxx', 'NPxy',
                                 'Date2', 'OPyy', 'NPyx'])
s = df.columns.to_series()
# leave the ID and Date columns untouched
m = s.str.startswith(('ID','Date'))
# first 2 letters of the remaining column names
s1 = s[~m].str[:2]
# running counter per prefix: 1 for the first OP/NP, 2 for the second, ...
s2 = s1.groupby(s1).cumcount().add(1).astype(str)
s[~m] = s1 + s2
print(s)
ID          ID
Date1    Date1
OPxx       OP1
NPxy       NP1
Date2    Date2
OPyy       OP2
NPyx       NP2
dtype: object
df = df.rename(columns=s)
print(df)
    ID       Date1  OP1  NP1       Date2  OP2  NP2
0  101  1987-09-01    1    1  1987-09-01    2    2
1  102  1987-09-01    1    1  1999-09-01    2    2
2  103         nan    0    0  1999-09-01    2    2
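With the names normalized, the reshape and sum step is the same as before and only the stub names change. A short sketch, assuming the frame already has the renamed columns printed above (it is built directly with those names here so the snippet runs on its own):

```python
import pandas as pd

data = [
    [101, '1987-09-01', 1, 1, '1987-09-01', 2, 2],
    [102, '1987-09-01', 1, 1, '1999-09-01', 2, 2],
    [103, 'nan', 0, 0, '1999-09-01', 2, 2],
]
# Assumption: columns already carry the normalized names from the rename step.
df = pd.DataFrame(data, columns=['ID', 'Date1', 'OP1', 'NP1',
                                 'Date2', 'OP2', 'NP2'])

# Same reshape as before, with the stub names 'OP' and 'NP' instead of 'x' and 'y'.
df1 = pd.wide_to_long(df.reset_index(),
                      stubnames=['Date', 'OP', 'NP'],
                      i=['index', 'ID'],
                      j='tmp')
df1 = (df1.groupby(['index', 'ID', 'Date'])
          .sum()
          .reset_index(level=0, drop=True)
          .reset_index())
print(df1)
```

Because the dates are strings in this variant, the 'nan' entry is kept as its own group, just as in the earlier outputs.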
EDIT2: I tried to create a more general solution:
data = [
    [101, '1987-09-01', 1, 1, '1987-09-01', 2, 2, 3],
    [102, '1987-09-01', 1, 1, '1999-09-01', 2, 2, 3],
    [103, 'nan', 0, 0, '1999-09-01', 2, 2, 3]
]
df = pd.DataFrame(data, columns=['ID', 'Date1', 'OPxx', 'NPxy', 'Date2',
                                 'OPyy', 'NPyx', 'WZ'])
df['Date1'] = pd.to_datetime(df['Date1'])
df['Date2'] = pd.to_datetime(df['Date2'])
s = df.columns.to_series()
# get the first 2 characters of each column name
s1 = s.str[:2]
# start a new group at ID and at every Da (first 2 letters of Date)
s2 = s1.isin(['ID','Da']).cumsum().astype(str)
s = s1 + s2
print(s)
ID       ID1
Date1    Da2
OPxx     OP2
NPxy     NP2
Date2    Da3
OPyy     OP3
NPyx     NP3
WZ       WZ3
dtype: object
df = df.rename(columns=s)
print(df)
   ID1        Da2  OP2  NP2        Da3  OP3  NP3  WZ3
0  101 1987-09-01    1    1 1987-09-01    2    2    3
1  102 1987-09-01    1    1 1999-09-01    2    2    3
2  103        NaT    0    0 1999-09-01    2    2    3
Then build the stub names dynamically: all unique values of s1, excluding ID and index:
import numpy as np

print(np.setdiff1d(s1.unique(), ['ID', 'index']))
['Da' 'NP' 'OP' 'WZ']
df1 = pd.wide_to_long(df.reset_index(),
                      stubnames=np.setdiff1d(s1.unique(), ['ID', 'index']),
                      i=['index','ID1'],
                      j='tmp')
Then sum:
df2 = (df1.groupby(['index','ID1','Da'])
          .sum()
          .reset_index(level=0, drop=True)
          .reset_index())
print(df2)
   ID1         Da  NP  OP   WZ
0  101 1987-09-01   3   3  3.0
1  102 1987-09-01   1   1  0.0
2  102 1999-09-01   2   2  3.0
3  103 1999-09-01   2   2  3.0
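One detail in that output: WZ exists only in the second group, so after the reshape it is NaN for the first group. That is why the column is float, and why 102 / 1987-09-01 shows 0.0 (the sum of an all-NaN group is 0 by default). If a missing value should stay missing instead, passing min_count=1 to sum is one option. A small sketch on a hand-built long frame (`long_df` is made up here to mimic the reshaped WZ column):

```python
import numpy as np
import pandas as pd

# Hypothetical long-format excerpt: WZ is NaN wherever the first group had no WZ column.
long_df = pd.DataFrame({
    'ID1': [101, 101, 102, 102],
    'Da': pd.to_datetime(['1987-09-01', '1987-09-01',
                          '1987-09-01', '1999-09-01']),
    'WZ': [np.nan, 3, np.nan, 3],
})

# Default behaviour: an all-NaN group sums to 0.0.
default_sum = long_df.groupby(['ID1', 'Da'])['WZ'].sum()

# min_count=1 requires at least one non-NaN value, otherwise the result is NaN.
strict_sum = long_df.groupby(['ID1', 'Da'])['WZ'].sum(min_count=1)

print(default_sum)
print(strict_sum)
```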