I have the following data frame:
sp_id sp_dt v1 v2 v3
x1|x2|x30|x40 2018-10-07 100 200 300
x1|x2|x30|x40 2018-10-14 80 80 90
x1|x2|x30|x40 2018-10-21 34 35 36
x1|x2|x31|x41 2018-10-07 100 200 300
x1|x2|x31|x41 2018-10-14 80 80 90
x1|x2|x31|x41 2018-10-21 34 35 36
....
x1|x2|x39|x49 2018-10-21 340 350 36
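And an Excel file with the following data (each sheet in the Excel file may contain several variables, such as v4 and v5 shown below, and another sheet might contain v6):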
Variable sp_partid1 sp_partid2 2018-10-07 ... 2018-10-21
v4 x30 x40 160 ... 154
v4 x31 x41 59 ... 75
....
v4 x39 x49 75 ... 44
v5 x30 x40 16 ... 24
v5 x31 x41 59 ... 79
....
v5 x39 x49 75 ... 34
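sp_partid1 and sp_partid2 are optional columns. They are "part of sp_id" columns: each one holds a part of the sp_id in the top data frame. A sheet may have none of these columns or, in this particular example, up to 4 of them, each being one part of the sp_id column of the top data frame.
The final output should look like this: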
sp_id sp_dt v1 v2 v3 v4 v5
x1|x2|x30|x40 2018-10-07 100 200 300 160 16
x1|x2|x30|x40 2018-10-14 80 80 90 ... ...
x1|x2|x30|x40 2018-10-21 34 35 36 154 24
x1|x2|x31|x41 2018-10-07 100 200 300 59 59
x1|x2|x31|x41 2018-10-14 80 80 90 ... ...
x1|x2|x31|x41 2018-10-21 34 35 36 75 79
....
x1|x2|x39|x49 2018-10-21 340 350 36 44 34
Edit 1 start: How is the output produced?
get a list of variables
check if the variable (say v4 in this case) exists in any sheet
if it does:
    does it have any "part of sp_id" columns?
    # In the example shown, sp_partid1 and sp_partid2 of the excel sheets
    # are part of the sp_id of the dataframe.
    if yes:
        # it means part of the sp_id is common for all values (x1|x2 in this case).
        add a new column to the dataframe, v4, which has sp_id, sp_dt and
        the value for that date
    if no:
        # it means the whole sp_id is common for all values (x1|x2|x3|x4 in this case; not shown in the example).
        add a new column to the dataframe, v4, and copy the values under the appropriate dates in the excel sheet into the v4 values with the corresponding sp_dt
For example, 160 is the value for v4, x30, x40 under 2018-10-07, so v4 shows 160 in the first row of the final output.
Edit 1 end.
I started my code with the following:
import pandas as pd

# df is the top data frame, which I have not gotten around to using yet
# var_value takes values in a loop, such as 'v4', 'v5', ...
# sheets: the sheet names to read (not defined in this snippet)
sheets_dict = {name: pd.read_excel('excel_file.xlsx', sheet_name=name, parse_dates=True) for name in sheets}
for key, value in sheets_dict.items():
    if 'Variable' in value.columns:
        # 'Variable' column exists in this sheet
        if var_value in value['Variable'].values:
            # var_value exists in the 'Variable' column (say, v4)
            for column in value.columns:
                # date headers may not be strings, so check the type first
                if isinstance(column, str) and column.startswith('sp_'):
                    # Do something with column values, then map the values etc.
                    pass
Answer 0 (score: 0)
Suppose one of your Excel worksheets contains the following data:
Variable sp_partid1 sp_partid2 2018-10-07 2018-10-08 2018-10-21
0 v4 x30 x40 160 10.0 154
1 v4 x31 x41 59 NaN 75
2 v4 x32 x42 75 10.0 44
3 v5 x30 x40 16 10.0 24
4 v5 x31 x41 59 10.0 79
5 v5 x32 x42 75 10.0 34
You can combine the melt and pivot_table functions to get the desired result.
import pandas as pd

# sheet_name=None reads every worksheet into a dict of data frames
book = pd.read_excel('del.xlsx', sheet_name=None)
for df in book.values():
    # wide to long: one row per (Variable, part columns, Date)
    df = df.melt(id_vars=['Variable', 'sp_partid1', 'sp_partid2'], var_name="Date", value_name="Value")
    # concatenate strings of two columns separated by a '|'
    df['sp_id'] = df['sp_partid1'] + '|' + df['sp_partid2']
    df = df.loc[:, ['Variable', 'sp_id', 'Date', 'Value']]
    # long to wide: one column per variable
    df = df.pivot_table('Value', ['sp_id', 'Date'], 'Variable').reset_index(drop=False)
    print(df)
>> output
Variable sp_id Date v4 v5
0 x30|x40 2018-10-07 160.0 16.0
1 x30|x40 2018-10-08 10.0 10.0
2 x30|x40 2018-10-21 154.0 24.0
3 x31|x41 2018-10-07 59.0 59.0
4 x31|x41 2018-10-08 NaN 10.0
5 x31|x41 2018-10-21 75.0 79.0
6 x32|x42 2018-10-07 75.0 75.0
7 x32|x42 2018-10-08 10.0 10.0
8 x32|x42 2018-10-21 44.0 34.0
Reading an Excel workbook with sheet_name=None returns a dictionary with the worksheet name as key and the data frame as value.
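The reshaped frame above is keyed by a partial sp_id (for example x30|x40), while the top data frame uses the full sp_id (x1|x2|x30|x40), so one more step is needed to reach the output the question asks for. The following is a minimal sketch of that step, not part of the original answer. It assumes that the part columns correspond to the last two segments of sp_id and that both date columns have the same dtype; `part_key` is a hypothetical helper column, and the sample frames are abridged copies of the data shown above.

```python
import pandas as pd

# Top data frame from the question (abridged to two ids and two dates).
df = pd.DataFrame({
    "sp_id": ["x1|x2|x30|x40"] * 2 + ["x1|x2|x31|x41"] * 2,
    "sp_dt": pd.to_datetime(["2018-10-07", "2018-10-21"] * 2),
    "v1": [100, 34, 100, 34],
})

# Reshaped sheet in the layout printed above (abridged).
wide = pd.DataFrame({
    "sp_id": ["x30|x40", "x30|x40", "x31|x41", "x31|x41"],
    "Date": pd.to_datetime(["2018-10-07", "2018-10-21"] * 2),
    "v4": [160.0, 154.0, 59.0, 75.0],
    "v5": [16.0, 24.0, 59.0, 79.0],
})

# Assumption: the sheet's part columns are the last two segments of sp_id.
# part_key is a hypothetical temporary column used only for the join.
df["part_key"] = df["sp_id"].str.split("|").str[-2:].str.join("|")

# Left join keeps every row of df; rows without a match get NaN.
merged = (
    df.merge(
        wide.rename(columns={"sp_id": "part_key", "Date": "sp_dt"}),
        on=["part_key", "sp_dt"],
        how="left",
    )
    .drop(columns="part_key")
)
print(merged)
```

A left join is used so that rows of the top data frame with no matching sheet entry are kept with NaN rather than dropped.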
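Answer 1 (score: 0)
What you are trying to do makes sense, but it is a long sequence of operations, so it is normal to run into some trouble implementing it. I think you should step back to the higher level of abstraction of relational databases and use the high-level data frame operations that pandas provides.
Let's summarize what you want to do in terms of high-level operations: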
Reformat the sheets_dict data frames so that they hold the same data, presented differently:
id3 id4 date v4 v5
x30 x40 2018-10-07 160 154
x31 x41 2018-10-08 30 10
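Even though your overall goal is clear, the specification is still vague, so I cannot give you a precise implementation. I also have not provided references to guide you on relational databases, but I strongly recommend learning about them; it will save you a lot of time, especially if you need to do this kind of task often.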
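To make the relational view concrete, here is a rough sketch of how the whole task could be phrased as reshape-then-join, including the case where a sheet has no part columns at all. It is one possible reading of the question, not a definitive implementation. The function `add_sheet_variables` and its `part_cols` mapping (sheet column name to 0-based segment position in sp_id) are hypothetical: the question does not say how a part column is matched to a position in sp_id. The sketch also assumes sp_dt is already a datetime column and that every sheet column other than Variable and the part columns is a date.

```python
import pandas as pd

def add_sheet_variables(df, sheet, part_cols):
    """Left-join the variables of one wide sheet onto df.

    part_cols is a hypothetical mapping from a sheet column to the 0-based
    position of the sp_id segment it holds, e.g.
    {"sp_partid1": 2, "sp_partid2": 3}.
    """
    # Part columns are optional, so use only those present in this sheet.
    present = [c for c in part_cols if c in sheet.columns]
    # Wide to long: one row per (variable, key parts, date).
    long = sheet.melt(id_vars=["Variable"] + present,
                      var_name="sp_dt", value_name="value")
    long["sp_dt"] = pd.to_datetime(long["sp_dt"])
    # Long to wide: one column per variable.
    wide = (long.pivot_table(index=present + ["sp_dt"],
                             columns="Variable", values="value")
                .reset_index())
    wide.columns.name = None
    # Expose the matching sp_id segments of df as temporary join columns.
    segments = df["sp_id"].str.split("|", expand=True)
    keyed = df.copy()
    for col in present:
        keyed[col] = segments[part_cols[col]]
    # With no part columns, the join is on the date alone.
    out = keyed.merge(wide, on=present + ["sp_dt"], how="left")
    return out.drop(columns=present)

# Abridged sample data based on the question.
df = pd.DataFrame({
    "sp_id": ["x1|x2|x30|x40"] * 2 + ["x1|x2|x31|x41"] * 2,
    "sp_dt": pd.to_datetime(["2018-10-07", "2018-10-21"] * 2),
    "v1": [100, 34, 100, 34],
})
sheet = pd.DataFrame({
    "Variable": ["v4", "v4", "v5", "v5"],
    "sp_partid1": ["x30", "x31", "x30", "x31"],
    "sp_partid2": ["x40", "x41", "x40", "x41"],
    "2018-10-07": [160, 59, 16, 59],
    "2018-10-21": [154, 75, 24, 79],
})

result = add_sheet_variables(df, sheet, {"sp_partid1": 2, "sp_partid2": 3})
print(result)
```

Calling the function once per sheet, feeding each result into the next call, would accumulate the new variable columns. Note that pivot_table averages duplicate entries by default, so a sheet with repeated keys would be silently aggregated.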