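I'm sure this has already been asked and answered, but I haven't been able to find it. I have a file in the format
StationID,Year,JanValue,FebValue,MarValue,AprilValue,...,DecValue
I want to convert it from a wide file with 12 months per row into a long, skinny file with only StationID, date, value, year, and month.
I put together some code that does this, and it works. It takes a pandas DataFrame as input and returns a DataFrame. But it's slow, and I'm sure it's inefficient. Any help would be appreciated.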
import numpy as np
import pandas

def long_skinny(df):
    # df is a pandas dataframe for a single station, one row per year
    # get min and max year from dataframe
    min_year = df['year'].min()
    max_year = df['year'].max()
    # set startdate to Jan. 1st of the first year.
    startdate = str(min_year) + "0101"
    # final file will have this many periods
    num_periods = ((max_year - min_year) + 1) * 12
    # generate a monthly datetime index; 'M' labels each period with its
    # month-end date (newer pandas versions spell this alias 'ME')
    dates = pandas.date_range(start=startdate, periods=num_periods, freq='M')
    # set up an empty list
    tmps = []
    # find years that are in the input dataframe
    avail_years = df['year'].tolist()
    # take the station id as a scalar; assigning the whole Series later would
    # align on the index and produce NaN
    id_tmp = df['id'].iloc[0]
    for iyear in range(min_year, max_year + 1):
        # check to see if year is in the original file
        if iyear in avail_years:
            year_rec = df[(df['year'] == iyear)]
            tmps.append(int(year_rec['tmp1'].iloc[0]))
            tmps.append(int(year_rec['tmp2'].iloc[0]))
            tmps.append(int(year_rec['tmp3'].iloc[0]))
            tmps.append(int(year_rec['tmp4'].iloc[0]))
            tmps.append(int(year_rec['tmp5'].iloc[0]))
            tmps.append(int(year_rec['tmp6'].iloc[0]))
            tmps.append(int(year_rec['tmp7'].iloc[0]))
            tmps.append(int(year_rec['tmp8'].iloc[0]))
            tmps.append(int(year_rec['tmp9'].iloc[0]))
            tmps.append(int(year_rec['tmp10'].iloc[0]))
            tmps.append(int(year_rec['tmp11'].iloc[0]))
            tmps.append(int(year_rec['tmp12'].iloc[0]))
        else:
            # year is missing: fill all 12 months with the sentinel value
            tmps.append(-9999)
            tmps.append(-9999)
            tmps.append(-9999)
            tmps.append(-9999)
            tmps.append(-9999)
            tmps.append(-9999)
            tmps.append(-9999)
            tmps.append(-9999)
            tmps.append(-9999)
            tmps.append(-9999)
            tmps.append(-9999)
            tmps.append(-9999)
    tmps_np = np.asarray(tmps, dtype=np.int64)
    var_names = ["temp"]
    ls_df = pandas.DataFrame(tmps_np, index=dates, columns=var_names)
    # add two fields for the year and month
    ls_df['year'] = ls_df.index.year
    ls_df['month'] = ls_df.index.month
    ls_df['id'] = id_tmp
    return ls_df
Answer 0 (score: 1)
Take a hypothetical example:
StationID,Year,JanValue,FebValue,MarValue,AprValue,DecValue
A,2017,1,2,8,4,5
B,2017,1,2,8,4,5
A,2018,1,2,3,4,5
B,2018,1,2,3,4,5
The code would look like this:
df = df.melt(id_vars=['StationID', 'Year'], var_name='Month', value_vars=['JanValue','FebValue','MarValue','AprValue','DecValue'])
After that, you can clean up the month names with:
df['Month'] = df['Month'].str.replace('Value','')
Result:
   StationID  Year Month  value
0          A  2017   Jan      1
1          B  2017   Jan      1
2          A  2018   Jan      1
3          B  2018   Jan      1
4          A  2017   Feb      2
5          B  2017   Feb      2
6          A  2018   Feb      2
7          B  2018   Feb      2
8          A  2017   Mar      8
9          B  2017   Mar      8
10         A  2018   Mar      3
11         B  2018   Mar      3
12         A  2017   Apr      4
13         B  2017   Apr      4
14         A  2018   Apr      4
15         B  2018   Apr      4
16         A  2017   Dec      5
17         B  2017   Dec      5
18         A  2018   Dec      5
19         B  2018   Dec      5
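So the only thing left is to sort the rows into the order you want. Making Month an ordered categorical sorts the months in calendar order rather than alphabetically: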
months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
df['Month'] = pd.Categorical(df['Month'], categories=months, ordered=True)
df.sort_values(['StationID','Year','Month'], inplace=True)
Which gives:
   StationID  Year Month  value
0          A  2017   Jan      1
4          A  2017   Feb      2
8          A  2017   Mar      8
12         A  2017   Apr      4
16         A  2017   Dec      5
2          A  2018   Jan      1
6          A  2018   Feb      2
10         A  2018   Mar      3
14         A  2018   Apr      4
18         A  2018   Dec      5
1          B  2017   Jan      1
5          B  2017   Feb      2
9          B  2017   Mar      8
13         B  2017   Apr      4
17         B  2017   Dec      5
3          B  2018   Jan      1
7          B  2018   Feb      2
11         B  2018   Mar      3
15         B  2018   Apr      4
19         B  2018   Dec      5
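The question also asks for a date column, which the steps above stop short of. A minimal end-to-end sketch follows, assuming the same hypothetical sample data and assuming the month abbreviations are English (parsing with `%b` depends on the locale). The names `long_df` and `Date` are illustrative and not part of the original example.

```python
import pandas as pd

# Hypothetical sample data mirroring the example above (only five month columns).
df = pd.DataFrame({
    "StationID": ["A", "B", "A", "B"],
    "Year": [2017, 2017, 2018, 2018],
    "JanValue": [1, 1, 1, 1],
    "FebValue": [2, 2, 2, 2],
    "MarValue": [8, 8, 3, 3],
    "AprValue": [4, 4, 4, 4],
    "DecValue": [5, 5, 5, 5],
})

# Wide to long: one row per station, year, and month.
long_df = df.melt(id_vars=["StationID", "Year"], var_name="Month")
long_df["Month"] = long_df["Month"].str.replace("Value", "", regex=False)

# Build a first-of-month date from strings such as "2017Jan".
long_df["Date"] = pd.to_datetime(
    long_df["Year"].astype(str) + long_df["Month"], format="%Y%b"
)

# Sorting by the date puts months in calendar order without a Categorical.
long_df = long_df.sort_values(["StationID", "Date"]).reset_index(drop=True)
```

Once a real date column exists, sorting on it is an alternative to the ordered categorical; either approach should give the same row order.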
Answer 1 (score: 0)
Oh, it looks like you're doing a lot more work than you need to.
df = df.melt(id_vars=("StationID", "Year"), var_name="Month", value_name="Value")
Then you can replace the variable names with months using something like:
df["Month"] = df["Month"].str.replace(...)
Build the dates as needed:
df["Date"] = pd.to_datetime(...)
And so on. I'd be more specific, but without an example of your actual data, this is the best I can do...
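For illustration only, here is one way those steps might look with the column names from the question's code (`id`, `year`, `tmp1` to `tmp12`). The sample data is made up, and the `reindex` step is an assumption about how to reproduce the question's -9999 fill for missing years; it is not something this answer specifies.

```python
import pandas as pd

# Hypothetical input using the question's column names (id, year, tmp1..tmp12).
# The year 2018 is deliberately missing for this station.
wide = pd.DataFrame(
    [["X1", 2017] + list(range(1, 13)),
     ["X1", 2019] + list(range(101, 113))],
    columns=["id", "year"] + [f"tmp{m}" for m in range(1, 13)],
)

# Wide to long; the month number is the digits after "tmp".
long_df = wide.melt(id_vars=["id", "year"], var_name="month", value_name="temp")
long_df["month"] = long_df["month"].str.replace("tmp", "", regex=False).astype(int)

# Insert -9999 rows for every (id, year, month) combination that is absent.
full_index = pd.MultiIndex.from_product(
    [wide["id"].unique(),
     range(wide["year"].min(), wide["year"].max() + 1),
     range(1, 13)],
    names=["id", "year", "month"],
)
long_df = (
    long_df.set_index(["id", "year", "month"])
    .reindex(full_index, fill_value=-9999)
    .reset_index()
)

# to_datetime assembles dates from columns named year, month, and day.
long_df["date"] = pd.to_datetime(long_df[["year", "month"]].assign(day=1))
```

This sketch dates each row on the first of the month, whereas the question's `date_range(..., freq='M')` produces month-end dates. Because everything is vectorized, it also handles several stations in one frame, which the original loop does not.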