How to train-test split time series data in a PySpark DataFrame

Date: 2019-03-13 15:22:44

Tags: apache-spark pyspark apache-spark-sql train-test-split

I want to do a train-test split on a sorted PySpark DataFrame based on time. Say the first 300 rows should go into the training set and the next 200 rows into the test set.

I can select the first 300 rows using:

# Select the first 300 rows (limit returns a DataFrame; show only prints)
train = df.limit(300)

But how do I select the last 200 rows of the PySpark DataFrame?

1 Answer:

Answer 0 (score: 0)

Assume you have a DataFrame df of size 500, sorted by its time column.

A simple way to handle this is to use limit for the training set, and to do the same on the DataFrame sorted in reverse order for the test set.
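
A minimal sketch of that approach (the DataFrame built here and the column name time are illustrative, standing in for your own sorted data):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Illustrative data: 500 rows with a monotonically increasing time column.
df = spark.range(500).withColumnRenamed("id", "time")

# Training set: the first 300 rows in time order.
train = df.orderBy("time").limit(300)

# Test set: the last 200 rows, taken by reversing the sort order;
# the final orderBy restores chronological order.
test = df.orderBy(F.col("time").desc()).limit(200).orderBy("time")

print(train.count(), test.count())  # 300 200

Note that limit on an unsorted DataFrame gives no guarantee about which rows you get, so the explicit orderBy calls matter for a time-based split.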
