My df has more than 1 million rows; a sample of it is shown below.
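I have to sort the data by date and then create new dataframes based on the order in which each value occurs in the "Amount" column. So if x has purchased 3 times, I need it in 3 different dataframes. The first_purchase dataframe would contain every ID that has made a purchase, regardless of date or amount. If an ID has purchased 3 times, I need that ID to appear in the first-purchase dataframe, then the second, then the third, each time with its Date and Amount.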
Doing it manually is easy. With this sample data:-
ID  Date    Amount
x   May 1   10
y   May 2   20
z   May 4   30
x   May 1   40
y   May 1   50
z   May 2   60
x   May 1   70
y   May 5   80
a   May 6   90
b   May 8   100
x   May 10  110
The first-purchase dataframe, and the dataframe holding everything after it, are created with:-
df = df.sort_values('Date')
first_purchase = df.drop_duplicates('ID')
after_1stpurchase = df[~df.index.isin(first_purchase.index)]
How can I create a loop that gives me each of these dataframes?
Answer 0 (score: 1)
IIUC, I was able to achieve what you wanted.
import pandas as pd
import numpy as np
# source data for the dataframe
data = {
    "ID": ["x", "y", "z", "x", "y", "z", "x", "y", "a", "b", "x"],
    "Date": ["May 01", "May 02", "May 04", "May 01", "May 01", "May 02", "May 01", "May 05", "May 06", "May 08", "May 10"],
    "Amount": [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110]
}
df = pd.DataFrame(data)
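# parse the Date column as datetime, then format it back to strings like "May 01"
# (the column holds strings again afterwards, so the sort below is lexicographic;
# it matches date order here only because every date is in May with a zero-padded day)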
df['Date'] = pd.to_datetime(df['Date'], format='%b %d').dt.strftime('%b %d')
# sort the values on ID and Date
df.sort_values(by=['ID', 'Date'], inplace=True)
df.reset_index(inplace=True, drop=True)
print(df)
Original dataframe:
    Amount    Date  ID
0       90  May 06   a
1      100  May 08   b
2       10  May 01   x
3       40  May 01   x
4       70  May 01   x
5      110  May 10   x
6       50  May 01   y
7       20  May 02   y
8       80  May 05   y
9       60  May 02   z
10      30  May 04   z
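One caveat with the conversion above: because `strftime` turns the column back into strings, the sort is alphabetical rather than chronological, which would put "Jun" before "May" in data spanning several months. Below is a minimal sketch of sorting on real datetimes instead. It assumes all purchases fall in a single year; the placeholder year 2019, the helper column `_when` and the small `sample` frame are made up for illustration and are not part of the original data.

```python
import pandas as pd

# small illustrative frame that spans two months
sample = pd.DataFrame({
    "ID": ["x", "y", "x", "y"],
    "Date": ["May 10", "Jun 02", "Jun 01", "May 02"],
    "Amount": [10, 20, 30, 40],
})

# ASSUMPTION: the data carries no year, so a placeholder year (2019) is
# prepended purely so that the strings can be parsed as full dates.
sample["_when"] = pd.to_datetime("2019 " + sample["Date"], format="%Y %b %d")

# sort on the parsed datetime, then drop the helper column again
by_date = (
    sample.sort_values(["ID", "_when"])
    .drop(columns="_when")
    .reset_index(drop=True)
)
print(by_date)

# for comparison: sorting the raw strings puts "Jun" before "May"
by_text = sample.sort_values(["ID", "Date"]).reset_index(drop=True)
print(by_text)
```

The displayed Date strings are left untouched; only the sort key changes.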
# create a sorted list of unique ids
list_id = sorted(set(df['ID']))
# number of transactions per id that must be separated out.
# For example, if we want to record 3 entries for
# each id, iter would be 3. This will create
# three new dataframes that hold the 1st, 2nd and 3rd
# transactions respectively.
# (note: the name iter shadows Python's built-in iter)
iter = 3
# one list of single-row slices per transaction number
parts = [[] for _ in range(iter)]
for val in list_id:
    tmp_df = df.loc[df['ID'] == val].reset_index(drop=True)
    # consider only the top iter(=3) rows to be distributed
    counter = np.minimum(tmp_df.shape[0], iter)
    for idx in range(counter):
        parts[idx].append(tmp_df.iloc[[idx]])
# DataFrame.append was removed in pandas 2.0, so build each frame with pd.concat
df_list = [
    pd.concat(p, ignore_index=True) if p else pd.DataFrame(columns=df.columns)
    for p in parts
]
for out_df in df_list:
    print(out_df)
Transaction 1:
   Amount    Date  ID
0      90  May 06   a
1     100  May 08   b
2      10  May 01   x
3      50  May 01   y
4      60  May 02   z
Transaction 2:
   Amount    Date  ID
0      40  May 01   x
1      20  May 02   y
2      30  May 04   z
Transaction 3:
   Amount    Date  ID
0      70  May 01   x
1      80  May 05   y
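Note that your data has four transactions for "x". If you also want to track the fourth transaction, all you need to do is change the value of 'iter' to 4, and you will get a fourth dataframe as well, with the following values:
   Amount    Date  ID
0     110  May 10   x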
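With more than a million rows, the per-ID loop is likely to be slow, since it filters the whole frame once for every ID. A vectorised alternative (a sketch, not the approach used in this answer) is to number each ID's rows with `groupby(...).cumcount()` and then split the frame on that number. The names `purchase_no` and `purchases` are illustrative, and the Date column is sorted as zero-padded strings exactly as in the code above.

```python
import pandas as pd

df = pd.DataFrame({
    "ID": ["x", "y", "z", "x", "y", "z", "x", "y", "a", "b", "x"],
    "Date": ["May 01", "May 02", "May 04", "May 01", "May 01", "May 02",
             "May 01", "May 05", "May 06", "May 08", "May 10"],
    "Amount": [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110],
})
df = df.sort_values(["ID", "Date"]).reset_index(drop=True)

# cumcount numbers the rows inside each ID group 0, 1, 2, ... in their current order
purchase_no = df.groupby("ID").cumcount()

# one dataframe per purchase number; key 1 = first purchase, 2 = second, ...
purchases = {
    int(n) + 1: part.reset_index(drop=True)
    for n, part in df.groupby(purchase_no)
}

for n, part in purchases.items():
    print(f"Transaction {n}:")
    print(part)
```

This version needs no fixed count up front: it produces as many dataframes as the most frequent buyer has purchases, so `purchases[4]` here holds the single fourth purchase of "x".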