Splitting a DataFrame

Time: 2020-07-04 15:03:28

Tags: python dataframe csv

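Can I split a DataFrame based on the leading value of the data?

I have a DataFrame with a time column, and I want to group the rows by time of day: one DataFrame for times between 12 and 3, another for 3 to 6, and so on. Is there a way to do that?

I tried calling .groupby() and passing in the values, but I got a KeyError. Here is my input: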

    ACC_DATE    ACC_TIME    DAY_OF_WEEK COUNTY_NAME INJURY  COLLISION_WITH_1
978 2012-01-21  0:01    SATURDAY    Harford NO  FIXED OBJ
952 2012-01-21  0:01    SATURDAY    Anne Arundel    NO  VEH
995 2012-01-21  0:01    SATURDAY    Prince Georges  NO  VEH
1059 2012-01-22 0:01    SUNDAY      Carroll        YES  FIXED OBJ
941 2012-01-21  0:01    SATURDAY    Prince Georges  NO  FIXED OBJ
... ... ... ... ... ... ...
17535   2012-12-10  9:12    MONDAY  Frederick   NO  FIXED OBJ
17536   2012-12-10  9:12    MONDAY  Frederick   NO  FIXED OBJ
17251   2012-12-07  9:12    FRIDAY  Anne Arundel NO VEH
17507   2012-12-10  9:12    MONDAY  Dorchester  NO  FIXED OBJ
18636   2012-12-31  9:12    MONDAY  Frederick   YES NON-COLLISION

This is the reduced data I am working with:

    ACC_TIME    COUNTY_NAME
ACC_TIME        
0:08    0:08    Allegany
0:09    0:09    Allegany
0:09    0:09    Allegany
0:10    0:10    Allegany
0:10    0:10    Allegany
... ... ...
9:09    9:09    Allegany
9:10    9:10    Allegany
9:10    9:10    Allegany
9:11    9:11    Allegany
9:12    9:12    Allegany

Here is my code:

#--> First, how can I organize my data for only county & times?
sp = df.drop(['ACC_DATE','DAY_OF_WEEK','INJURY','COLLISION_WITH_1'],axis=1)

#Next, how can I organize the data by county and time of accidents? 
# With inplace=True, sort_values() returns None, so there is no result to assign; sp itself is sorted.
sp.sort_values(['COUNTY_NAME', 'ACC_TIME'], inplace=True)

#Now, I want to split sp by county.
sp.set_index(keys=['COUNTY_NAME','ACC_TIME'], drop=False,inplace=True)
names = sp['COUNTY_NAME'].unique().tolist()
times = sp['ACC_TIME'].unique().tolist()
allegany = sp.loc['Allegany']

allegany
# allegany.groupby(['9','10','11','12'])

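My expected output is a list of smaller DataFrames. I would then use the entries of that list as the x values in a scatter plot or bar chart that shows the number of accidents per time period (12-3, 3-6, and so on).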

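1 Answer:

Answer 0: (score: 0)

I believe this is what you are looking for. In this example, I build a list of DataFrames from a single DataFrame, split on the values in column "a".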

import pandas as pd

df = pd.DataFrame({"a": [1, 1, 1, 2, 2, 2], "b": range(6), "c": range(6, 12)})

==>
   a  b   c
0  1  0   6
1  1  1   7
2  1  2   8
3  2  3   9
4  2  4  10
5  2  5  11

Now build the list of DataFrames:

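df_list = []

def to_list(group):
    # Side effect: store a copy of each group's sub-DataFrame.
    df_list.append(group.copy())
    # apply() needs a return value; this Series is a throwaway placeholder.
    return pd.Series(range(3))

df.groupby("a", as_index=False).apply(to_list)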

Output:

print(df_list[0])
#    a  b  c
# 0  1  0  6
# 1  1  1  7
# 2  1  2  8

print(df_list[1])
#    a  b   c
# 3  2  3   9
# 4  2  4  10
# 5  2  5  11
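
As an alternative to collecting groups through a side effect inside apply(), the same list can be sketched by iterating over the GroupBy object directly, since it yields (key, sub-DataFrame) pairs. The name `df_by_a` is introduced here for illustration only and is not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 1, 1, 2, 2, 2], "b": range(6), "c": range(6, 12)})

# Iterating over a GroupBy yields (group key, sub-DataFrame) pairs.
df_list = [group for _, group in df.groupby("a")]

# A dict keyed by the group value is often handier than a positional list.
# (df_by_a is a hypothetical name used only in this sketch.)
df_by_a = {key: group for key, group in df.groupby("a")}
```

Each sub-DataFrame keeps its original row index, so `df_list[1]` still carries the labels 3, 4, 5.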

If the column you want to group by has a datetime dtype, you can do the same thing:

dates = pd.date_range("2020-01-01 00:00", periods=15, freq = "19min")

df = pd.DataFrame({"a": dates, "b": range(len(dates)), "c": range(10, 10+len(dates))})
print(df.head())
==>

                    a  b   c
0 2020-01-01 00:00:00  0  10
1 2020-01-01 00:19:00  1  11
2 2020-01-01 00:38:00  2  12
3 2020-01-01 00:57:00  3  13
4 2020-01-01 01:16:00  4  14
        
df_list = []
df.groupby(df.a.dt.hour, as_index = False).apply(to_list)
print(df_list[1])
==>

                    a  b   c
4 2020-01-01 01:16:00  4  14
5 2020-01-01 01:35:00  5  15
6 2020-01-01 01:54:00  6  16

print(df_list[2])
==>
                    a  b   c
7 2020-01-01 02:13:00  7  17
8 2020-01-01 02:32:00  8  18
9 2020-01-01 02:51:00  9  19
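
The examples above group by single hours, while the question asks for three-hour periods. A minimal sketch of one way to get there follows. It assumes ACC_TIME holds 24-hour "H:MM" strings, as the sample rows suggest, and that the periods start at midnight. The sample rows, the TIME_SLOT column, and the names `edges`, `labels`, `frames`, and `counts` are all made up for illustration.

```python
import pandas as pd

# Hypothetical sample shaped like the question's data (ACC_TIME as "H:MM" strings).
sp = pd.DataFrame({
    "ACC_TIME": ["0:01", "2:59", "3:00", "9:12", "14:30", "23:45"],
    "COUNTY_NAME": ["Harford", "Carroll", "Allegany",
                    "Frederick", "Allegany", "Harford"],
})

# Pull the hour out of each "H:MM" string.
hour = sp["ACC_TIME"].str.split(":").str[0].astype(int)

# Left-closed three-hour bins: [0, 3), [3, 6), ..., [21, 24).
edges = list(range(0, 25, 3))
labels = [f"{lo:02d}-{hi:02d}" for lo, hi in zip(edges[:-1], edges[1:])]
sp["TIME_SLOT"] = pd.cut(hour, bins=edges, right=False, labels=labels)

# One DataFrame per non-empty slot; observed=True skips slots with no rows.
frames = {slot: group for slot, group in sp.groupby("TIME_SLOT", observed=True)}

# Accident count per slot, empty slots included, e.g. as bar-chart heights.
counts = sp.groupby("TIME_SLOT", observed=False).size()
```

Because `pd.cut` produces a categorical column, `counts` keeps the slots in chronological order and reports zero for periods with no accidents, which suits a bar chart. To get per-county results, the same binning could be applied after filtering `sp` on COUNTY_NAME.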