I am working with large data frames (> 100,000 rows and many columns). I need to sort the data frame and then split it into equal-sized groups of a predefined size. If there are any leftover rows (i.e. the number of rows is not evenly divisible by the group size), any smaller group needs to be dropped from the data frame.
For example, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 with a group size of 3 should be split into [1, 2, 3], [4, 5, 6], [7, 8, 9], and 10 should be dropped.
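Spelled out with plain Python lists, the intended split for that example looks like this (just an illustration of the requirement, not a pandas solution):

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
group_size = 3
usable = len(data) - len(data) % group_size   # drop rows that don't fill a complete group
groups = [data[i:i + group_size] for i in range(0, usable, group_size)]
print(groups)   # [[1, 2, 3], [4, 5, 6], [7, 8, 9]]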
I have a solution where I create a new column using list(range(len(df.index) // group_size)) * group_size, then use sort(), and then use groupby() to group the rows together. After that I can use filter() to drop every group that is smaller than group_size.
Example of working code:
import pandas as pd
df = pd.DataFrame([1, 2, 3, 4, 5, 6, 7, 8, 9, 10]) # data frame has been sorted before this point and the rows are in the correct order
group_size = 3
numbers = list(range(len(df.index) // group_size)) * group_size
numbers.sort()
numbers = pd.Series(numbers)
df = pd.concat([df, numbers], ignore_index=True, axis=1)
df.columns = ['value', 'group number']
groups = df.groupby('group number').filter(lambda x: len(x) == group_size)
print(groups)
This works fine. Unfortunately, my data frames are very large and this takes far too long to run. Is there an alternative to my approach?
Answer 0 (score: 0)
This will give you a list of DataFrames:
lst = [df.iloc[i:i+group_size] for i in range(0,len(df)-group_size+1,group_size)]
It only uses the built-in indexing, so it should be pretty fast. The fiddly business with the stop index is there to drop the last frame if it is too small; you can also break that logic out:
lst = [df.iloc[i:i+group_size] for i in range(0, len(df), group_size)]
if len(lst[-1]) < group_size:
    lst.pop()
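Either way, each element of lst is itself a small DataFrame. A quick usage sketch, checking the result with the question's example data (the column name 'value' is assumed for illustration):

import pandas as pd

group_size = 3
df = pd.DataFrame({'value': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]})
lst = [df.iloc[i:i+group_size] for i in range(0, len(df) - group_size + 1, group_size)]
for i, chunk in enumerate(lst):
    print(i, len(chunk), chunk['value'].tolist())
# prints three chunks of length 3; the trailing row with 10 never appears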
Answer 1 (score: 0)
Use slicing to mark the group boundaries, then ffill().
df['group'] = df[::3]
df['group'].ffill(inplace=True)
Now you can do the groupby and discard the groups that are too small.
# df has a RangeIndex, so we get to slice
group_size = 3
df = pd.DataFrame({'a':[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]}) # data frame has been sorted before this point and the rows are in the correct order
slices = df[::group_size]
# but you don't want the group number to be the ordinal at the slices
# so make a copy of the slice to assign good group numbers to it (or get a chained assignment warning)
slices=slices.copy()
slices['group'] = [i for i in range(len(slices))]
df['group'] = slices['group']
# ffill with the nice group numbers
df['group'].ffill(inplace=True)
#now trim the last group
last_group = df['group'].max()
if len(df[df['group'] == last_group]) < group_size:
    df = df[df['group'] != last_group]
print(df)
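As an alternative to trimming by the maximum group label, the question's own groupby/filter trick works on the new 'group' column as well (a short sketch reusing the df and group_size defined in the snippet above, applied before the trimming step):

groups = df.groupby('group').filter(lambda g: len(g) == group_size)
print(groups)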
Timings:
import pandas as pd
from datetime import datetime as dt

print(pd.__version__)

def test1():
    df = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]})  # data frame has been sorted before this point and the rows are in the correct order
    #print(df)
    group_size = 3
    numbers = list(range(len(df.index) // group_size)) * group_size
    numbers.sort()
    numbers = pd.Series(numbers)
    df = pd.concat([df, numbers], ignore_index=True, axis=1)
    df.columns = ['value', 'group number']
    groups = df.groupby('group number').filter(lambda x: len(x) == group_size)
    #print(groups)

def test2():
    # Won't work well because there is no easy way to calculate the remainder that should
    # not be grouped. But cut() is good for discretizing continuous values
    df = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]})  # data frame has been sorted before this point and the rows are in the correct order
    num_groups = len(df.index) // 3
    df['group'] = pd.cut(df['a'], num_groups, right=False)
    #print(df)

def test3():
    # df has a RangeIndex, so we get to slice
    df = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]})  # data frame has been sorted before this point and the rows are in the correct order
    df['group'] = df[::3]
    df['group'].ffill(inplace=True)
    #print(df['group'])

def test4():
    # A mask can also be used
    df = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]})  # data frame has been sorted before this point and the rows are in the correct order
    df['group'] = df[df.index % 3 == 0]
    df['group'].ffill(inplace=True)
    #print(df)

def test5():
    # maybe go after grouping with iloc
    df = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]})  # data frame has been sorted before this point and the rows are in the correct order
    group = 0
    for i in range(0, len(df), 3):
        # .loc slices are end-inclusive; the overlapping row is overwritten by the next iteration
        df.loc[i:i+3, 'group'] = group
        group += 1
    #print(df)

funcs = [test1, test2, test3, test4, test5]
for func in funcs:
    print(func.__name__)
    a = dt.now()
    for i in range(1000):
        func()
    b = dt.now()
    print(b - a)
Answer 2 (score: 0)
This is a variation on Perigon's answer. In my case I didn't want to throw the last few rows away, so this shows how to put the remainder into the final list as well. I was reading a CSV and wanted to do multiprocessing, so I pass the smaller data frames to separate processes and must not lose any rows from the CSV. So in my case desired_number_of_groups is set to the number of processes I want to multiprocess with.
import pandas as pd

test_dict = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21]
df = pd.DataFrame.from_dict(test_dict)
print('Size of dataFrame=', len(df.index))

desired_number_of_groups = 4
group_size = int(len(df.index) / (desired_number_of_groups))
print("group_size=", group_size)
remainder_size = len(df.index) % group_size
print("remainder_size=", remainder_size)

df_split_list = [df.iloc[i:i + group_size] for i in range(0, len(df) - group_size + 1, group_size)]
print("Number of split_dataframes=", len(df_split_list))

if remainder_size > 0:
    df_remainder = df.iloc[-remainder_size:len(df.index)]
    df_split_list.append(df_remainder)
    print("Revised Number of split_dataframes=", len(df_split_list))

print("Splitting complete, verifying counts")

count_all_rows_after_split = 0
for index, split_df in enumerate(df_split_list):
    print("split_df:", index, " size=", len(split_df.index))
    count_all_rows_after_split += len(split_df.index)

if count_all_rows_after_split != len(df.index):
    raise Exception('count_all_rows_after_split = ', count_all_rows_after_split,
                    " but original CSV DataFrame has count =", len(df.index)
                    )
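For the multiprocessing use case described above, here is a rough sketch of how the split frames could then be handed to worker processes. It assumes df_split_list and desired_number_of_groups from the snippet above; process_chunk is a hypothetical worker function, and multiprocessing.Pool is just one way to fan the work out:

from multiprocessing import Pool

def process_chunk(chunk_df):
    # hypothetical per-process work; here it just returns the row count
    return len(chunk_df.index)

if __name__ == '__main__':
    with Pool(processes=desired_number_of_groups) as pool:
        results = pool.map(process_chunk, df_split_list)
    print(results)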
Rich did a better job with unit test cases. I only tested test_dict with 1:17, then 1:18, then 1:19, then 1:20, then 1:21.