I am working with large data frames (> 100,000 rows and many columns). I need to sort the data frame and then split it into equal-sized groups of a predefined size. If there are any leftover rows (i.e. the number of rows is not evenly divisible by the group size), any smaller group needs to be dropped from the data frame.
For example, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 with a group size of 3 should be split into [1, 2, 3], [4, 5, 6], [7, 8, 9], and 10 should be dropped.
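Spelled out with plain Python lists, the intended split for that example looks like this (just an illustration of the requirement, not a pandas solution):

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
group_size = 3
usable = len(data) - len(data) % group_size   # drop rows that don't fill a complete group
groups = [data[i:i + group_size] for i in range(0, usable, group_size)]
print(groups)   # [[1, 2, 3], [4, 5, 6], [7, 8, 9]]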
I have a solution where I create a new column using list(range(len(df.index) // group_size)) * group_size, then use sort(), and then use groupby() to group the rows together. After that I can use filter() to drop every group that is smaller than group_size.
Example of working code:
import pandas as pd
df = pd.DataFrame([1, 2, 3, 4, 5, 6, 7, 8, 9, 10]) # data frame has been sorted before this point and the rows are in the correct order
group_size = 3
numbers = list(range(len(df.index) // group_size)) * group_size
numbers.sort()
numbers = pd.Series(numbers)
df = pd.concat([df, numbers], ignore_index=True, axis=1)
df.columns = ['value', 'group number']
groups = df.groupby('group number').filter(lambda x: len(x) == group_size)
print(groups)
This works fine. Unfortunately, my data frames are very large and this takes far too long to run. Is there an alternative to my approach?
Answer 0 (score: 0)
This will give you a list of DataFrames:
lst = [df.iloc[i:i+group_size] for i in range(0,len(df)-group_size+1,group_size)]
It only uses the built-in indexing, so it should be pretty fast. The fiddly business with the stop index is there to drop the last frame if it is too small; you can also break that logic out:
lst = [df.iloc[i:i+group_size] for i in range(0, len(df), group_size)]
if len(lst[-1]) < group_size:
    lst.pop()
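Either way, each element of lst is itself a small DataFrame. A quick usage sketch, checking the result with the question's example data (the column name 'value' is assumed for illustration):

import pandas as pd

group_size = 3
df = pd.DataFrame({'value': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]})
lst = [df.iloc[i:i+group_size] for i in range(0, len(df) - group_size + 1, group_size)]
for i, chunk in enumerate(lst):
    print(i, len(chunk), chunk['value'].tolist())
# prints three chunks of length 3; the trailing row with 10 never appears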
Answer 1 (score: 0)
Use slicing to mark the group boundaries, then ffill().
df['group'] = df[::3]
df['group'].ffill(inplace=True)
Now you can do the groupby and discard the groups that are too small.
# df has a RangeIndex, so we get to slice
group_size = 3
df = pd.DataFrame({'a':[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]}) # data frame has been sorted before this point and the rows are in the correct order
slices = df[::group_size]
# but you don't want the group number to be the ordinal at the slices
# so make a copy of the slice to assign good group numbers to it (or get a chained assignment warning)
slices=slices.copy()
slices['group'] = [i for i in range(len(slices))]
df['group'] = slices['group']
# ffill with the nice group numbers
df['group'].ffill(inplace=True)
#now trim the last group
last_group = df['group'].max()
if len(df[df['group'] == last_group]) < group_size:
    df = df[df['group'] != last_group]
print(df)
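As an alternative to trimming by the maximum group label, the question's own groupby/filter trick works on the new 'group' column as well (a short sketch reusing the df and group_size defined in the snippet above, applied before the trimming step):

groups = df.groupby('group').filter(lambda g: len(g) == group_size)
print(groups)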
Timings:
import pandas as pd
from datetime import datetime as dt

print(pd.__version__)

def test1():
    df = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]})  # data frame has been sorted before this point and the rows are in the correct order
    #print(df)
    group_size = 3
    numbers = list(range(len(df.index) // group_size)) * group_size
    numbers.sort()
    numbers = pd.Series(numbers)
    df = pd.concat([df, numbers], ignore_index=True, axis=1)
    df.columns = ['value', 'group number']
    groups = df.groupby('group number').filter(lambda x: len(x) == group_size)
    #print(groups)

def test2():
    # Won't work well because there is no easy way to calculate the remainder that should
    # not be grouped. But cut() is good for discretizing continuous values
    df = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]})  # data frame has been sorted before this point and the rows are in the correct order
    num_groups = len(df.index) // 3
    df['group'] = pd.cut(df['a'], num_groups, right=False)
    #print(df)

def test3():
    # df has a RangeIndex, so we get to slice
    df = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]})  # data frame has been sorted before this point and the rows are in the correct order
    df['group'] = df[::3]
    df['group'].ffill(inplace=True)
    #print(df['group'])

def test4():
    # A mask can also be used
    df = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]})  # data frame has been sorted before this point and the rows are in the correct order
    df['group'] = df[df.index % 3 == 0]
    df['group'].ffill(inplace=True)
    #print(df)

def test5():
    # maybe go after grouping with iloc
    df = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]})  # data frame has been sorted before this point and the rows are in the correct order
    group = 0
    for i in range(0, len(df), 3):
        # .loc slices are end-inclusive; the overlapping row is overwritten by the next iteration
        df.loc[i:i+3, 'group'] = group
        group += 1
    #print(df)

funcs = [test1, test2, test3, test4, test5]
for func in funcs:
    print(func.__name__)
    a = dt.now()
    for i in range(1000):
        func()
    b = dt.now()
    print(b - a)
Answer 2 (score: 0)
This is a variation on Perigon's answer. In my case I didn't want to throw the last few rows away, so this shows how to put the remainder into the final list as well. I was reading a CSV and wanted to do multiprocessing, so I pass the smaller data frames to separate processes and must not lose any rows from the CSV. So in my case desired_number_of_groups is set to the number of processes I want to multiprocess with.
import pandas as pd

test_dict = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21]
df = pd.DataFrame.from_dict(test_dict)
print('Size of dataFrame=', len(df.index))

desired_number_of_groups = 4
group_size = int(len(df.index) / (desired_number_of_groups))
print("group_size=", group_size)
remainder_size = len(df.index) % group_size
print("remainder_size=", remainder_size)

df_split_list = [df.iloc[i:i + group_size] for i in range(0, len(df) - group_size + 1, group_size)]
print("Number of split_dataframes=", len(df_split_list))

if remainder_size > 0:
    df_remainder = df.iloc[-remainder_size:len(df.index)]
    df_split_list.append(df_remainder)
    print("Revised Number of split_dataframes=", len(df_split_list))

print("Splitting complete, verifying counts")

count_all_rows_after_split = 0
for index, split_df in enumerate(df_split_list):
    print("split_df:", index, " size=", len(split_df.index))
    count_all_rows_after_split += len(split_df.index)

if count_all_rows_after_split != len(df.index):
    raise Exception('count_all_rows_after_split = ', count_all_rows_after_split,
                    " but original CSV DataFrame has count =", len(df.index)
                    )
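For the multiprocessing use case described above, here is a rough sketch of how the split frames could then be handed to worker processes. It assumes df_split_list and desired_number_of_groups from the snippet above; process_chunk is a hypothetical worker function, and multiprocessing.Pool is just one way to fan the work out:

from multiprocessing import Pool

def process_chunk(chunk_df):
    # hypothetical per-process work; here it just returns the row count
    return len(chunk_df.index)

if __name__ == '__main__':
    with Pool(processes=desired_number_of_groups) as pool:
        results = pool.map(process_chunk, df_split_list)
    print(results)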
Rich did a better job with unit test cases. I only tested test_dict with 1:17, then 1:18, then 1:19, then 1:20, then 1:21.