Question

I have the following scenario (a test to run performance benchmarks):

def read_sql_query(query, chunk_size, cnxn):
    try:
        df = pd.read_sql_query(query, cnxn, index_col=['product_key'], chunksize=100000)
        return df
    except Exception as e:
        print(e)

def return_chunks_in_df(df, start_date, end_date):
    try:

        sub_df = pd.DataFrame()
        for chunks in df:            
            sub_df = pd.concat([sub_df, chunks.loc[(chunks['trans_date'] > start_date) & (chunks['trans_date'] < end_date)]], ignore_index=True)
        print(sub_df.info())
        return sub_df    
    except Exception as e:
        print(e)

query = r"select * from  sales_rollup where  product_key in (select product_key from temp limit 10000)"

start_time = timeit.default_timer()
df = read_sql_query(query, 100000, cnxn)
print(df)
print('time to chunk:' + str(timeit.default_timer() - start_time))

#scenario 1
start_time = timeit.default_timer()
sub_df1 = return_chunks_in_df(df, '2015-01-01', '2016-01-01')
print('scenario1:' + str(timeit.default_timer() - start_time))

#scenario 2    
start_time = timeit.default_timer()
sub_df2 = return_chunks_in_df(df, '2016-01-01', '2016-12-31')
print('scenario2:' + str(timeit.default_timer() - start_time))

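The problem I am facing is that in scenario 2 the DataFrame always comes back with 0 rows, even though there is data in the filtered date range. I tried looping over df, but the loop below never runs: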

for chunks in df:
    print(chunks.info())

I only get a result set for scenario 2 if I recreate df again before executing it:

df = read_sql_query(query, 100000, cnxn)

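The core issue is that whichever scenario runs first always returns values and the second one does not. Does the df object somehow expire after the first pass over it? Any help/pointers are highly appreciated.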

Answer 1

A generator is "used up" after the first run:

def gen(n):
   for i in range(n):
       yield i

In [11]: g = gen(3)

In [12]: list(g)
Out[12]: [0, 1, 2]

In [13]: list(g)
Out[13]: []
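
The same thing happens with the object that pd.read_sql_query returns when chunksize is set: it is a one-shot iterator of DataFrames. A minimal sketch, assuming an in-memory SQLite table as a stand-in for sales_rollup (the table contents are made up for illustration):

```python
import sqlite3

import pandas as pd

# Hypothetical in-memory table standing in for sales_rollup.
cnxn = sqlite3.connect(":memory:")
cnxn.execute("create table sales_rollup (product_key integer, trans_date text)")
cnxn.executemany(
    "insert into sales_rollup values (?, ?)",
    [(1, "2015-03-01"), (2, "2015-09-15"), (3, "2016-02-01"), (4, "2016-07-04")],
)

# With chunksize set, the result is an iterator of DataFrames, not a DataFrame.
chunks = pd.read_sql_query("select * from sales_rollup", cnxn, chunksize=2)

first_pass = [len(chunk) for chunk in chunks]   # consumes every chunk
second_pass = [len(chunk) for chunk in chunks]  # nothing left to yield

print(first_pass)
print(second_pass)
```

The second list comes out empty for the same reason the second list(g) above does, which matches the 0 rows seen in scenario 2.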

To reuse the chunks, you could refactor so that each chunk is passed to both filters:

def concat_chunk(acc, chunk, start_date, end_date):
    return pd.concat([acc, chunk.loc[(chunk['trans_date'] > start_date) & (chunk['trans_date'] < end_date)]], ignore_index=True)

sub_df1 = pd.DataFrame()
sub_df2 = pd.DataFrame()
for chunk in df:
    sub_df1 = concat_chunk(sub_df1, chunk, '2015-01-01', '2016-01-01')
    sub_df2 = concat_chunk(sub_df2, chunk, '2016-01-01', '2016-12-31')
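
A variation on this loop, sketched under the assumption that the filtered rows fit in memory: collecting the filtered pieces in lists and concatenating once at the end avoids re-copying the accumulated frame on every iteration. Here filter_chunk and the sample chunks are hypothetical names and data, not part of the original code:

```python
import pandas as pd


def filter_chunk(chunk, start_date, end_date):
    """Return the rows of one chunk whose trans_date is strictly inside the range."""
    mask = (chunk["trans_date"] > start_date) & (chunk["trans_date"] < end_date)
    return chunk.loc[mask]


# Hypothetical stand-in for the chunk iterator returned by pd.read_sql_query.
chunks = iter([
    pd.DataFrame({"trans_date": ["2015-03-01", "2016-02-01"], "amount": [10, 20]}),
    pd.DataFrame({"trans_date": ["2015-09-15", "2016-07-04"], "amount": [30, 40]}),
])

parts1, parts2 = [], []
for chunk in chunks:  # single pass; every chunk feeds both date ranges
    parts1.append(filter_chunk(chunk, "2015-01-01", "2016-01-01"))
    parts2.append(filter_chunk(chunk, "2016-01-01", "2016-12-31"))

# One concat per result. Note that pd.concat raises ValueError on an empty list,
# so this assumes the query yields at least one chunk.
sub_df1 = pd.concat(parts1, ignore_index=True)
sub_df2 = pd.concat(parts2, ignore_index=True)
```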

Note: handling both date ranges in a single loop like this will throw off your per-scenario timings...

You may also want to move the where logic into the SQL:

query = r"""select * from sales_rollup
            where product_key in (select product_key from temp limit 10000) 
            and '2015-01-01' < trans_date
            and trans_date < '2016-01-01'"""

That way, maybe you don't need chunks at all!
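
If the date range varies between scenarios, the bounds can be passed as query parameters instead of being written into the SQL string. A minimal sketch using the standard-library sqlite3 driver; the table contents and the rows_between helper are assumptions for illustration, and the placeholder style ("?" here) depends on the database driver in use:

```python
import sqlite3

# Hypothetical in-memory table standing in for sales_rollup.
cnxn = sqlite3.connect(":memory:")
cnxn.execute("create table sales_rollup (product_key integer, trans_date text)")
cnxn.executemany(
    "insert into sales_rollup values (?, ?)",
    [(1, "2015-03-01"), (2, "2015-09-15"), (3, "2016-02-01"), (4, "2016-07-04")],
)

query = """select product_key, trans_date from sales_rollup
           where ? < trans_date
           and trans_date < ?
           order by product_key"""


def rows_between(cnxn, start_date, end_date):
    """Let the database do the date filtering; bounds are bound parameters."""
    return cnxn.execute(query, (start_date, end_date)).fetchall()


rows_2015 = rows_between(cnxn, "2015-01-01", "2016-01-01")
rows_2016 = rows_between(cnxn, "2016-01-01", "2016-12-31")
```

pd.read_sql_query accepts the same bounds through its params argument, so each scenario can issue its own filtered query rather than sharing one iterator.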

In general, the way to reuse a generator is to "just make it a list"... but that often defeats the point (of building it up piecemeal):

chunks = list(df)  # Note chunks is probably a more descriptive name...
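
When holding every chunk in a list is too expensive, another option is to keep a way of rebuilding the iterator rather than the iterator itself. A minimal sketch; the Reiterable class is a hypothetical helper, not something pandas provides:

```python
class Reiterable:
    """Wrap a zero-argument factory so every for-loop gets a fresh iterator."""

    def __init__(self, factory):
        self._factory = factory

    def __iter__(self):
        # Called once per pass; the factory builds a brand-new iterator each time.
        return iter(self._factory())


def gen(n):
    for i in range(n):
        yield i


g = Reiterable(lambda: gen(3))

first = list(g)
second = list(g)  # not empty: the factory produced a new generator

print(first, second)
```

For the question's code the factory would be something like lambda: pd.read_sql_query(query, cnxn, chunksize=100000). The trade-off is that the query is executed again on every pass, which is what recreating df by hand before scenario 2 already does.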

Answer 2

sub_df = pd.DataFrame()
for chunks in df:            
    sub_df = pd.concat([sub_df, ...
print(sub_df.info())
return sub_df

I don't know why you set sub_df twice; the second assignment makes the first one pointless.

To figure out this kind of problem, you need to work backwards. First, run just this one command on its own:

sub_df = pd.concat([sub_df, ...

Supply its arguments as static values rather than variables.

If that works, then you need to find out why your original program fails to supply the correct arguments to pd.concat.

Unable to reuse a pandas generator object
