Question

I have to run an ETL job every day and append its result to a single DataFrame. For example, these are the outputs of each day's ETL:

df1: 
    id category quantity date
    1   abc       100    01-07-18
    2   deg       175    01-07-18
    .....
df2: 
    id category quantity date
    1   abc       50     02-07-18
    2   deg       300    02-07-18
    3   zzz       250    02-07-18
    .....
df3: 
    id category quantity date
    1   abc       500    03-07-18
    .....
df4: 
    id category quantity date
    5   jjj       200    04-07-18
    7   ddd       100    04-07-18
    .....

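Each day's ETL produces a new DataFrame (df1, df2, df3, ...), and after each day's run that DataFrame should be appended to the results of the earlier days.

Expected final output: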

After day 2, the output should be:
finaldf: 
    id category quantity date
    1   abc       100    01-07-18
    2   deg       175    01-07-18
    1   abc       50     02-07-18
    2   deg       300    02-07-18
    3   zzz       250    02-07-18
    .....


After day 4, the output should be:
finaldf: 
    id category quantity date
    1   abc       100    01-07-18
    2   deg       175    01-07-18
    1   abc       50     02-07-18
    2   deg       300    02-07-18
    3   zzz       250    02-07-18
    1   abc       500    03-07-18
    5   jjj       200    04-07-18
    7   ddd       100    04-07-18
    .....

I have already done this in Pandas using the append function, but because the data volume is very large I run into a MemoryError.

Answer 1

A PySpark answer

Put all the DataFrames into a list and fold them together with union:

from functools import reduce  # not a builtin in Python 3

df_list = [df1, df2, df3, df4]
finaldf = reduce(lambda x, y: x.union(y), df_list)

finaldf will contain all the data.
