Pandas column names are printed instead of the whole DataFrame

Time: 2019-03-20 14:34:27

Tags: python python-3.x pandas multiprocessing

I have some code that creates a generator with read_sql() and loops over that generator to print each chunk:

execute.py

import pandas as pd
from sqlalchemy import event, create_engine

engine = create_engine('path-to-driver')

def getDistance(chunk):
    print(chunk)
    print(type(chunk))

df_chunks = pd.read_sql("select top 2 * from SCHEMA.table_name", engine, chunksize=1)

for chunk in df_chunks:
    result = getDistance(chunk)

This works, and each chunk is printed as a DataFrame. When I try to do the same thing with multiprocessing, like this...

outside_function.py

def getDistance(chunk):
    print(chunk)
    print(type(chunk))
    df = chunk
    return df

execute.py

import pandas as pd
from multiprocessing import Pool
from sqlalchemy import event, create_engine

from outside_function import getDistance

engine = create_engine('path-to-driver')

df_chunks = pd.read_sql("select top 2 * from SCHEMA.table_name", engine, chunksize=1)

if __name__ == '__main__':
    global result
    p = Pool(20)
    for chunk in df_chunks:
        print(chunk)
        result = p.map(getDistance, chunk)
    p.terminate()
    p.join()

...the chunks are printed in the console as column names of type `str`. Printing result shows ['column_name'].

Why do the chunks turn into strings that contain only the column name when I apply multiprocessing?

1 answer:

Answer 0: (score: 1)

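This is because p.map takes a function and an iterable, and calls the function once for each item of the iterable. Iterating over a DataFrame (in this case, your chunk) yields its column names, so getDistance receives each column name as a str.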
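
The iteration behavior can be checked without a database or a pool. The snippet below is only a minimal sketch: the DataFrame and its column names (`id`, `city`) are made up to stand in for one chunk from read_sql().

```python
import pandas as pd

# Hypothetical stand-in for one chunk returned by read_sql(..., chunksize=1)
chunk = pd.DataFrame({"id": [1], "city": ["Oslo"]})

# Iterating over a DataFrame walks its column labels, not its rows
labels = list(chunk)
print(labels)

# Any map-style call over the chunk therefore hands the function
# one column label (a str) at a time
item_types = [type(item).__name__ for item in chunk]
print(item_types)
```

This matches what the question reports: inside the loop, each call to getDistance sees a single str rather than the DataFrame.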

You need to pass an iterable of DataFrames to the map method instead, i.e.:

    global result
    p = Pool(20)
    result = p.map(getDistance, df_chunks)
    p.terminate()
    p.join()
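
For a version that runs without a database, here is a rough sketch of the same idea. The `make_chunks` generator and its data are invented to imitate read_sql() with chunksize=1, and `get_distance` is a placeholder for the real work; neither comes from the question.

```python
from multiprocessing import Pool

import pandas as pd


def get_distance(chunk):
    """Placeholder work: receives one whole DataFrame and returns it."""
    return chunk


def make_chunks():
    """Hypothetical stand-in for pd.read_sql(..., chunksize=1)."""
    df = pd.DataFrame({"id": [1, 2], "city": ["Oslo", "Lima"]})
    # Yield one-row DataFrames, one per chunk
    for start in range(len(df)):
        yield df.iloc[start:start + 1]


if __name__ == "__main__":
    # The iterable given to map is the generator of DataFrames,
    # so each worker call gets one DataFrame as its argument
    with Pool(2) as pool:
        results = pool.map(get_distance, make_chunks())
    print([type(r).__name__ for r in results])
```

Note that map consumes the whole generator before dispatching work, so all chunks are held in memory at once, and each DataFrame is pickled to be sent to a worker process.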