Question

I have a DataFrame in pandas containing the following information:

Using a for loop over each entry in TRANSACTION_ID, I call the following function:

def checkForImages(TransNum):
    """pass function a transaction number and get the string with image found information then store that
    string into the same row in a new column"""
    try:
        cursor.execute('select CAMERA_TYPE from VEHICLE_IMAGE where TRANSACTION_ID=' + str(TransNum))
        result = ''
        for img_type in cursor:
            result = result + img_type[0]
        if result == '':
            result = 'No image available'
        print('Images found: ' + str(TransNum) + " " + result)
        resultSort = result.split()
        resultSort.sort()
        result = ''
        for i in range(len(resultSort)):
            result = result + " " + resultSort[i]
        cursor.close()
        return result
    except Exception as e:
        # print('Error occurred while getting image references: ', e)
        pass

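This function returns a string that is either "No image available" or the image information, if any was found. I need to create a new column in the DataFrame populated with this result, so that my final DataFrame looks like this: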

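My question is: how can I speed this process up? Running a for loop over 100k+ rows is very slow and painful. I have looked at functions such as dataframe.map and dataframe.apply, but could not get them to work. The other options I see are Cython or multithreading. Which option should I invest my time in? Any help is appreciated.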

Answer 1

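You are querying Oracle once per transaction and then aggregating the fetched data in a loop for each transaction, which is very inefficient.

First, I would create a "mapping" DataFrame like this: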

transaction_id               images
           111   No image available
           112           FRONT REAR
           113             OVERVIEW

This can be done with the following query:

qry = """
select
  transaction_id,
  NVL(listagg(camera_type, ' ') within group (order by camera_type), 'No image available') as images
from vehicle_image group by transaction_id
"""

# `engine` is a SQLAlchemy engine connection ...
cam = pd.read_sql(qry, con=engine, index_col=['transaction_id'])
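If aggregating inside the query is not an option, the same mapping could probably be built on the pandas side from the raw rows. The following is only a sketch under assumptions: the `rows` DataFrame is hypothetical sample data standing in for the result of one unfiltered `select transaction_id, camera_type from vehicle_image` query, and `join_sorted` is an illustrative helper name.

```python
import pandas as pd

# Hypothetical sample rows, standing in for the result of a single
# "select transaction_id, camera_type from vehicle_image" query.
rows = pd.DataFrame({
    "transaction_id": [112, 112, 113],
    "camera_type": ["REAR", "FRONT", "OVERVIEW"],
})


def join_sorted(values):
    """Join the camera types of one transaction, sorted alphabetically."""
    return " ".join(sorted(values))


# One string per transaction_id; the index of the result is transaction_id.
cam_images = rows.groupby("transaction_id")["camera_type"].agg(join_sorted)
cam_images = cam_images.rename("images")

print(cam_images)
```

Either way, the point is the same: the database is hit once for all transactions rather than once per row.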

After that we can use the Series.map() method:

df['Image_Found'] = df.transaction_id.map(cam.images)
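One thing to watch for: because the query groups the rows of vehicle_image, a transaction that has no rows in that table does not appear in `cam` at all, and `Series.map()` yields NaN for it. A minimal sketch of the lookup with a fallback, using small hypothetical frames in place of the real `df` and `cam`:

```python
import pandas as pd

# Hypothetical stand-ins for the real data: `df` holds the transactions,
# `cam` is the mapping indexed by transaction_id (as read_sql would return it).
df = pd.DataFrame({"transaction_id": [111, 112, 113]})
cam = pd.DataFrame(
    {"images": ["FRONT REAR", "OVERVIEW"]},
    index=pd.Index([112, 113], name="transaction_id"),
)

# Look up each transaction_id in the mapping; ids missing from `cam`
# come back as NaN, so they are filled with the fallback text.
df["Image_Found"] = (
    df["transaction_id"].map(cam["images"]).fillna("No image available")
)

print(df)
```

Here transaction 111 has no entry in the mapping and ends up with "No image available", while 112 and 113 get their image strings.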
