Question

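I'm using PyCharm Professional to connect to AWS Athena. The connection succeeds, but whenever I run a query I get:

The requested fetchSize is more than the allowed value in Athena. Please reduce the fetchSize and try again. Refer to the Athena documentation for valid fetchSize values.

I downloaded the Athena JDBC driver.

What could be the problem?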

Answer 1

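One thing to keep in mind about fetch size with JDBC and AWS Athena: there appears to be a semi-documented but well-known limit of 1000 rows per fetch. I know that the popular PyAthenaJDBC library sets that as its default fetch size, so this may be part of your problem.

I can reproduce the fetch size error when I try to fetch more than 1000 rows at a time.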

from pyathenajdbc import connect 
conn = connect(s3_staging_dir='s3://SOMEBUCKET/', 
region_name='us-east-1')
cur = conn.cursor()
cur.execute('SELECT * FROM SOMEDATABASE.big_table LIMIT 5000')
results = cur.fetchall()
print(len(results))
# Note: The cursor class actually has a setter method to 
#       keep users from setting illegal fetch sizes   
cur._arraysize = 1001 # Set array size one greater than the default
cur.execute('SELECT * FROM SOMEDATABASE.big_table LIMIT 5000')
results = cur.fetchall() # Generate an error

java.sql.SQLExceptionPyRaisable: java.sql.SQLException: The requested fetchSize is more than the allowed value in Athena. Please reduce the fetchSize and try again. Refer to the Athena documentation for valid fetchSize values.
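The setter guard mentioned in the code comment above can be pictured with a small sketch. This is not PyAthenaJDBC's actual code: the class name `BoundedCursor` and the constant `MAX_FETCH_SIZE` are hypothetical, and 1000 is simply the limit reported in this answer.

```python
# Hypothetical sketch: NOT PyAthenaJDBC's real implementation.
MAX_FETCH_SIZE = 1000  # assumed limit, taken from the answer above


class BoundedCursor:
    """Toy cursor that refuses fetch sizes Athena would reject."""

    def __init__(self, arraysize=MAX_FETCH_SIZE):
        self._arraysize = MAX_FETCH_SIZE
        self.arraysize = arraysize  # route through the validating setter

    @property
    def arraysize(self):
        return self._arraysize

    @arraysize.setter
    def arraysize(self, value):
        # Reject anything outside the range the service is assumed to accept
        if not 1 <= value <= MAX_FETCH_SIZE:
            raise ValueError(
                "arraysize must be between 1 and %d, got %r"
                % (MAX_FETCH_SIZE, value)
            )
        self._arraysize = value
```

Writing to `_arraysize` directly, as the example above does, skips a check of this kind, so the illegal value is only caught later, when the driver actually fetches.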

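Potential solutions include:

Run the query in the web GUI and then download the result set manually.
Develop your queries in the editor/IDE of your choice (DataGrip, the Athena web GUI, etc.) and pass the query strings to Athena through the Python SDK. You can then wait for the query to finish and fetch the result set.
Execute the query and page through the results.
If you are calling SQL from Python (which I am inferring from the PyCharm tag), you can use a library such as PyAthenaJDBC, which handles page sizing for you (see the example above).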
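For the "page through the results" option, a minimal sketch using the boto3 `get_query_results` paginator might look like the following. The helper name `iter_result_rows` is hypothetical, the query is assumed to have already finished, and the page size is kept at the 1000-row limit mentioned above.

```python
def iter_result_rows(client, execution_id, page_size=1000):
    """Yield each row of a finished Athena query as a list of strings."""
    paginator = client.get_paginator('get_query_results')
    pages = paginator.paginate(
        QueryExecutionId=execution_id,
        PaginationConfig={'PageSize': page_size},
    )
    for page in pages:
        for row in page['ResultSet']['Rows']:
            # Each cell is a dict; NULL values have no 'VarCharValue' key
            yield [cell.get('VarCharValue') for cell in row['Data']]


# Example usage (needs boto3 and AWS credentials; the ID is a placeholder):
# import boto3
# client = boto3.client('athena')
# for row in iter_result_rows(client, 'QUERY-EXECUTION-ID'):
#     print(row)
```

For SELECT queries the first row returned is typically the column header row, so callers may want to treat it separately.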

For many of my Python scripts, I use a workflow similar to the following.

import boto3
import time

sql = 'SELECT * from athena_test.big_table'

database = 'SOMEDATABASE'
bucket_name = 'SOMEBUCKET' 
output_path = '/home/zerodf/temp/somedata.csv'

client = boto3.client('athena')
config = {'OutputLocation': 's3://' + bucket_name + '/',
      'EncryptionConfiguration': {'EncryptionOption': 'SSE_S3'}}

execution_results = client.start_query_execution(QueryString = sql,
                                             QueryExecutionContext =
                                             {'Database': database},
                                             ResultConfiguration = config)

execution_id = str(execution_results[u'QueryExecutionId'])
remote_file = execution_id + '.csv'

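while True:
    query_execution_results = client.get_query_execution(QueryExecutionId =
                                                     execution_id)
    state = query_execution_results['QueryExecution']['Status']['State']
    if state == 'SUCCEEDED':
        break
    elif state in ('FAILED', 'CANCELLED'):
        # Stop instead of polling forever when the query cannot finish
        raise RuntimeError('Athena query ended in state ' + state)
    else:
        time.sleep(60)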

s3 = boto3.resource('s3')
s3.Bucket(bucket_name).download_file(remote_file, output_path)

Obviously, production code is more complex.
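As one illustration of what more complete code might add, here is a sketch of a reusable wait helper with a timeout. The function name `wait_for_query` and its parameters are hypothetical; it relies only on the `get_query_execution` call used in the script above.

```python
import time


def wait_for_query(client, execution_id, poll_seconds=5, timeout_seconds=600,
                   sleep=time.sleep):
    """Poll Athena until the query succeeds; raise on failure or timeout."""
    waited = 0
    while True:
        response = client.get_query_execution(QueryExecutionId=execution_id)
        status = response['QueryExecution']['Status']
        state = status['State']
        if state == 'SUCCEEDED':
            return response['QueryExecution']
        if state in ('FAILED', 'CANCELLED'):
            reason = status.get('StateChangeReason', 'no reason given')
            raise RuntimeError('Query %s %s: %s'
                               % (execution_id, state, reason))
        if waited >= timeout_seconds:
            raise TimeoutError('Query %s still %s after %d seconds'
                               % (execution_id, state, waited))
        # The sleep function is injectable so tests can run without delays
        sleep(poll_seconds)
        waited += poll_seconds
```

A shorter polling interval with an overall timeout keeps fast queries responsive while still bounding how long a stuck query can block the script.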

Answer 2

I think you should set an appropriate value for this setting in DataGrip.

Connecting to AWS Athena via JDBC using PyCharm - fetchSize issue
