Python / Pandas: execute SQL and store the results

Time: 2018-04-05 17:21:31

Tags: python sql pandas

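I want to run some data analysis on a few Teradata database tables, and I quickly realized that, given the size of the tables (millions of records), pulling a table straight from the database into a Pandas DataFrame is not the best option.

I came up with a SQL query that, when run, generates the set of queries I need to get the results I am looking for (distinct, max, min, null count, and so on), and I want to embed it in my Python script.

The query looks like this: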

SELECT 'SELECT ''' || TRIM(COLUMNNAME)
|| ''', COUNT(DISTINCT ' || COLUMNNAME || ') AS DISTINCT_COUNT,'
|| ' COUNT(1) - COUNT( ' || COLUMNNAME || ') AS NULL_COUNT,'
|| ' MAX( ' || COLUMNNAME || ') AS MAX_COL_VALUE,'
|| ' MIN( ' || COLUMNNAME || ') AS MIN_COL_VALUE'
|| ' FROM ' || TRIM(DATABASENAME) || '.' || TRIM(TABLENAME) || ';'
FROM DBC.COLUMNSV
WHERE DATABASENAME = 'XYZ'
AND TABLENAME = 'ABC';

The result of executing that query is a set of individual queries (about 30 for the table I am testing against).

I executed the above with the following:

import pandas as pd

# The generator SQL is passed to pandas as a string;
# `session` is the existing Teradata connection.
generator_sql = """
SELECT 'SELECT ''' || TRIM(COLUMNNAME)
|| ''', COUNT(DISTINCT ' || COLUMNNAME || ') AS DISTINCT_COUNT,'
|| ' COUNT(1) - COUNT( ' || COLUMNNAME || ') AS NULL_COUNT,'
|| ' MAX( ' || COLUMNNAME || ') AS MAX_COL_VALUE,'
|| ' MIN( ' || COLUMNNAME || ') AS MIN_COL_VALUE'
|| ' FROM ' || TRIM(DATABASENAME) || '.' || TRIM(TABLENAME) || ';'
FROM DBC.COLUMNSV
WHERE DATABASENAME = 'XYZ'
AND TABLENAME = 'ABC';
"""
results = pd.read_sql(generator_sql, session)

Running the above gives the following:

SELECT 'ColumnA_ID', COUNT(DISTINCT ColumnA_ID) AS DISTINCT_COUNT, COUNT(1) - COUNT( ColumnA_ID) AS NULL_COUNT, MAX( ColumnA_ID) AS MAX_COL_VALUE, MIN( ColumnA_ID) AS MIN_COL_VALUE FROM Table123;
SELECT 'ColumnB', COUNT(DISTINCT ColumnB) AS DISTINCT_COUNT, COUNT(1) - COUNT( ColumnB) AS NULL_COUNT, MAX( ColumnB) AS MAX_COL_VALUE, MIN( ColumnB) AS MIN_COL_VALUE FROM Table123;
SELECT 'ColumnC', COUNT(DISTINCT ColumnC) AS DISTINCT_COUNT, COUNT(1) - COUNT( ColumnC) AS NULL_COUNT, MAX( ColumnC) AS MAX_COL_VALUE, MIN( ColumnC) AS MIN_COL_VALUE FROM Table123;
....

Now I want to execute these subqueries and store the results somewhere, and this is where I am stuck.

When I try this:

result2 = pd.read_sql(results, session)
print(result2)

I get:

The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()

Should I not be doing this through Pandas? Should I loop over a list variable? Or something else?
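
The error most likely arises because `pd.read_sql` expects a SQL string as its first argument, while `results` is a DataFrame. A minimal sketch of pulling the generated statements out as plain strings so they can be looped over, assuming they sit in the first column of `results`; the `results` frame and its column name `generated_sql` below are hypothetical stand-ins, shortened for illustration:

```python
import pandas as pd

# Hypothetical stand-in for the DataFrame returned by the first read_sql call
results = pd.DataFrame({"generated_sql": [
    "SELECT 'ColumnA_ID', COUNT(DISTINCT ColumnA_ID) AS DISTINCT_COUNT FROM Table123;",
    "SELECT 'ColumnB', COUNT(DISTINCT ColumnB) AS DISTINCT_COUNT FROM Table123;",
]})

# Take the first column by position, then drop surrounding whitespace
# and the trailing semicolon from each statement
queries = [q.strip().rstrip(";") for q in results.iloc[:, 0]]

for q in queries:
    print(q)
```

Each element of `queries` is then an ordinary string that could be passed to `pd.read_sql` one at a time.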

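My end goal is a summary (in a DataFrame?) that shows, for each column, the table name, the column name, and the max/min/distinct values etc. that come from the subqueries generated by my initial SQL query.

Any help is appreciated.

1 answer:

Answer 0: (score: 0)

Assuming you have a Teradata database connection ready in con, you probably need something like this: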

import pandas as pd

tablename = 'ABC'
databasename = 'XYZ'

# Fetch the column names of the target table from the data dictionary
sql = "SELECT TRIM(COLUMNNAME) AS COLUMNNAME FROM DBC.COLUMNSV WHERE TABLENAME = '{0}' AND DATABASENAME = '{1}'".format(tablename, databasename)
df_db = pd.read_sql_query(sql, con)
print(df_db)
# Convert the DataFrame column to a plain list
cols = df_db['COLUMNNAME'].tolist()
print(cols)

# Build a single query with three aggregates per column; the aliases carry
# the column name so the result columns can be told apart
parts = []
for colname in cols:
    parts.append('MAX({0}) AS {0}_MAX, MIN({0}) AS {0}_MIN, COUNT(DISTINCT {0}) AS {0}_DISTINCT'.format(colname))
# Joining with commas avoids a trailing comma; the table is qualified with its database
analyse_sql = 'SELECT ' + ', '.join(parts) + ' FROM {0}.{1}'.format(databasename, tablename)
result2 = pd.read_sql_query(analyse_sql, con)
print(result2)
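
If one row per column is preferred over a single wide row, a possible variant is to run one aggregate query per column and stack the one-row results with `pd.concat`. The following is only a sketch under stated assumptions: it uses an in-memory SQLite database as a stand-in for the Teradata connection so that it runs anywhere, and the helper name `profile_columns` and the sample data are hypothetical. Against Teradata, the table name would presumably need the database qualifier as well.

```python
import sqlite3

import pandas as pd


def profile_columns(con, table, columns):
    """Run one aggregate query per column and stack the one-row results."""
    frames = []
    for col in columns:
        sql = (
            "SELECT '{t}' AS TABLE_NAME, '{c}' AS COLUMN_NAME, "
            "COUNT(DISTINCT {c}) AS DISTINCT_COUNT, "
            "COUNT(1) - COUNT({c}) AS NULL_COUNT, "
            "MAX({c}) AS MAX_COL_VALUE, "
            "MIN({c}) AS MIN_COL_VALUE "
            "FROM {t}"
        ).format(t=table, c=col)
        frames.append(pd.read_sql_query(sql, con))
    # One row per profiled column
    return pd.concat(frames, ignore_index=True)


# Stand-in data (assumption: SQLite replaces the Teradata connection here)
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE ABC (ColumnA_ID INTEGER, ColumnB TEXT)")
con.executemany("INSERT INTO ABC VALUES (?, ?)", [(1, "x"), (2, None), (2, "y")])
con.commit()

summary = profile_columns(con, "ABC", ["ColumnA_ID", "ColumnB"])
print(summary)
```

This issues one query per column, so it scans the table once for each column; on tables with millions of rows the single combined query above may be the cheaper choice.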

I hope this helps.