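I want to run some data analysis on a few Teradata database tables, and I quickly realized that, given the size of the tables (millions of records), pulling a table straight from the database into a Pandas DataFrame is not the best option.
I came up with a SQL query that, when run, gives me the set of sub-queries I need to execute to get the results I am looking for (distinct, max, min, null count, etc.), and I want to embed this in my Python script.
The query looks like this: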
SELECT 'SELECT ''' || TRIM(COLUMNNAME)
|| ''', COUNT(DISTINCT ' || COLUMNNAME || ') AS DISTINCT_COUNT,'
|| ' COUNT(1) - COUNT( ' || COLUMNNAME || ') AS NULL_COUNT,'
|| ' MAX( ' || COLUMNNAME || ') AS MAX_COL_VALUE,'
|| ' MIN( ' || COLUMNNAME || ') AS MIN_COL_VALUE'
|| ' FROM ' || TRIM(DATABASENAME) || '.' || TRIM(TABLENAME) || ';'
FROM DBC.COLUMNSV
WHERE DATABASENAME = 'XYZ'
AND TABLENAME = 'ABC';
Executing this query yields a set of individual queries (about 30 or so for the table I am testing with).
I executed the above using the following:
results = pd.read_sql("""
SELECT 'SELECT ''' || TRIM(COLUMNNAME)
|| ''', COUNT(DISTINCT ' || COLUMNNAME || ') AS DISTINCT_COUNT,'
|| ' COUNT(1) - COUNT( ' || COLUMNNAME || ') AS NULL_COUNT,'
|| ' MAX( ' || COLUMNNAME || ') AS MAX_COL_VALUE,'
|| ' MIN( ' || COLUMNNAME || ') AS MIN_COL_VALUE'
|| ' FROM ' || TRIM(DATABASENAME) || '.' || TRIM(TABLENAME) || ';'
FROM DBC.COLUMNSV
WHERE DATABASENAME = 'XYZ'
AND TABLENAME = 'ABC';""", session)
Running the above produces the following:
SELECT 'ColumnA_ID', COUNT(DISTINCT ColumnA_ID) AS DISTINCT_COUNT, COUNT(1) - COUNT( ColumnA_ID) AS NULL_COUNT, MAX( ColumnA_ID) AS MAX_COL_VALUE, MIN( ColumnA_ID) AS MIN_COL_VALUE FROM Table123;
SELECT 'ColumnB', COUNT(DISTINCT ColumnB) AS DISTINCT_COUNT, COUNT(1) - COUNT( ColumnB) AS NULL_COUNT, MAX( ColumnB) AS MAX_COL_VALUE, MIN( ColumnB) AS MIN_COL_VALUE FROM Table123;
SELECT 'ColumnC', COUNT(DISTINCT ColumnC) AS DISTINCT_COUNT, COUNT(1) - COUNT( ColumnC) AS NULL_COUNT, MAX( ColumnC) AS MAX_COL_VALUE, MIN( ColumnC) AS MIN_COL_VALUE FROM Table123;
....
Now I want to execute these sub-queries and store the results somewhere, and this is where I am stuck.
When I try this:
result2 = pd.read_sql(results, session)
print(result2)
I get:
The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()
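Should I not be doing this through Pandas? Should I loop over a list variable instead? Or something else?
My end goal is a summary (in a DataFrame?) that shows, for each column, the table name, the column name, and the max/min/distinct count etc. that I get from the sub-queries generated by the initial SQL query.
Any help is appreciated.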
Answer 0 (score: 0)
Assuming you have a Teradata database connection con ready,
you probably need something like this:
import pandas as pd

tablename = 'ABC'
databasename = 'XYZ'
sql = "select COLUMNNAME FROM DBC.COLUMNSV where TABLENAME='{0}' and DATABASENAME='{1}'".format(tablename, databasename)
df_db = pd.read_sql_query(sql, con)
print(df_db)
# convert the dataframe column to a plain list, stripping any padding (mirrors the TRIM in the question)
cols = [c.strip() for c in df_db['COLUMNNAME'].tolist()]
print(cols)
analyse_sql = 'select '
for col in cols:
    # suffix every alias with the column name so the result columns can be told apart
    analyse_sql = analyse_sql + 'max({0}) MAX_{0},min({0}) MIN_{0},count(distinct {0}) DISTINCT_{0}, '.format(col)
# remove the unwanted last comma and finish the SQL with the qualified table name
analyse_sql = analyse_sql.rsplit(',', 1)[0] + ' from {0}.{1}'.format(databasename, tablename)
result2 = pd.read_sql_query(analyse_sql, con)
print(result2)
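If the goal is still the one-row-per-column summary from the question, another option is to run each generated statement on its own and stack the results. The error in the question most likely arises because results is a DataFrame, whereas pd.read_sql expects a SQL string as its first argument. Below is a minimal sketch, not a definitive implementation: run_generated_queries is a hypothetical helper name, the COLUMN_NAME alias is my addition (the generated statements in the question leave the literal unaliased), and an in-memory SQLite database stands in for Teradata so the example runs anywhere.

```python
import sqlite3

import pandas as pd


def run_generated_queries(queries, con):
    """Run each SQL string and stack the one-row results into one DataFrame."""
    frames = [pd.read_sql_query(query, con) for query in queries]
    # concat aligns on column names, so every query must use the same aliases
    return pd.concat(frames, ignore_index=True)


# Stand-in for the Teradata table (assumption: SQLite, tiny demo data).
demo_con = sqlite3.connect(":memory:")
demo_con.execute("CREATE TABLE ABC (ColumnA_ID INTEGER, ColumnB TEXT)")
demo_con.executemany(
    "INSERT INTO ABC VALUES (?, ?)",
    [(1, "x"), (2, None), (2, "y"), (3, "y")],
)

# Stand-in for the statements produced by the generator query in the question,
# with a fixed alias added for the column-name literal.
template = (
    "SELECT '{0}' AS COLUMN_NAME, "
    "COUNT(DISTINCT {0}) AS DISTINCT_COUNT, "
    "COUNT(1) - COUNT({0}) AS NULL_COUNT, "
    "MAX({0}) AS MAX_COL_VALUE, "
    "MIN({0}) AS MIN_COL_VALUE "
    "FROM ABC"
)
generated = [template.format(col) for col in ["ColumnA_ID", "ColumnB"]]

summary = run_generated_queries(generated, demo_con)
print(summary)
```

With the objects from the question, the call would presumably be run_generated_queries(results.iloc[:, 0], session), provided the generator query also emits the AS COLUMN_NAME alias so that the frames line up by column name. This sends one query per column, so roughly 30 round trips for the table in the question.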
I hope it helps.