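Count the number of non-null values in each column of every table in MySQL

Time: 2018-04-04 05:40:41

Tags: python mysql

Is there a way to use SQL to generate the output below for all tables in a given database (using MySQL), without having to specify the individual table names and columns?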

Table   Column  Count
----    ----    ----
Table1  Col1    0
Table1  Col2    100
Table1  Col3    0
Table1  Col4    67
Table1  Col5    0
Table2  Col1    30
Table2  Col2    0
Table2  Col3    2
... ... ...

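The goal is to identify which columns are worth analysing based on the data they contain (a large number of the columns are empty).

A workaround using Python (one table at a time):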

    # Libraries
    import pymysql
    import pandas as pd
    import pymysql.cursors

    # Connect to mariaDB
    connection = pymysql.connect(host='localhost',
                                 user='root',
                                 password='my_password',
                                 db='my_database',
                                 charset='latin1',
                                 cursorclass=pymysql.cursors.DictCursor)


    # Get column metadata
    sql = """SELECT *
                FROM `INFORMATION_SCHEMA`.`COLUMNS` 
                WHERE `TABLE_SCHEMA`='my_database' 
            """
    with connection.cursor() as cursor:
        cursor.execute(sql)
        result = cursor.fetchall()

    # Store in dataframe
    df = pd.DataFrame(result)
    df = df[['TABLE_NAME', 'COLUMN_NAME']]

    # Build SQL string (one table at a time for now)
    my_table = 'my_table'
    df_my_table = df[df.TABLE_NAME==my_table].copy()
    cols = list(df_my_table.COLUMN_NAME)
    # Backtick-quote identifiers so reserved words or unusual names do not break the query
    col_strings = ['COUNT(`{0}`) AS `{0}`'.format(x) for x in cols]
    sql = 'SELECT ' + ', '.join(col_strings) + ' FROM `' + my_table + '`'

    # Execute
    with connection.cursor() as cursor:
        cursor.execute(sql)
        result = cursor.fetchall()

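The result is a dictionary of column names and counts.

2 answers:

Answer 0: (score: 1)

Basically, no. See also this answer.

Also note that the closest match in the answer linked above is essentially the approach you are already using, only implemented less efficiently in reflective SQL.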

I would do the same as you did: build a SQL statement like

    SELECT
        COUNT(*) AS `count`,
        SUM(IF(columnName1 IS NULL,1,0)) AS columnName1,
        ...
        SUM(IF(columnNameN IS NULL,1,0)) AS columnNameN
    FROM tableName;

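Use information_schema as the source of table and column names, run the statement once for each table in MySQL, and then break the single returned row apart into N tuples of the form (tableName, columnName, total, nulls), one per column. The non-null count of a column is then total minus nulls.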
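The steps above could be sketched in Python roughly as follows. This is only an illustration under stated assumptions: the helper names `quote_ident`, `build_null_count_sql` and `unpivot` are hypothetical, the result row is assumed to arrive as a dict (as it would with pymysql's `DictCursor`), and positional aliases (`n0`, `n1`, ...) are used instead of the column names so that a column that happens to be called `count` or `total` cannot clash with the row-total alias.

```python
def quote_ident(name):
    """Backtick-quote a MySQL identifier, doubling any embedded backticks."""
    return "`" + name.replace("`", "``") + "`"


def build_null_count_sql(table, columns):
    """Build one query returning the row total and a NULL count per column."""
    parts = ["COUNT(*) AS `total`"]
    for i, col in enumerate(columns):
        # Positional aliases (n0, n1, ...) avoid clashes with real column names.
        parts.append("SUM(IF({} IS NULL,1,0)) AS `n{}`".format(quote_ident(col), i))
    return "SELECT " + ", ".join(parts) + " FROM " + quote_ident(table)


def unpivot(table, columns, row):
    """Turn the single result row into (table, column, total, nulls) tuples."""
    total = int(row["total"])
    # SUM() over zero rows yields NULL (None in Python), so treat it as 0.
    return [(table, col, total, int(row["n{}".format(i)] or 0))
            for i, col in enumerate(columns)]


# Hypothetical result row, shaped like what a DictCursor would return.
sql = build_null_count_sql("Table1", ["Col1", "Col2"])
rows = unpivot("Table1", ["Col1", "Col2"], {"total": 100, "n0": 100, "n1": 0})
print(sql)
print(rows)
```

In real use, the column lists would come from information_schema and each generated statement would be executed through the existing connection; the tuples from all tables can then be concatenated into one result set.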

Answer 1: (score: 1)

It is possible, but it will not be quick.

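As noted in the previous answer, you can walk through the columns table in information_schema and build queries to get the counts. The only question is how long you are prepared to wait for the result, because you end up counting the rows once for every column of every table. You can speed things up by having the cursor skip columns that are defined as NOT NULL (i.e. select only those with IS_NULLABLE = 'YES').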
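One way to read that speed-up: a column declared NOT NULL can never contain a NULL, so its non-null count always equals the table's row count and it needs no counting query of its own. A minimal sketch, assuming metadata rows shaped like the dicts a `DictCursor` returns from `information_schema.columns`; the function name `split_by_nullability` is hypothetical.

```python
from collections import defaultdict


def split_by_nullability(metadata_rows):
    """Group columns per table into those that need counting and those that do not."""
    nullable = defaultdict(list)      # columns that may contain NULLs
    not_nullable = defaultdict(list)  # columns declared NOT NULL
    for row in metadata_rows:
        target = nullable if row["IS_NULLABLE"] == "YES" else not_nullable
        target[row["TABLE_NAME"]].append(row["COLUMN_NAME"])
    return dict(nullable), dict(not_nullable)


# Hypothetical metadata, mimicking rows fetched from information_schema.columns.
metadata = [
    {"TABLE_NAME": "Table1", "COLUMN_NAME": "id", "IS_NULLABLE": "NO"},
    {"TABLE_NAME": "Table1", "COLUMN_NAME": "Col1", "IS_NULLABLE": "YES"},
    {"TABLE_NAME": "Table2", "COLUMN_NAME": "Col1", "IS_NULLABLE": "YES"},
]
to_count, always_full = split_by_nullability(metadata)
print(to_count)
print(always_full)
```

Only the columns in `to_count` need a per-column count; those in `always_full` can be reported with the table's row count directly.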

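The solution suggested by LSerni will be faster, particularly if you have very wide tables and/or high row counts, but it takes more work to process the results.

e.g.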

    DELIMITER //

    DROP PROCEDURE IF EXISTS non_nulls //
    CREATE PROCEDURE non_nulls (IN sname VARCHAR(64))
    BEGIN

    -- Parameters:
    -- Schema name to check
    -- call non_nulls('sakila');

      DECLARE vTABLE_NAME varchar(64);
      DECLARE vCOLUMN_NAME varchar(64);
      DECLARE vIS_NULLABLE varchar(3);
      DECLARE vCOLUMN_KEY varchar(3);

      DECLARE done BOOLEAN DEFAULT FALSE;

      -- One cursor row per column of every table in the schema
      DECLARE cur1 CURSOR FOR 
        SELECT `TABLE_NAME`, `COLUMN_NAME`, `IS_NULLABLE`, `COLUMN_KEY` 
        FROM `information_schema`.`columns`
        WHERE `TABLE_SCHEMA` = sname
        ORDER BY `TABLE_NAME` ASC, `ORDINAL_POSITION` ASC;

      DECLARE CONTINUE HANDLER FOR NOT FOUND SET done := TRUE;

      -- ROWS is a reserved word from MySQL 8.0.2 onwards, hence the backticks
      DROP TEMPORARY TABLE IF EXISTS non_nulls;
      CREATE TEMPORARY TABLE non_nulls(
        table_name VARCHAR(64),
        column_name VARCHAR(64),
        column_key CHAR(3),
        is_nullable CHAR(3),
        `rows` BIGINT, 
        populated BIGINT 
      ); 

      OPEN cur1;

      read_loop: LOOP
        FETCH cur1 INTO vTABLE_NAME, vCOLUMN_NAME, vIS_NULLABLE, vCOLUMN_KEY;
        IF done THEN
          LEAVE read_loop;
        END IF;

        -- COUNT(*) gives the row total, COUNT(col) the non-null values in that column
        SET @sql := CONCAT('INSERT INTO non_nulls ',
                    '(table_name,column_name,column_key,is_nullable,`rows`,populated) ',
                    'SELECT \'', vTABLE_NAME, '\',\'', vCOLUMN_NAME, '\',\'', vCOLUMN_KEY, '\',\'', 
                     vIS_NULLABLE, '\', COUNT(*), COUNT(`', vCOLUMN_NAME, '`) ',
                    'FROM `', sname, '`.`', vTABLE_NAME, '`');

        PREPARE stmt1 FROM @sql;
        EXECUTE stmt1;
        DEALLOCATE PREPARE stmt1;

      END LOOP;

      CLOSE cur1;

      SELECT * FROM non_nulls;

    END //

    DELIMITER ;

    call non_nulls('sakila');