将行值转换为列,并计算所有可能值mysql的重复次数

时间:2017-02-04 23:47:25

标签: python mysql csv pandas

我有一个表(来自日志文件),其中包含电子邮件和其他三列,其中包含该用户与系统交互的状态,电子邮件(用户)可能有100或1000个条目,每个条目包含这三个条目值的组合,可能会在同一电子邮件和其他电子邮件上重复使用。 看起来像这样:

+---------+---------+---------+-----+
| email |  val1   |  val2 |  val3   |
+---------+---------+---------+-----+
|jal@h  |  cast   | core  |   cam   |
|hal@b  |little ja| qar   |  ja sa  |
|bam@t  |  cast   | core  |   cam   |
|jal@h  |little ja| qar   |  jaja   | 
+---------+---------+---------+-----+

等等,电子邮件重复,所有值都重复,每列有40多个可能的值,所有字符串。所以我想对不同的电子邮件进行排序,然后将所有可能的值作为列名称,并在其下计算特定电子邮件重复多少这个值,如下所示:

+-------+--------+--------+------+----------+-----+--------+-------+
| email | cast   |   core |  cam | little ja| qar |  ja sa | blabla |
+-------+--------+--------+------+----------+-----+--------+--------|
|jal@h  |  55    |   2    | 44   |   244    | 1   | 200    | 12     |
|hal@b  |  900   |  513   | 101  |   146    |  2  |  733   | 833    |
|bam@t  |  1231  |   33   | 433  |   411    | 933 | 833    | 53     |
+-------+--------+--------+------+----------+-----+--------+---------

我尝试过mysql,但我设法为每封电子邮件计算一定值的总出现次数,但不计算每列中所有可能的值:

SELECT 
  distinct email,

  count(val1) as "cast"
FROM table1
where val1 = 'cast'
group by email

这个查询显然没有这样做,因为它只输出值' cast'从第一列' val1',我要查找的是第一,第二和第三列中的所有不同值都转为列头,行中的值将是该值的总和,对于某个电子邮件用户' 有一个关键的桌子的东西,但我无法让它工作。 我将此数据作为mysql中的表处理,但它在csv文件中可用,因此如果查询不可能,python将是一种可能的解决方案,并且优先于sql。

更新 在python中,是否可以输出数据:

+-------+--------+---------+------+----------+-----+--------+-------+
|       |     val1         |      val2       |     val3              |
+-------+--------+---------+------+----------+-----+--------+-------+
| email | cast   |little ja|core  | qar      |cam  | ja sa  | jaja   |
+-------+--------+---------+------+----------+-----+--------+--------|
|jal@h  |  55    |   2     | 44   |   244    | 1   | 200    | 12     |
|hal@b  |  900   |  513    | 101  |   146    |  2  |  733   | 833    |
|bam@t  |  1231  |   33    | 433  |   411    | 933 | 833    | 53     |
+-------+--------+--------+------+----------+-----+--------+---------

我对python不是很熟悉。

4 个答案:

答案 0 :(得分:3)

如果您使用pandas,则可以在通过电子邮件对数据框进行分组后再value_counts进行unstack/pivot,然后(df.set_index("email").stack().groupby(level=0).value_counts() .unstack(level=1).reset_index().fillna(0)) 将其设置为宽屏格式:

stack

enter image description here

要获取更新的结果,您可以在(df.set_index("email").stack().groupby(level=[0, 1]).value_counts() .unstack(level=[1, 2]).fillna(0).sort_index(axis=1)) 之后按电子邮件和val *列进行分组:

Company backend 'company-lua' could not be initialized:
Company found no Lua executable

enter image description here

答案 1 :(得分:2)

我重新构建数据框,然后使用pd.value_counts

进行分组和取消堆叠
v = df.values
s = pd.Series(v[:, 1:].ravel(), v[:, 0].repeat(3))

s.groupby(level=0).value_counts().unstack(fill_value=0)

       cam  cast  core  ja sa  jaja  little ja  qar
bam@t    1     1     1      0     0          0    0
hal@b    0     0     0      1     0          1    1
jal@h    1     1     1      0     1          1    1

答案 2 :(得分:0)

如果您知道该列表,则可以使用group by计算它:

SELECT email,
       sum(val1 = 'cast') as `cast`,
       sum(val1 = 'core') as `core`,
       sum(val1 = 'cam') as `cam`,
       . . .
FROM table1
GROUP BY email;

. . .用于填写剩余值。

答案 3 :(得分:0)

您可以使用此查询从表格中的值val1-val3生成PREPARED语句动态:

SELECT
    CONCAT( "SELECT email,\n",
        GROUP_CONCAT(
            CONCAT ("  SUM(IF('",val1,"' IN(val1,val2,val3),1,0)) AS '",val1,"'")
        SEPARATOR ',\n'),
        "\nFROM table1\nGROUP BY EMAIL\nORDER BY email") INTO @myquery
FROM (
    SELECT val1 FROM table1
    UNION SELECT val2 FROM table1
    UNION SELECT val3 FROM table1
) AS vals
ORDER BY val1;

-- ONLY TO VERIFY QUERY
SELECT @myquery;

PREPARE stmt FROM @myquery;
EXECUTE stmt;
DEALLOCATE PREPARE stmt;

样本表

mysql> SELECT * FROM table1;
+----+-------+-----------+------+-------+
| id | email | val1      | val2 | val3  |
+----+-------+-----------+------+-------+
|  1 | jal@h | cast      | core | cam   |
|  2 | hal@b | little ja | qar  | ja sa |
|  3 | bam@t | cast      | core | cam   |
|  4 | jal@h | little ja | qar  | cast  |
+----+-------+-----------+------+-------+
4 rows in set (0,00 sec)

生成查询

mysql> SELECT
    ->     CONCAT( "SELECT email,\n",
    ->         GROUP_CONCAT(
    ->             CONCAT ("  SUM(IF('",val1,"' IN(val1,val2,val3),1,0)) AS '",val1,"'")
    ->         SEPARATOR ',\n'),
    ->         "\nFROM table1\nGROUP BY EMAIL\nORDER BY email") INTO @myquery
    -> FROM (
    ->     SELECT val1 FROM table1
    ->     UNION SELECT val2 FROM table1
    ->     UNION SELECT val3 FROM table1
    -> ) AS vals
    -> ORDER BY val1;
Query OK, 1 row affected (0,00 sec)

验证查询

mysql> -- ONLY TO VERIFY QUERY
mysql> SELECT @myquery;

SELECT email,
  SUM(IF('cast' IN(val1,val2,val3),1,0)) AS 'cast',
  SUM(IF('little ja' IN(val1,val2,val3),1,0)) AS 'little ja',
  SUM(IF('core' IN(val1,val2,val3),1,0)) AS 'core',
  SUM(IF('qar' IN(val1,val2,val3),1,0)) AS 'qar',
  SUM(IF('cam' IN(val1,val2,val3),1,0)) AS 'cam',
  SUM(IF('ja sa' IN(val1,val2,val3),1,0)) AS 'ja sa'
FROM table1
GROUP BY EMAIL
ORDER BY email

1 row in set (0,00 sec)

执行查询

mysql> PREPARE stmt FROM @myquery;
Query OK, 0 rows affected (0,00 sec)
Statement prepared

mysql> EXECUTE stmt;
+-------+------+-----------+------+------+------+-------+
| email | cast | little ja | core | qar  | cam  | ja sa |
+-------+------+-----------+------+------+------+-------+
| bam@t |    1 |         0 |    1 |    0 |    1 |     0 |
| hal@b |    0 |         1 |    0 |    1 |    0 |     1 |
| jal@h |    2 |         1 |    1 |    1 |    1 |     0 |
+-------+------+-----------+------+------+------+-------+
3 rows in set (0,00 sec)

mysql> DEALLOCATE PREPARE stmt;
Query OK, 0 rows affected (0,00 sec)

mysql>