我有一个表(来自日志文件),其中包含电子邮件和其他三列,其中包含该用户与系统交互的状态,电子邮件(用户)可能有100或1000个条目,每个条目包含这三个条目值的组合,可能会在同一电子邮件和其他电子邮件上重复使用。 看起来像这样:
+---------+---------+---------+-----+
| email | val1 | val2 | val3 |
+---------+---------+---------+-----+
|jal@h | cast | core | cam |
|hal@b |little ja| qar | ja sa |
|bam@t | cast | core | cam |
|jal@h |little ja| qar | jaja |
+---------+---------+---------+-----+
等等,电子邮件重复,所有值都重复,每列有40多个可能的值,所有字符串。所以我想对不同的电子邮件进行排序,然后将所有可能的值作为列名称,并在其下计算特定电子邮件重复多少这个值,如下所示:
+-------+--------+--------+------+----------+-----+--------+-------+
| email | cast | core | cam | little ja| qar | ja sa | blabla |
+-------+--------+--------+------+----------+-----+--------+--------|
|jal@h | 55 | 2 | 44 | 244 | 1 | 200 | 12 |
|hal@b | 900 | 513 | 101 | 146 | 2 | 733 | 833 |
|bam@t | 1231 | 33 | 433 | 411 | 933 | 833 | 53 |
+-------+--------+--------+------+----------+-----+--------+---------
我尝试过mysql,但我设法为每封电子邮件计算一定值的总出现次数,但不计算每列中所有可能的值:
SELECT
distinct email,
count(val1) as "cast"
FROM table1
where val1 = 'cast'
group by email
这个查询显然没有这样做,因为它只输出值' cast'从第一列' val1',我要查找的是第一,第二和第三列中的所有不同值都转为列头,行中的值将是该值的总和,对于某个电子邮件用户' 有一个关键的桌子的东西,但我无法让它工作。 我将此数据作为mysql中的表处理,但它在csv文件中可用,因此如果查询不可能,python将是一种可能的解决方案,并且优先于sql。
更新 在python中,是否可以输出数据:
+-------+--------+---------+------+----------+-----+--------+-------+
| | val1 | val2 | val3 |
+-------+--------+---------+------+----------+-----+--------+-------+
| email | cast |little ja|core | qar |cam | ja sa | jaja |
+-------+--------+---------+------+----------+-----+--------+--------|
|jal@h | 55 | 2 | 44 | 244 | 1 | 200 | 12 |
|hal@b | 900 | 513 | 101 | 146 | 2 | 733 | 833 |
|bam@t | 1231 | 33 | 433 | 411 | 933 | 833 | 53 |
+-------+--------+--------+------+----------+-----+--------+---------
我对python不是很熟悉。
答案 0 :(得分:3)
如果您使用pandas,则可以在通过电子邮件对数据框进行分组后再value_counts
进行unstack/pivot
,然后(df.set_index("email").stack().groupby(level=0).value_counts()
.unstack(level=1).reset_index().fillna(0))
将其设置为宽屏格式:
stack
要获取更新的结果,您可以在(df.set_index("email").stack().groupby(level=[0, 1]).value_counts()
.unstack(level=[1, 2]).fillna(0).sort_index(axis=1))
之后按电子邮件和val *列进行分组:
Company backend 'company-lua' could not be initialized:
Company found no Lua executable
答案 1 :(得分:2)
我重新构建数据框,然后使用pd.value_counts
v = df.values
s = pd.Series(v[:, 1:].ravel(), v[:, 0].repeat(3))
s.groupby(level=0).value_counts().unstack(fill_value=0)
cam cast core ja sa jaja little ja qar
bam@t 1 1 1 0 0 0 0
hal@b 0 0 0 1 0 1 1
jal@h 1 1 1 0 1 1 1
答案 2 :(得分:0)
如果您知道该列表,则可以使用group by
计算它:
SELECT email,
sum(val1 = 'cast') as `cast`,
sum(val1 = 'core') as `core`,
sum(val1 = 'cam') as `cam`,
. . .
FROM table1
GROUP BY email;
. . .
用于填写剩余值。
答案 3 :(得分:0)
您可以使用此查询从表格中的值val1-val3生成PREPARED语句动态:
SELECT
CONCAT( "SELECT email,\n",
GROUP_CONCAT(
CONCAT (" SUM(IF('",val1,"' IN(val1,val2,val3),1,0)) AS '",val1,"'")
SEPARATOR ',\n'),
"\nFROM table1\nGROUP BY EMAIL\nORDER BY email") INTO @myquery
FROM (
SELECT val1 FROM table1
UNION SELECT val2 FROM table1
UNION SELECT val3 FROM table1
) AS vals
ORDER BY val1;
-- ONLY TO VERIFY QUERY
SELECT @myquery;
PREPARE stmt FROM @myquery;
EXECUTE stmt;
DEALLOCATE PREPARE stmt;
样本表
mysql> SELECT * FROM table1;
+----+-------+-----------+------+-------+
| id | email | val1 | val2 | val3 |
+----+-------+-----------+------+-------+
| 1 | jal@h | cast | core | cam |
| 2 | hal@b | little ja | qar | ja sa |
| 3 | bam@t | cast | core | cam |
| 4 | jal@h | little ja | qar | cast |
+----+-------+-----------+------+-------+
4 rows in set (0,00 sec)
生成查询
mysql> SELECT
-> CONCAT( "SELECT email,\n",
-> GROUP_CONCAT(
-> CONCAT (" SUM(IF('",val1,"' IN(val1,val2,val3),1,0)) AS '",val1,"'")
-> SEPARATOR ',\n'),
-> "\nFROM table1\nGROUP BY EMAIL\nORDER BY email") INTO @myquery
-> FROM (
-> SELECT val1 FROM table1
-> UNION SELECT val2 FROM table1
-> UNION SELECT val3 FROM table1
-> ) AS vals
-> ORDER BY val1;
Query OK, 1 row affected (0,00 sec)
验证查询
mysql> -- ONLY TO VERIFY QUERY
mysql> SELECT @myquery;
SELECT email,
SUM(IF('cast' IN(val1,val2,val3),1,0)) AS 'cast',
SUM(IF('little ja' IN(val1,val2,val3),1,0)) AS 'little ja',
SUM(IF('core' IN(val1,val2,val3),1,0)) AS 'core',
SUM(IF('qar' IN(val1,val2,val3),1,0)) AS 'qar',
SUM(IF('cam' IN(val1,val2,val3),1,0)) AS 'cam',
SUM(IF('ja sa' IN(val1,val2,val3),1,0)) AS 'ja sa'
FROM table1
GROUP BY EMAIL
ORDER BY email
1 row in set (0,00 sec)
执行查询
mysql> PREPARE stmt FROM @myquery;
Query OK, 0 rows affected (0,00 sec)
Statement prepared
mysql> EXECUTE stmt;
+-------+------+-----------+------+------+------+-------+
| email | cast | little ja | core | qar | cam | ja sa |
+-------+------+-----------+------+------+------+-------+
| bam@t | 1 | 0 | 1 | 0 | 1 | 0 |
| hal@b | 0 | 1 | 0 | 1 | 0 | 1 |
| jal@h | 2 | 1 | 1 | 1 | 1 | 0 |
+-------+------+-----------+------+------+------+-------+
3 rows in set (0,00 sec)
mysql> DEALLOCATE PREPARE stmt;
Query OK, 0 rows affected (0,00 sec)
mysql>