Full-table SELECT of a MySQL table through a pandas DataFrame is larger than the sum of its parts

Time: 2014-11-20 19:04:55

Tags: python mysql pandas

users_old = pd.read_sql('SELECT {} FROM ResUsers WHERE PERMISSION=1 \
               AND DateAdded<"2013-01-01"'.format(RU_fields), db)

users_2013 = pd.read_sql('SELECT {} FROM ResUsers WHERE PERMISSION=1 \
               AND DateAdded>"2013-01-01" AND DateAdded<"2014-01-01"'.format(RU_fields), db)

users_2014 = pd.read_sql('SELECT {} FROM ResUsers WHERE PERMISSION=1 \
               AND DateAdded>"2014-01-01"'.format(RU_fields), db)

When I run these three queries in IPython, the process ends up using about 16.5 GB of memory. However, when I run this single query instead:

users = pd.read_sql('SELECT {} FROM ResUsers WHERE PERMISSION=1'.format(RU_fields), db)

the IPython process uses more and more memory until it crashes. The machine I am using has 60 GB of RAM in total.

Both Permission and DateAdded are non-null, so I don't know what could be happening here. As a sanity check, I tried:

mysql> SELECT count(*) FROM ResUsers WHERE Permission=1;
+----------+
| count(*) |
+----------+
| 31577307 |
+----------+
1 row in set (8.39 sec)

mysql> SELECT count(*) FROM ResUsers WHERE Permission=1 AND DateAdded<"2013-01-01"
    -> ;
+----------+
| count(*) |
+----------+
|  8255583 |
+----------+
1 row in set (51.13 sec)

mysql> SELECT count(*) FROM ResUsers WHERE Permission=1 AND DateAdded>"2013-01-01" AND DateAdded<"2014-01-01";
+----------+
| count(*) |
+----------+
| 11966819 |
+----------+
1 row in set (55.76 sec)

mysql> SELECT count(*) FROM ResUsers WHERE Permission=1 AND DateAdded>"2014-01-01";
+----------+
| count(*) |
+----------+
| 11354972 |
+----------+
1 row in set (51.11 sec)

These don't quite add up, as 8255583 + 11966819 + 11354972 = 31577374 != 31577307, although it is very close... Is there a reason why MySQL's counts might be off by a small amount?
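The arithmetic can be double-checked with a few lines of Python, using only the counts reported by the mysql session. Note that the three ranges use strict inequalities, so rows whose DateAdded equals exactly "2013-01-01" or "2014-01-01" fall into none of them. That would make the parts sum to less than the total, whereas here the parts exceed it. One possibility, and this is only a guess, is that the table changed between the queries.

```python
# Row counts copied from the mysql session above.
total = 31577307
parts = {
    "DateAdded < 2013-01-01": 8255583,
    "2013-01-01 < DateAdded < 2014-01-01": 11966819,
    "DateAdded > 2014-01-01": 11354972,
}

parts_sum = sum(parts.values())
difference = parts_sum - total  # positive means the parts exceed the total

print(parts_sum)   # 31577374
print(difference)  # 67
```

The difference is 67 rows out of roughly 31.6 million, which seems far too small to explain the gap in memory use on its own.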

What is going on here, or at the very least, how can I debug this? Is there some way to find out what is happening in memory while this call is running?
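One way to see how memory grows is to read the result in chunks and add up the size of each piece. The following is a minimal sketch, not a confirmed fix. It uses an in-memory SQLite table as a stand-in for the real MySQL connection, and the columns `Id` and `DateAdded` in `RU_fields` are hypothetical. Whether a MySQL driver still buffers the whole result set on the client depends on the driver and cursor type, which this sketch does not address.

```python
import sqlite3

import pandas as pd

# Stand-in for the real MySQL connection: a small in-memory SQLite table.
# The schema and the contents of RU_fields are assumptions for illustration.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE ResUsers (Id INTEGER, Permission INTEGER, DateAdded TEXT)")
db.executemany(
    "INSERT INTO ResUsers VALUES (?, ?, ?)",
    [(i, 1, "2013-06-01") for i in range(1000)],
)
RU_fields = "Id, DateAdded"

query = "SELECT {} FROM ResUsers WHERE Permission=1".format(RU_fields)

total_rows = 0
total_bytes = 0
# With chunksize set, read_sql returns an iterator of DataFrames
# instead of building one large DataFrame.
for chunk in pd.read_sql(query, db, chunksize=250):
    total_rows += len(chunk)
    # deep=True also counts the Python string objects held in object columns.
    total_bytes += int(chunk.memory_usage(deep=True).sum())

print(total_rows, total_bytes)
```

Printing the running totals inside the loop would show whether memory per row stays roughly constant or starts to climb at some point in the result set.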

Any ideas appreciated!

0 answers:

No answers