Trying to query a MySQL table with over 18 million rows. All I need is a simple query along the lines of:
select date, url, count(*) from table
where date > '2018-01-01' and date < current_date
group by date, url
But it dies after 15-20 minutes. I also tried connecting to the DB from Python with pandas and appending one day of data at a time to an empty DataFrame, but that just leaves me sitting there twiddling my thumbs...
import pandas as pd
import pymysql

conn = pymysql.connect(...)

result = []
# pull one day at a time so no single query has to scan the whole table at once
for date in pd.date_range(start='2019-01-01', end=pd.Timestamp.today().normalize()):
    query = "select * from table where time >= '{}' and time < '{}'".format(
        date.date(), (date + pd.DateOffset(days=1)).date())
    df = pd.read_sql(query, con=conn)
    result.append(df)

# pd.concat returns a new DataFrame, so the result has to be assigned
all_days = pd.concat(result, ignore_index=True)
print(all_days)
What are my options for getting at this data? The main goal is to pull it into Tableau and work with it from there...
Answer 0 (score: 2)
I started a mysql server in docker like this, just using the defaults:
docker run -d --rm --name mysql -e MYSQL_ALLOW_EMPTY_PASSWORD=true mysql
And created a database like this:
docker exec -it mysql mysql -e 'create database if not exists test'
And then connected an interactive session like this:
docker exec -it mysql mysql test
I then filled it with some 32 million random dates. The queries below use a single DATE column named d, so first the table:
create table dates (d date);
Then a single seed row:
INSERT into dates select date(from_unixtime(rand()*unix_timestamp(now())));
and then running this a few dozen times (each run doubles the row count, so about 25 runs of a self-insert turns 1 row into roughly 33 million):
INSERT into dates select date(from_unixtime(rand()*unix_timestamp(now())) ) from dates;
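If you'd rather script those repeated inserts than paste them a few dozen times by hand, here is a minimal sketch using pymysql. The connection details are assumptions: the docker run above doesn't publish a port, so you'd need to add something like -p 3306:3306 for this to reach the server from the host.

import pymysql

# assumed connection details: published port 3306, empty root password as above
conn = pymysql.connect(host='127.0.0.1', user='root', password='', database='test')
with conn.cursor() as cur:
    cur.execute("create table if not exists dates (d date)")
    # one seed row holding a random date between the epoch and now
    cur.execute("insert into dates select date(from_unixtime(rand()*unix_timestamp(now())))")
    for _ in range(25):  # each pass doubles the table: 2**25 is about 33 million rows
        cur.execute("insert into dates select date(from_unixtime(rand()*unix_timestamp(now()))) from dates")
conn.commit()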
Now I have almost twice as many dates as you do:
mysql> explain select * from dates;
+----+-------------+-------+------------+------+---------------+------+---------+------+----------+----------+-------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-------+------------+------+---------------+------+---------+------+----------+----------+-------+
| 1 | SIMPLE | dates | NULL | ALL | NULL | NULL | NULL | NULL | 33497947 | 100.00 | NULL |
+----+-------------+-------+------------+------+---------------+------+---------+------+----------+----------+-------+
1 row in set, 1 warning (0.00 sec)
Finally I can demonstrate how quickly I can search through the table:
mysql> select count(*), d from dates where d between '2001-01-01' and '2001-12-31' group by d order by d desc;
....
365 rows in set (4 min 31.17 sec)
Makes sense: there were a couple of thousand rows for every day in 2001. (Remember these dates are randomly distributed between 1970, the epoch, and now.)
No indexes and no SQL tuning, and it took 4.5 minutes. Hopefully that gives you a baseline for what to expect from your server and query performance.
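If you want the same kind of baseline number from your own setup, a minimal sketch like this times the grouped query from Python; the connection call is a placeholder and the table/column names are the ones from the question:

import time
import pandas as pd
import pymysql

conn = pymysql.connect(...)  # placeholder: your real connection details

start = time.perf_counter()
# aggregating on the server means only one row per (date, url) comes back
df = pd.read_sql(
    "select date, url, count(*) as hits from table "
    "where date > '2018-01-01' and date < current_date "
    "group by date, url",
    con=conn,
)
print(df.shape, "in", round(time.perf_counter() - start, 1), "seconds")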
Answer 1 (score: 0)
I used Python to build a for loop that queries and aggregates one day's worth of data at a time from the "unqueryable" table and appends it to a CSV, which I then connect to the BI tool. I also tried creating a new table in the database with some indexes and looping the same way, but appending into that table instead. A sketch of the CSV version follows below.
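A minimal sketch of that loop, reusing the table and column names from the question (table, time, url); the CSV path and date range are placeholders:

import pandas as pd
import pymysql

conn = pymysql.connect(...)  # placeholder: your real connection details

out_path = 'daily_counts.csv'  # hypothetical output file
header = True
for day in pd.date_range(start='2019-01-01', end=pd.Timestamp.today().normalize()):
    # aggregate one day at a time on the server so each query stays small
    query = (
        "select date(time) as date, url, count(*) as hits "
        "from table where time >= '{}' and time < '{}' "
        "group by date(time), url"
    ).format(day.date(), (day + pd.DateOffset(days=1)).date())
    df = pd.read_sql(query, con=conn)
    # append each day's summary to the CSV, writing the header only on the first day
    df.to_csv(out_path, mode='a', header=header, index=False)
    header = False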