Trying to query a MySQL table with over 18 million rows. All I need is a simple query along the lines of:
select date, url, count(*) from table
where date > '2018-01-01' and date < current_date
group by date, url
But it dies after 15-20 minutes. I also tried connecting to the DB from Python with pandas and appending one day of data at a time to an empty DataFrame, but that just leaves me sitting there twiddling my thumbs...
import pandas as pd
import pymysql

conn = pymysql.connect(...)

result = []
# pull one day at a time so no single query has to scan the whole table at once
for date in pd.date_range(start='2019-01-01', end=pd.Timestamp.today().normalize()):
    query = "select * from table where time >= '{}' and time < '{}'".format(
        date.date(), (date + pd.DateOffset(days=1)).date())
    df = pd.read_sql(query, con=conn)
    result.append(df)

# pd.concat returns a new DataFrame, so the result has to be assigned
all_days = pd.concat(result, ignore_index=True)
print(all_days)
What are my options for getting at this data? The main goal is to pull it into Tableau and work with it from there...
Answer 0 (score: 2)
I started a mysql server in docker like this, just using the defaults:
docker run -d --rm --name mysql -e MYSQL_ALLOW_EMPTY_PASSWORD=true mysql
And created a database like this:
docker exec -it mysql mysql -e 'create database if not exists test'
And then connected an interactive session like this:
docker exec -it mysql mysql test
I then filled it with some 32 million random dates. The queries below use a single DATE column named d, so first the table:
create table dates (d date);
Then a single seed row:
INSERT into dates select date(from_unixtime(rand()*unix_timestamp(now())));
and then running this a few dozen times (each run doubles the row count, so about 25 runs of a self-insert turns 1 row into roughly 33 million):
INSERT into dates select date(from_unixtime(rand()*unix_timestamp(now())) ) from dates;
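If you'd rather script those repeated inserts than paste them a few dozen times by hand, here is a minimal sketch using pymysql. The connection details are assumptions: the docker run above doesn't publish a port, so you'd need to add something like -p 3306:3306 for this to reach the server from the host.

import pymysql

# assumed connection details: published port 3306, empty root password as above
conn = pymysql.connect(host='127.0.0.1', user='root', password='', database='test')
with conn.cursor() as cur:
    cur.execute("create table if not exists dates (d date)")
    # one seed row holding a random date between the epoch and now
    cur.execute("insert into dates select date(from_unixtime(rand()*unix_timestamp(now())))")
    for _ in range(25):  # each pass doubles the table: 2**25 is about 33 million rows
        cur.execute("insert into dates select date(from_unixtime(rand()*unix_timestamp(now()))) from dates")
conn.commit()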
Now I have almost twice as many dates as you do:
mysql> explain select * from dates;
+----+-------------+-------+------------+------+---------------+------+---------+------+----------+----------+-------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-------+------------+------+---------------+------+---------+------+----------+----------+-------+
| 1 | SIMPLE | dates | NULL | ALL | NULL | NULL | NULL | NULL | 33497947 | 100.00 | NULL |
+----+-------------+-------+------------+------+---------------+------+---------+------+----------+----------+-------+
1 row in set, 1 warning (0.00 sec)
Finally I can demonstrate how quickly I can search through the table:
mysql> select count(*), d from dates where d between '2001-01-01' and '2001-12-31' group by d order by d desc;
....
365 rows in set (4 min 31.17 sec)
Makes sense: there were a couple of thousand rows for every day in 2001. (Remember these dates are randomly distributed between 1970, the epoch, and now.)
No indexes and no SQL tuning, and it took 4.5 minutes. Hopefully that gives you a baseline for what to expect from your server and query performance.
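If you want the same kind of baseline number from your own setup, a minimal sketch like this times the grouped query from Python; the connection call is a placeholder and the table/column names are the ones from the question:

import time
import pandas as pd
import pymysql

conn = pymysql.connect(...)  # placeholder: your real connection details

start = time.perf_counter()
# aggregating on the server means only one row per (date, url) comes back
df = pd.read_sql(
    "select date, url, count(*) as hits from table "
    "where date > '2018-01-01' and date < current_date "
    "group by date, url",
    con=conn,
)
print(df.shape, "in", round(time.perf_counter() - start, 1), "seconds")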
Answer 1 (score: 0)
I used Python to build a for loop that queries and aggregates one day's worth of data at a time from the "unqueryable" table and appends it to a CSV, which I then connect to the BI tool. I also tried creating a new table in the database with some indexes and looping the same way, but appending into that table instead. A sketch of the CSV version follows below.
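A minimal sketch of that loop, reusing the table and column names from the question (table, time, url); the CSV path and date range are placeholders:

import pandas as pd
import pymysql

conn = pymysql.connect(...)  # placeholder: your real connection details

out_path = 'daily_counts.csv'  # hypothetical output file
header = True
for day in pd.date_range(start='2019-01-01', end=pd.Timestamp.today().normalize()):
    # aggregate one day at a time on the server so each query stays small
    query = (
        "select date(time) as date, url, count(*) as hits "
        "from table where time >= '{}' and time < '{}' "
        "group by date(time), url"
    ).format(day.date(), (day + pd.DateOffset(days=1)).date())
    df = pd.read_sql(query, con=conn)
    # append each day's summary to the CSV, writing the header only on the first day
    df.to_csv(out_path, mode='a', header=header, index=False)
    header = False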