Existing database and desired result:
I have a large SQLite database (12 GB, with a table of 44 million rows) that I want to modify with Pandas in Python 3.
Example goal: I want to read one of these large tables (44 million rows) into a DataFrame in chunks, manipulate each DataFrame chunk, and write the results to a new table. If possible, I would like to replace the new table if it exists and append each chunk to it.
Because my manipulations only add or modify columns, the new table should have the same number of rows as the original.
The problem:
The main issue seems to stem from the following line in the code below:
df.to_sql(new_table, con=db, if_exists = "append", index=False)
Traceback (most recent call last):
File "example.py", line 23, in <module>
for df in df_generator:
File "/usr/local/lib/python3.5/site-packages/pandas/io/sql.py", line 1420, in _query_iterator
data = cursor.fetchmany(chunksize)
sqlite3.OperationalError: SQL logic error or missing database
If I re-run the script with the same new-table name, it runs an additional chunk per chunk, plus 1 extra row.
When the df.to_sql() line is commented out, the loop runs the expected number of chunks.
Complete test example of the problem:
Full code: example.py
import pandas as pd
import sqlite3

#Helper Functions Used in Example
def ren(invar, outvar, df):
    df.rename(columns={invar: outvar}, inplace=True)
    return(df)

def count_result(c, table):
    ([print("[*] total: {:,} rows in {} table"
            .format(r[0], table))
        for r in c.execute("SELECT COUNT(*) FROM {};".format(table))])

#Connect to Data
db = sqlite3.connect("test.db")
c = db.cursor()
new_table = "new_table"

#Load Data in Chunks
df_generator = pd.read_sql_query("select * from test_table limit 10000;", con=db, chunksize=5000)

for df in df_generator:
    #Functions to modify data, example
    df = ren("name", "renamed_name", df)
    print(df.shape)
    df.to_sql(new_table, con=db, if_exists="append", index=False)

#Count if new table is created
try:
    count_result(c, new_table)
except:
    pass
1. Result with the problem line commented out:
#df.to_sql(new_table, con=db, if_exists="append", index=False)
$ python3 example.py
(5000, 22)
(5000, 22)
This is what I expect, since the example code limits my large table to 10k rows.
2. Result with
df.to_sql(new_table, con=db, if_exists="append", index=False)
a. The problem line is not commented out
b. This is the first time the code is run with new_table:
$ python3 example.py
(5000, 22)
Traceback (most recent call last):
File "example.py", line 23, in <module>
for df in df_generator:
File "/usr/local/lib/python3.5/site-packages/pandas/io/sql.py", line 1420, in _query_iterator
data = cursor.fetchmany(chunksize)
sqlite3.OperationalError: SQL logic error or missing database
3. Result with
df.to_sql(new_table, con=db, if_exists="append", index=False)
a. The problem line is not commented out
b. The code above is run a second time with new_table:
$ python3 example.py
(5000, 22)
(5000, 22)
(5000, 22)
(1, 22)
[*] total: 20,001 rows in new_table table
So my problems are, first, that the code breaks on the first run (result 2) and, second, that the total row count on the second run (result 3) is more than double what I expected.
I would greatly appreciate any suggestions on how to resolve this.
Answer 0 (score: 1)
You can try specifying:
db = sqlite3.connect("test.db", isolation_level=None)
# ----> ^^^^^^^^^^^^^^^^^^^^
Beyond that, you might try increasing your chunksize, because otherwise the time between commits is too short for the SQLite DB - that is what causes this error, I guess... I would also recommend using PostgreSQL, MySQL/MariaDB or something similar - they are much more reliable and better suited to databases of this size...
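For reference, here is a minimal sketch of the question's loop with both suggestions applied (isolation_level=None plus a larger chunksize); the table and column names are taken from the question, and this is only a sketch of the suggestion, not something tested against the 44-million-row table:

import pandas as pd
import sqlite3

#Autocommit mode: no implicit transaction is held open between chunk writes
db = sqlite3.connect("test.db", isolation_level=None)

#Fewer, larger chunks mean fewer commits against the SQLite file
df_generator = pd.read_sql_query("select * from test_table;", con=db, chunksize=100000)

for df in df_generator:
    df = df.rename(columns={"name": "renamed_name"})
    df.to_sql("new_table", con=db, if_exists="append", index=False)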
Answer 1 (score: 1)
Time lag with the above solution
@MaxU's solution of adding isolation_level=None to the database connection is short and sweet. However, for whatever reason, it dramatically slows down writing/committing each chunk to the database. For example, when I tested the solution on a 12-million-row table, the code took more than 6 hours to complete. By contrast, building the original table from a few text files takes only minutes.
This insight led to a faster but less elegant solution, which took less than 7 minutes to complete on a 12-million-row table instead of more than 6 hours. The output rows match the input rows, which solves the problem in my original question.
Faster but less elegant solution
Since the original table was built from text/csv files and the data was loaded with SQL scripts, I combined that approach with Pandas' chunking. The basic steps are: read the table in chunks with pandas, modify each chunk, write the chunk out to a csv file, and insert that csv into the new table with a prepared SQL insert script.
Main code of the solution:
import pandas as pd
import sqlite3

#Note I Used Functions I Wrote in build_db.py
#(shown below after example solution)
from build_db import *

#Helper Functions Used in Example
def lower_var(var, df):
    s = df[var].str.lower()
    df = df.drop(var, axis=1)
    df = pd.concat([df, s], axis=1)
    return(df)

#Connect to Data
db = sqlite3.connect("test.db")
c = db.cursor()

#create statement
create_table(c, "create_test.sql", path='sql_clean/')

#Load Data in Chunks
df_generator = pd.read_sql_query("select * from example_table;", con=db, chunksize=100000)

for df in df_generator:
    #functions to modify data, example
    df = lower_var("name", df) #changes column order

    #restore df to column order in sql table
    db_order = ["cmte_id", "amndt_ind", "rpt_tp", "transaction_pgi", "image_num", "transaction_tp",
                "entity_tp", "name", "city", "state", "zip_code", "employer", "occupation", "transaction_dt",
                "transaction_amt", "other_id", "tran_id", "file_num", "memo_cd", "memo_text", "sub_id"]
    df = df[db_order]

    #write chunk to csv
    file = "df_chunk.csv"
    df.to_csv(file, sep='|', header=None, index=False)

    #insert chunk csv to db
    insert_file_into_table(c, "insert_test.sql", file, '|', path='sql_clean/')
    db.commit()

#Count results
count_result(c, "test_indiv")
User-defined functions imported by the code above
#Relevant functions in build_db.py
#(build_db.py also imports csv, sqlite3 and subprocess, and defines
# sed_replace_null(), which is used below but not shown here)
import csv
import sqlite3
import subprocess

def count_result(c, table):
    ([print("[*] total: {:,} rows in {} table"
            .format(r[0], table))
        for r in c.execute("SELECT COUNT(*) FROM {};".format(table))])

def create_table(cursor, sql_script, path='sql/'):
    print("[*] create table with {}{}".format(path, sql_script))
    qry = open("{}{}".format(path, sql_script), 'rU').read()
    cursor.executescript(qry)

def insert_file_into_table(cursor, sql_script, file, sep=',', path='sql/'):
    print("[*] inserting {} into table with {}{}".format(file, path, sql_script))
    qry = open("{}{}".format(path, sql_script), 'rU').read()
    fileObj = open(file, 'rU', encoding='latin-1')
    csvReader = csv.reader(fileObj, delimiter=sep, quotechar='"')
    try:
        for row in csvReader:
            try:
                cursor.execute(qry, row)
            except sqlite3.IntegrityError as e:
                pass
    except Exception as e:
        print("[*] error while processing file: {}, error code: {}".format(file, e))
        print("[*] sed replacing null bytes in file: {}".format(file))
        sed_replace_null(file, "clean_null.sh")
        subprocess.call("bash clean_null.sh", shell=True)
        try:
            print("[*] inserting {} into table with {}{}".format(file, path, sql_script))
            fileObj = open(file, 'rU', encoding='latin-1')
            csvReader = csv.reader(fileObj, delimiter=sep, quotechar='"')
            for row in csvReader:
                try:
                    cursor.execute(qry, row)
                except sqlite3.IntegrityError as e:
                    print(e)
        except Exception as e:
            print("[*] error while processing file: {}, error code: {}".format(file, e))
User SQL scripts
--create_test.sql
DROP TABLE if exists test_indiv;
CREATE TABLE test_indiv (
cmte_id TEXT NOT NULL,
amndt_ind TEXT,
rpt_tp TEXT,
transaction_pgi TEXT,
image_num TEXT,
transaction_tp TEXT,
entity_tp TEXT,
name TEXT,
city TEXT,
state TEXT,
zip_code TEXT,
employer TEXT,
occupation TEXT,
transaction_dt TEXT,
transaction_amt TEXT,
other_id TEXT,
tran_id TEXT,
file_num NUMERIC,
memo_cd TEXT,
memo_text TEXT,
sub_id NUMERIC NOT NULL
);
CREATE UNIQUE INDEX idx_test_indiv ON test_indiv (sub_id);
--insert_test.sql
INSERT INTO test_indiv (
cmte_id,
amndt_ind,
rpt_tp,
transaction_pgi,
image_num,
transaction_tp,
entity_tp,
name,
city,
state,
zip_code,
employer,
occupation,
transaction_dt,
transaction_amt,
other_id,
tran_id,
file_num,
memo_cd,
memo_text,
sub_id
)
VALUES (
?,
?,
?,
?,
?,
?,
?,
?,
?,
?,
?,
?,
?,
?,
?,
?,
?,
?,
?,
?,
?
);
Answer 2 (score: 0)
I ran into exactly the same problem (working with >30 GB of data). Here is how I solved it: instead of using read_sql's chunksize feature, I decided to build a manual chunk loop like this:
chunksize = chunk_size   # e.g. 100000
offset = 0
for _ in range(0, a_big_number):   # a_big_number: any upper bound on the chunk count
    query = "SELECT * FROM the_table LIMIT %s OFFSET %s" % (chunksize, offset)
    df = pd.read_sql(query, conn)
    if len(df) != 0:
        # ... process / write out df here ...
        offset += chunksize
    else:
        break
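For what it's worth, here is a sketch of how this manual pager could replace the chunksize generator in the original example. The table, column and file names follow the question; the ORDER BY rowid is an added assumption, there to keep the LIMIT/OFFSET pages stable between queries:

import pandas as pd
import sqlite3

conn = sqlite3.connect("test.db")
chunk_size = 100000
offset = 0

while True:
    query = ("SELECT * FROM test_table ORDER BY rowid "
             "LIMIT {} OFFSET {};".format(chunk_size, offset))
    df = pd.read_sql(query, conn)
    if len(df) == 0:
        break
    df = df.rename(columns={"name": "renamed_name"})
    df.to_sql("new_table", con=conn, if_exists="append", index=False)
    offset += chunk_size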