I created a large DataFrame in pandas, roughly 6 million rows of text data. I want to save it to an SQL database file, but when I try, I get an out-of-RAM error. I even reduced the chunksize to 100 and it still crashes.
However, if I take a smaller version of the DataFrame (100,000 rows) and save it to a database without specifying a chunksize, it saves with no problem at all.
Here is my code:
from sqlalchemy import create_engine
engine = create_engine("sqlite:///databasefile.db")
dataframe.to_sql("CS_table", engine, chunksize=100)
My understanding is that since only 100 rows are processed at a time, RAM usage should reflect saving 100 rows at a time. Is something else happening behind the scenes? Multithreading, perhaps?
Before I run this code I am using 4.8 GB of RAM, out of the 12.8 GB available in Google Colab. Running the code above eats up all the RAM until the environment crashes.
I want to be able to save my pandas DataFrame to an SQL file without the environment crashing. The environment I am in is Google Colab. The DataFrame has 2 columns and roughly 6 million rows. Each cell contains text roughly like this:
"The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder/decoder configuration. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less training time. On the WMT 2014 English-to-German translation task, our model improves over the existing best results, including ensembles, by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.8 after training for 3.5 days on eight GPUs, a small fraction of the training cost of the best models from the literature. We show that the Transformer generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data."
Edit:
I did keyboard interrupts at various stages. Here is the result of a keyboard interrupt after the first jump in RAM usage:
---------------------------------------------------------------------------
KeyboardInterrupt Traceback (most recent call last)
<ipython-input-22-51b6e444f80d> in <module>()
----> 1 dfAllT.to_sql("CS_table23", engine, chunksize = 100)
/usr/local/lib/python3.6/dist-packages/pandas/core/generic.py in to_sql(self, name, con, schema, if_exists, index, index_label, chunksize, dtype, method)
2529 sql.to_sql(self, name, con, schema=schema, if_exists=if_exists,
2530 index=index, index_label=index_label, chunksize=chunksize,
-> 2531 dtype=dtype, method=method)
2532
2533 def to_pickle(self, path, compression='infer',
/usr/local/lib/python3.6/dist-packages/pandas/io/sql.py in to_sql(frame, name, con, schema, if_exists, index, index_label, chunksize, dtype, method)
458 pandas_sql.to_sql(frame, name, if_exists=if_exists, index=index,
459 index_label=index_label, schema=schema,
--> 460 chunksize=chunksize, dtype=dtype, method=method)
461
462
/usr/local/lib/python3.6/dist-packages/pandas/io/sql.py in to_sql(self, frame, name, if_exists, index, index_label, schema, chunksize, dtype, method)
1172 schema=schema, dtype=dtype)
1173 table.create()
-> 1174 table.insert(chunksize, method=method)
1175 if (not name.isdigit() and not name.islower()):
1176 # check for potentially case sensitivity issues (GH7815)
/usr/local/lib/python3.6/dist-packages/pandas/io/sql.py in insert(self, chunksize, method)
684
685 chunk_iter = zip(*[arr[start_i:end_i] for arr in data_list])
--> 686 exec_insert(conn, keys, chunk_iter)
687
688 def _query_iterator(self, result, chunksize, columns, coerce_float=True,
/usr/local/lib/python3.6/dist-packages/pandas/io/sql.py in _execute_insert(self, conn, keys, data_iter)
597 """
598 data = [dict(zip(keys, row)) for row in data_iter]
--> 599 conn.execute(self.table.insert(), data)
600
601 def _execute_insert_multi(self, conn, keys, data_iter):
/usr/local/lib/python3.6/dist-packages/sqlalchemy/engine/base.py in execute(self, object_, *multiparams, **params)
986 raise exc.ObjectNotExecutableError(object_)
987 else:
--> 988 return meth(self, multiparams, params)
989
990 def _execute_function(self, func, multiparams, params):
/usr/local/lib/python3.6/dist-packages/sqlalchemy/sql/elements.py in _execute_on_connection(self, connection, multiparams, params)
285 def _execute_on_connection(self, connection, multiparams, params):
286 if self.supports_execution:
--> 287 return connection._execute_clauseelement(self, multiparams, params)
288 else:
289 raise exc.ObjectNotExecutableError(self)
/usr/local/lib/python3.6/dist-packages/sqlalchemy/engine/base.py in _execute_clauseelement(self, elem, multiparams, params)
1105 distilled_params,
1106 compiled_sql,
-> 1107 distilled_params,
1108 )
1109 if self._has_events or self.engine._has_events:
/usr/local/lib/python3.6/dist-packages/sqlalchemy/engine/base.py in _execute_context(self, dialect, constructor, statement, parameters, *args)
1246 except BaseException as e:
1247 self._handle_dbapi_exception(
-> 1248 e, statement, parameters, cursor, context
1249 )
1250
/usr/local/lib/python3.6/dist-packages/sqlalchemy/engine/base.py in _handle_dbapi_exception(self, e, statement, parameters, cursor, context)
1466 util.raise_from_cause(sqlalchemy_exception, exc_info)
1467 else:
-> 1468 util.reraise(*exc_info)
1469
1470 finally:
/usr/local/lib/python3.6/dist-packages/sqlalchemy/util/compat.py in reraise(tp, value, tb, cause)
127 if value.__traceback__ is not tb:
128 raise value.with_traceback(tb)
--> 129 raise value
130
131 def u(s):
/usr/local/lib/python3.6/dist-packages/sqlalchemy/engine/base.py in _execute_context(self, dialect, constructor, statement, parameters, *args)
1222 if not evt_handled:
1223 self.dialect.do_executemany(
-> 1224 cursor, statement, parameters, context
1225 )
1226 elif not parameters and context.no_parameters:
/usr/local/lib/python3.6/dist-packages/sqlalchemy/engine/default.py in do_executemany(self, cursor, statement, parameters, context)
545
546 def do_executemany(self, cursor, statement, parameters, context=None):
--> 547 cursor.executemany(statement, parameters)
548
549 def do_execute(self, cursor, statement, parameters, context=None):
KeyboardInterrupt:
Here is the result if I do a keyboard interrupt right before it crashes:
ERROR:root:Internal Python error in the inspect module.
Below is the traceback from this internal error.
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/IPython/core/interactiveshell.py", line 2882, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "<ipython-input-24-68b60fe221fe>", line 1, in <module>
dfAllT.to_sql("CS_table22", engine, chunksize = 100)
File "/usr/local/lib/python3.6/dist-packages/pandas/core/generic.py", line 2531, in to_sql
dtype=dtype, method=method)
File "/usr/local/lib/python3.6/dist-packages/pandas/io/sql.py", line 460, in to_sql
chunksize=chunksize, dtype=dtype, method=method)
File "/usr/local/lib/python3.6/dist-packages/pandas/io/sql.py", line 1174, in to_sql
table.insert(chunksize, method=method)
File "/usr/local/lib/python3.6/dist-packages/pandas/io/sql.py", line 686, in insert
exec_insert(conn, keys, chunk_iter)
File "/usr/local/lib/python3.6/dist-packages/pandas/io/sql.py", line 599, in _execute_insert
conn.execute(self.table.insert(), data)
File "/usr/local/lib/python3.6/dist-packages/sqlalchemy/engine/base.py", line 988, in execute
return meth(self, multiparams, params)
File "/usr/local/lib/python3.6/dist-packages/sqlalchemy/sql/elements.py", line 287, in _execute_on_connection
return connection._execute_clauseelement(self, multiparams, params)
File "/usr/local/lib/python3.6/dist-packages/sqlalchemy/engine/base.py", line 1107, in _execute_clauseelement
distilled_params,
File "/usr/local/lib/python3.6/dist-packages/sqlalchemy/engine/base.py", line 1248, in _execute_context
e, statement, parameters, cursor, context
File "/usr/local/lib/python3.6/dist-packages/sqlalchemy/engine/base.py", line 1468, in _handle_dbapi_exception
util.reraise(*exc_info)
File "/usr/local/lib/python3.6/dist-packages/sqlalchemy/util/compat.py", line 129, in reraise
raise value
File "/usr/local/lib/python3.6/dist-packages/sqlalchemy/engine/base.py", line 1224, in _execute_context
cursor, statement, parameters, context
File "/usr/local/lib/python3.6/dist-packages/sqlalchemy/engine/default.py", line 547, in do_executemany
cursor.executemany(statement, parameters)
KeyboardInterrupt
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/IPython/core/interactiveshell.py", line 1823, in showtraceback
stb = value._render_traceback_()
AttributeError: 'KeyboardInterrupt' object has no attribute '_render_traceback_'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/IPython/core/ultratb.py", line 1132, in get_records
return _fixed_getinnerframes(etb, number_of_lines_of_context, tb_offset)
File "/usr/local/lib/python3.6/dist-packages/IPython/core/ultratb.py", line 313, in wrapped
return f(*args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/IPython/core/ultratb.py", line 358, in _fixed_getinnerframes
records = fix_frame_records_filenames(inspect.getinnerframes(etb, context))
File "/usr/lib/python3.6/inspect.py", line 1488, in getinnerframes
frameinfo = (tb.tb_frame,) + getframeinfo(tb, context)
File "/usr/lib/python3.6/inspect.py", line 1446, in getframeinfo
filename = getsourcefile(frame) or getfile(frame)
File "/usr/lib/python3.6/inspect.py", line 696, in getsourcefile
if getattr(getmodule(object, filename), '__loader__', None) is not None:
File "/usr/lib/python3.6/inspect.py", line 739, in getmodule
f = getabsfile(module)
File "/usr/lib/python3.6/inspect.py", line 708, in getabsfile
_filename = getsourcefile(object) or getfile(object)
File "/usr/lib/python3.6/inspect.py", line 693, in getsourcefile
if os.path.exists(filename):
File "/usr/lib/python3.6/genericpath.py", line 19, in exists
os.stat(path)
KeyboardInterrupt
I did another run right before the crash, and this one seems to give yet another different result:
ERROR:root:Internal Python error in the inspect module.
Below is the traceback from this internal error.
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/IPython/core/interactiveshell.py", line 2882, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "<ipython-input-28-f18004debe33>", line 1, in <module>
dfAllT.to_sql("CS_table25", engine, chunksize = 100)
File "/usr/local/lib/python3.6/dist-packages/pandas/core/generic.py", line 2531, in to_sql
dtype=dtype, method=method)
File "/usr/local/lib/python3.6/dist-packages/pandas/io/sql.py", line 460, in to_sql
chunksize=chunksize, dtype=dtype, method=method)
File "/usr/local/lib/python3.6/dist-packages/pandas/io/sql.py", line 1174, in to_sql
table.insert(chunksize, method=method)
File "/usr/local/lib/python3.6/dist-packages/pandas/io/sql.py", line 686, in insert
exec_insert(conn, keys, chunk_iter)
File "/usr/local/lib/python3.6/dist-packages/pandas/io/sql.py", line 598, in _execute_insert
data = [dict(zip(keys, row)) for row in data_iter]
File "/usr/local/lib/python3.6/dist-packages/pandas/io/sql.py", line 598, in <listcomp>
data = [dict(zip(keys, row)) for row in data_iter]
KeyboardInterrupt
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/IPython/core/interactiveshell.py", line 1823, in showtraceback
stb = value._render_traceback_()
AttributeError: 'KeyboardInterrupt' object has no attribute '_render_traceback_'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/IPython/core/ultratb.py", line 1132, in get_records
return _fixed_getinnerframes(etb, number_of_lines_of_context, tb_offset)
File "/usr/local/lib/python3.6/dist-packages/IPython/core/ultratb.py", line 313, in wrapped
return f(*args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/IPython/core/ultratb.py", line 358, in _fixed_getinnerframes
records = fix_frame_records_filenames(inspect.getinnerframes(etb, context))
File "/usr/lib/python3.6/inspect.py", line 1488, in getinnerframes
frameinfo = (tb.tb_frame,) + getframeinfo(tb, context)
File "/usr/lib/python3.6/inspect.py", line 1446, in getframeinfo
filename = getsourcefile(frame) or getfile(frame)
File "/usr/lib/python3.6/inspect.py", line 696, in getsourcefile
if getattr(getmodule(object, filename), '__loader__', None) is not None:
File "/usr/lib/python3.6/inspect.py", line 742, in getmodule
os.path.realpath(f)] = module.__name__
File "/usr/lib/python3.6/posixpath.py", line 388, in realpath
path, ok = _joinrealpath(filename[:0], filename, {})
File "/usr/lib/python3.6/posixpath.py", line 421, in _joinrealpath
newpath = join(path, name)
KeyboardInterrupt
---------------------------------------------------------------------------
Other things I have tried:
Using dropna to remove all none/nan values
dfAllT = dfAllT.applymap(str) to make sure all my values are strings
dfAllT.reset_index(drop=True, inplace=True) to make sure the index is not misaligned.
Edit:
As mentioned in the comments, I have now tried using to_sql in a loop.
for i in range(586147):
    print(i)
    dfAllT.iloc[i*10000:(i+1)*10000].to_sql('CS_table', engine, if_exists='append')
This operation eventually eats up my RAM and causes a crash about halfway through. I wonder if this suggests that sqlite is holding everything in memory, and whether there is a workaround.
Edit:
I tried a few more things: shorter chunks, disposing of the engine after every step and creating a new one. It still eventually ate all the RAM and crashed.
for i in range(586147):
    print(i)
    engine = sqlalchemy.create_engine("sqlite:///CSTitlesSummariesData.db")
    dfAllT.iloc[i*10:(i+1)*10].to_sql('CS_table', engine, index=False, if_exists='append')
    engine.dispose()
    gc.collect()
My thoughts:
So it looks like the entire database is somehow being kept in active memory.
The pandas DataFrame this is based on is 5 GB (or at least that is how much RAM it takes before I try converting it to sqlite). My system dies at around 12.72 GB. I would imagine the sqlite database takes up less memory than the pandas DataFrame.
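That 5 GB estimate can be checked directly with pandas' own accounting. A quick sketch (the two-column frame below is a small stand-in for the real data; note that `deep=True` is required to count the actual Python string objects rather than just the 8-byte object pointers):

```python
import pandas as pd

# A small stand-in for the real two-column text frame.
df = pd.DataFrame({
    "title": ["attention is all you need"] * 1000,
    "summary": ["the dominant sequence transduction models are based on"] * 1000,
})

# Without deep=True pandas only counts the object pointers and
# wildly underestimates text-heavy frames.
shallow = df.memory_usage(index=False).sum()
deep = df.memory_usage(deep=True, index=False).sum()
print(shallow, deep)  # deep should be much larger than shallow
```

Running this on the full frame would show how much of the 12.8 GB budget the text itself already consumes before any conversion starts.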
Answer 0 (score: 3)
I have been using df.to_sql for a year, and now I am struggling with the fact that I am running large workloads and it does not work. I realized that chunksize overloads your memory: pandas loads everything into memory and then sends it out in chunks. I had to take control directly with SQL. (Here is where I found my solution -> https://github.com/pandas-dev/pandas/issues/12265 I really encourage you to read it to the end.)
If you need to read data from the database without overloading memory, check this code:
def get_data_by_chunks(chunksize: int):
    """Yield the query result one DataFrame chunk at a time."""
    with MysqlClient.get_engine().begin() as conn:
        query_count = "SELECT COUNT(*) FROM my_table WHERE my_filters"
        row_count = conn.execute(query_count).fetchone()[0]
        for i in range(math.ceil(row_count / chunksize)):
            query = f"""
                SELECT * FROM my_table
                WHERE my_filters
                LIMIT {i * chunksize}, {chunksize};
            """
            yield pd.read_sql(query, conn)

for df in get_data_by_chunks(10000):
    print(df.shape)
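The write side can be controlled the same way, bypassing `to_sql` entirely. A minimal sketch using the stdlib `sqlite3` module (the table name, column types, and file path are placeholders; only one chunk's rows are ever materialized per `executemany` call, and each commit flushes them to disk):

```python
import os
import sqlite3
import tempfile

import pandas as pd


def write_df_in_chunks(df: pd.DataFrame, db_path: str, table: str,
                       chunksize: int = 10_000) -> None:
    """Insert a DataFrame into a SQLite table one chunk at a time."""
    cols = list(df.columns)
    placeholders = ", ".join("?" for _ in cols)
    conn = sqlite3.connect(db_path)
    try:
        # All columns stored as TEXT; adjust types as needed.
        conn.execute(
            f"CREATE TABLE IF NOT EXISTS {table} "
            f"({', '.join(c + ' TEXT' for c in cols)})"
        )
        for start in range(0, len(df), chunksize):
            chunk = df.iloc[start:start + chunksize]
            # itertuples avoids building per-row dicts for the whole frame.
            conn.executemany(
                f"INSERT INTO {table} VALUES ({placeholders})",
                chunk.itertuples(index=False, name=None),
            )
            conn.commit()  # flush each chunk so nothing accumulates
    finally:
        conn.close()


# Tiny demo on a throwaway database file.
path = os.path.join(tempfile.mkdtemp(), "demo.db")
demo = pd.DataFrame({"title": ["t"] * 25, "summary": ["s"] * 25})
write_df_in_chunks(demo, path, "CS_table", chunksize=10)

conn = sqlite3.connect(path)
n = conn.execute("SELECT COUNT(*) FROM CS_table").fetchone()[0]
conn.close()
print(n)
```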
Answer 1 (score: 1)
From stepping through the code, I think it's this line, which reads:
chunk_iter = zip(*[arr[start_i:end_i] for arr in data_list])
That seems to be the likely culprit. Specifically, this happens while the data is being prepared for insertion into the database.
One trick you can use is to press CTRL-C while memory is rapidly increasing, and see which line execution stops at (my bet is that it's this one).
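Rather than racing CTRL-C, the stdlib `tracemalloc` module can answer the same question: it reports which source lines own the most allocated memory at a snapshot. A generic sketch (the `wasteful` function is a stand-in for the insert path, not pandas' actual code):

```python
import tracemalloc


def wasteful(n: int) -> list:
    # Stands in for pandas building a per-row dict for every chunk.
    return [dict(a=i, b=str(i)) for i in range(n)]


tracemalloc.start()
rows = wasteful(50_000)          # keep a reference so the memory stays live
snapshot = tracemalloc.take_snapshot()
top = snapshot.statistics("lineno")[:3]
for stat in top:
    print(stat)                  # file:line plus the size owned by that line
tracemalloc.stop()
```

The top entry should point at the list-comprehension line, which is exactly the kind of evidence the CTRL-C trick is trying to gather.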