I have a large Pandas DataFrame (over 2 million rows) with the following columns:
Id,CandidateRegistrationID,CandidateID,OurReference,QualificationCode,ExamCode,ExamDate,QualificationName,DataSource,QuestionNo,CandidateResponse,CorrectAnswerChoice,UniquePaperNo,QuestionCode
I have a function that writes the DataFrame to SQLite:
def writeDF(df,db,table):
    conn = sqlite3.connect(db)
    conn.text_factory = str # allows utf-8 data to be stored
    df.to_sql(table, conn, flavor='sqlite', schema=None, if_exists='replace', index=False, index_label=None, chunksize=None, dtype=None)
    conn.close()
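This works fine on a reduced version of the data. On the full dataset, I get the following error:
ValueError: Cannot convert identifier to UTF-8: 'Id'
The Id field is just an integer.
I'd welcome any insight. Googling only leads me to the line in Pandas that raises the error.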
Traceback (most recent call last):
  File "/py-csv-jmetrik/venv/lib/python2.7/site-packages/flask/app.py", line 1836, in __call__
    return self.wsgi_app(environ, start_response)
  File "/py-csv-jmetrik/venv/lib/python2.7/site-packages/flask/app.py", line 1820, in wsgi_app
    response = self.make_response(self.handle_exception(e))
  File "/py-csv-jmetrik/venv/lib/python2.7/site-packages/flask/app.py", line 1403, in handle_exception
    reraise(exc_type, exc_value, tb)
  File "/py-csv-jmetrik/venv/lib/python2.7/site-packages/flask/app.py", line 1817, in wsgi_app
    response = self.full_dispatch_request()
  File "/py-csv-jmetrik/venv/lib/python2.7/site-packages/flask/app.py", line 1477, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/py-csv-jmetrik/venv/lib/python2.7/site-packages/flask/app.py", line 1381, in handle_user_exception
    reraise(exc_type, exc_value, tb)
  File "/py-csv-jmetrik/venv/lib/python2.7/site-packages/flask/app.py", line 1475, in full_dispatch_request
    rv = self.dispatch_request()
  File "/py-csv-jmetrik/venv/lib/python2.7/site-packages/flask/app.py", line 1461, in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
  File "/py-csv-jmetrik/app/routes.py", line 69, in index
    writeDF(data_df,db,table)
  File "/py-csv-jmetrik/app/routes.py", line 27, in writeDF
    df.to_sql(table, conn, flavor='sqlite', schema=None, if_exists='replace', index=True, index_label=None, chunksize=None, dtype=None)
  File "/py-csv-jmetrik/venv/lib/python2.7/site-packages/pandas/core/generic.py", line 982, in to_sql
    dtype=dtype)
  File "/py-csv-jmetrik/venv/lib/python2.7/site-packages/pandas/io/sql.py", line 549, in to_sql
    chunksize=chunksize, dtype=dtype)
  File "/py-csv-jmetrik/venv/lib/python2.7/site-packages/pandas/io/sql.py", line 1565, in to_sql
    dtype=dtype)
  File "/py-csv-jmetrik/venv/lib/python2.7/site-packages/pandas/io/sql.py", line 627, in __init__
    self.table = self._create_table_setup()
  File "/py-csv-jmetrik/venv/lib/python2.7/site-packages/pandas/io/sql.py", line 1377, in _create_table_setup
    for cname, ctype, _ in column_names_and_types]
  File "/py-csv-jmetrik/venv/lib/python2.7/site-packages/pandas/io/sql.py", line 1297, in _get_valid_sqlite_name
    uname = _get_unicode_name(name)
  File "/py-csv-jmetrik/venv/lib/python2.7/site-packages/pandas/io/sql.py", line 1271, in _get_unicode_name
    raise ValueError("Cannot convert identifier to UTF-8: '%s'" % name)
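Answer 0 (score: 1)
This may be a bit late for your question, but it might help other programmers. I ran into exactly the same problem: I was reading a CSV with Pandas and then trying to insert it into an SQLite database.
The solution that worked for me was to pass the encoding keyword when reading the CSV file: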
import pandas as pd
import sqlite3
conn = sqlite3.connect('dbpath', isolation_level=None, check_same_thread=False)
# "utf-8-sig" strips a leading byte-order mark (BOM), if the file has one
df = pd.read_csv('csvfile.csv', encoding="utf-8-sig")
df.to_sql('tablename', con=conn, if_exists='append')
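To illustrate why the encoding keyword can matter here, below is a minimal sketch that uses only the standard library rather than pandas. The CSV bytes are made up, and the sketch assumes the file starts with a UTF-8 byte-order mark; that is one possible cause of the error, not something the question confirms.

```python
import codecs
import csv
import io

# Hypothetical CSV bytes, as saved by a tool that prepends a UTF-8 BOM.
raw = codecs.BOM_UTF8 + b"Id,CandidateID\n1,42\n"


def header(data, encoding):
    """Decode the bytes and return the first CSV row (the column names)."""
    text = data.decode(encoding)
    return next(csv.reader(io.StringIO(text)))


plain = header(raw, "utf-8")      # the BOM survives as U+FEFF in the first name
clean = header(raw, "utf-8-sig")  # the BOM is stripped while decoding

print(repr(plain[0]))
print(repr(clean[0]))
```

Both names look like Id when printed normally, which would explain an error message that shows a seemingly harmless identifier.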
Answer 1 (score: 0)
At first glance, I can't see any reason why this would happen. This is the function in Pandas that is throwing the error you are seeing:
def _get_unicode_name(name):
    try:
        uname = name.encode("utf-8", "strict").decode("utf-8")
    except UnicodeError:
        raise ValueError("Cannot convert identifier to UTF-8: '%s'" % name)
    return uname
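The only way this can fail is if encoding the string "Id" to UTF-8 fails, or if decoding the resulting UTF-8 string fails. And I can't see anything in the name "Id" that should cause a failure.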
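As a hedged illustration of how a name that prints as Id could still fail the encode step: under Python 2.7, calling .encode("utf-8") on a byte string first decodes it implicitly with the default ASCII codec. The sketch below mimics that in Python 3. The BOM-prefixed bytes and the helper py2_style_encode are assumptions made for the example, not taken from the question.

```python
# Python 3 sketch of what Python 2.7 does for a byte string.
# Assumption: the column name arrived as bytes with a UTF-8 BOM before "Id".
name_bytes = b"\xef\xbb\xbfId"


def py2_style_encode(raw):
    """Mimic Python 2's str.encode("utf-8"): implicit ASCII decode first."""
    return raw.decode("ascii").encode("utf-8")


try:
    py2_style_encode(name_bytes)
    failed = False
except UnicodeError as exc:
    # UnicodeDecodeError is a subclass of UnicodeError, so the except
    # clause in _get_unicode_name would catch it as well.
    failed = True
    print(type(exc).__name__)

# A clean ASCII name passes through unchanged.
ok = py2_style_encode(b"Id")
```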
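Try this. Since you are using an interpreted language, Python, take advantage of that fact and edit the source of the library you are using. Open /py-csv-jmetrik/venv/lib/python2.7/site-packages/pandas/io/sql.py and change the function above to: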
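def _get_unicode_name(name):
    # Step 1: encode. Report the original name here, because utf8name
    # is still unbound if encode() raises. %r exposes hidden characters.
    try:
        utf8name = name.encode("utf-8", "strict")
    except UnicodeError:
        raise ValueError("Cannot encode identifier to UTF-8: %r" % name)
    # Step 2: decode the encoded bytes back.
    try:
        uname = utf8name.decode("utf-8")
    except UnicodeError:
        raise ValueError("Cannot decode UTF-8: %r" % utf8name)
    return uname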
This will at least tell you which of the two operations is failing. Then run your program like this:
python myscript.py >stdout.txt 2>stderr.txt
Then examine stderr.txt under a microscope (that is, pipe the first or last couple of lines through xxd) to see the actual byte values:
head -n 2 stderr.txt | xxd
tail -n 2 stderr.txt | xxd
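What you want to do is capture the line with the ValueError, where it gives you the name of the identifier that caused the error (in this case "Id"). See whether the "Id" identifier contains any odd characters, such as a zero-width space or something similar. That's the only thing I can think of right now. It may not help, but at least it will narrow the problem down... maybe.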
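The same inspection can also be done from Python instead of xxd. This is a minimal sketch, assuming the column names are available as text strings; the function suspicious_chars and the sample names are hypothetical.

```python
import unicodedata


def suspicious_chars(name):
    """Return (index, code point, Unicode name) for each control or non-ASCII character."""
    found = []
    for i, ch in enumerate(name):
        if ord(ch) < 32 or ord(ch) > 126:
            found.append((i, "U+%04X" % ord(ch), unicodedata.name(ch, "UNKNOWN")))
    return found


# Hypothetical column names; the second one hides a zero-width space.
columns = ["Id", "I\u200bd", "ExamCode"]
report = {c: suspicious_chars(c) for c in columns if suspicious_chars(c)}

for col, hits in report.items():
    print(repr(col), hits)
```

With a real DataFrame, the same function could be applied to each entry of df.columns before calling to_sql.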