Question

Is there a built-in way to ignore fields in a dictionary when inserting data with MySQL's executemany()?

I need to insert data from a relatively large data set that is supplied to me in a JSON file. The basic layout of the JSON data is:

{
    "data" : [
        { "f1" : 42, "f2" : "abc", "f99" : "useless stuff" },
        { "f1" : 43, "f2" : "def", "f99" : [ "junk", "here" ] },
        { "f1" : 44, "f2" : "ghi", "f99" : { "thing" : 99 } }
    ]
}

I have an insert set up like this:

import json
import mysql.connector
with open( 'huge_data_dump.json', 'rb' ) as fh:
    data = json.load( fh )
connection = mysql.connector.connect( **mysql_config )
cursor = connection.cursor()
query = 'INSERT INTO `example` ( `f1`, `f2` ) VALUES ( %(f1)s, %(f2)s )'
cursor.executemany( query, data[ 'data' ] )
cursor.close()
connection.close()

The target table looks like this:

CREATE TABLE `example` ( `f1` INT, `f2` VARCHAR( 10 ) )

However, when I run it, I get an error:

Failed processing pyformat-parameters; Python 'list' cannot be converted to a MySQL type

If I limit the import to just the first row of the example data set, the insert works perfectly:

cursor.executemany( query, data[ 'data' ][ : 1 ] )

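The problem comes from the extraneous data in the f99 field, which contains who knows what. That's fine by me: I don't want any of the information in f99. However, the MySQL connector seems to want to convert every value in the record's dictionary to a safe string before examining the query to see whether the value is even needed.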

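I tried using a generator function to filter the data set into the call to executemany(), but the connector complained that it can only accept tuples and lists (which strikes me as a rather un-Pythonic interface).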

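My last resort is to copy the data into new dictionaries and filter out the unwanted fields before passing the data to executemany(). However, these data sets are already large enough that I'm considering streaming them from the JSON source file in groups of a few hundred inserts at a time. An extra loop that tries to eliminate all the unwanted data would be wasteful and would mean more code to maintain. I sincerely hope I'm overlooking something that the documentation doesn't cover or buries.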

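I suppose I could start looking into some custom JSON filtering on the input, but, again, I'm hoping there is a simple, built-in way to handle what seems to be a relatively common use case.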

Answer 1

You can use a generator to create a tuple of the needed columns for each record in your data list:

((d["f1"], d["f2"]) for d in data['data'])

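Passing this generator to the executemany() function should work as expected. If the connector rejects generators, as the question reports, wrap it in list() first.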

Edit: you may need to change the query to

query = 'INSERT INTO `example` ( `f1`, `f2` ) VALUES ( %s, %s )'

but I'm not entirely sure about that.
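
A minimal sketch of this approach, assuming Python 3 and the JSON layout from the question. The helper name `project_rows` is hypothetical and not part of MySQL Connector. It builds a list rather than a generator because the question reports that the connector only accepts lists and tuples. The database calls are left as comments since they need a live connection.

```python
import json

# Hypothetical helper (not part of mysql.connector): keep only the wanted
# columns, in the given order, as one tuple per record.
def project_rows(records, fields):
    return [tuple(record[name] for name in fields) for record in records]

raw = '''
{ "data": [
    { "f1": 42, "f2": "abc", "f99": "useless stuff" },
    { "f1": 43, "f2": "def", "f99": [ "junk", "here" ] },
    { "f1": 44, "f2": "ghi", "f99": { "thing": 99 } }
] }
'''
data = json.loads(raw)
rows = project_rows(data['data'], ('f1', 'f2'))

# Positional tuples go with positional placeholders in the query.
query = 'INSERT INTO `example` ( `f1`, `f2` ) VALUES ( %s, %s )'

# With an open cursor and connection (assumed to exist, as in the question):
# cursor.executemany(query, rows)
# connection.commit()
```

Because the unwanted `f99` values never reach the connector, it has nothing to convert apart from the two columns being inserted.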

Answer 2

For people from the future:

After fighting with this for a while, I decided to attack the problem from the input side.

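The built-in JSON implementation doesn't exactly support streaming, but you can specify custom decoding for individual parts of the JSON data as it is loaded into the interpreter's memory. Abusing my ability to intercept every object-to-dictionary decoding, I decided to manipulate the incoming data right there.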

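Also worth noting: the MySQL connector has some limits on how much data can be passed in one transaction, so I cached a few hundred of these "transformed" dictionaries in my decoder and inserted them into the database as the file was being read by the JSON load() function.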

In short:

import json

class CustomDecoder( json.JSONDecoder ):

    allowed = [ 'f1', 'f2' ]

    def __init__( self, **kwargs ):
        kwargs[ 'object_hook' ] = self.object_to_dict
        super( CustomDecoder, self ).__init__( **kwargs )
        self.cache = []

    def object_to_dict( self, data ):

        # this check just identifies the object we need to insert
        if 'f1' in data:

            # permit allowed fields from the incoming dictionary
            data = dict(
                ( k, v )
                for k, v in data.items()
                if k in self.allowed
            )

            # add data to batch cache
            self.cache.append( data )

            # check cache status
            if len( self.cache ) >= 200:

                # insert the cached records as a group for performance
                ### cursor.executemany( query, self.cache )

                # dump the cache
                self.cache = []

        # return modified data or pass through un-modified data
        return data

# connect to database, grab a cursor to it, set up the query

with open( 'import_file.json', 'rb' ) as fh:
    data = json.load( fh, cls = CustomDecoder )

# at this point, everything but left-overs in the cache should be inserted

Caveats:

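After the parser finishes loading the data, you still need to insert whatever is left in the cache. I ended up maintaining the cache outside the CustomDecoder instance so that I could flush it after the internally created CustomDecoder went away.
Managing the query and cursor objects takes a bit more code to keep the interface relatively clean. I decided to create a callback handler assigned to a class attribute. The callback handler happens to know how to find the current cursor and query.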
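
One way these two caveats might be handled, sketched under assumptions: the cache lives outside the decoder and a callback does the insert. This is not the answer's actual code. It swaps the JSONDecoder subclass for a plain `object_hook` function, and the names `make_hook`, `flush`, and `inserted` are hypothetical. The `flush` callback here only collects rows in a list so the sketch runs without a database.

```python
import json

BATCH_SIZE = 200        # batch size taken from the answer's code
ALLOWED = ('f1', 'f2')  # columns to keep

def make_hook(flush, batch):
    """Return an object_hook that filters records into `batch` and calls
    `flush(batch)` whenever BATCH_SIZE records have accumulated."""
    def hook(obj):
        if 'f1' in obj:  # same record test as in the answer
            obj = {k: v for k, v in obj.items() if k in ALLOWED}
            batch.append(obj)
            if len(batch) >= BATCH_SIZE:
                flush(batch)
                del batch[:]  # empty in place so the caller's list stays valid
        return obj
    return hook

inserted = []  # stand-in for the database in this sketch

def flush(rows):
    # In real use this would be something like:
    # cursor.executemany(query, rows)
    inserted.extend(rows)

batch = []  # the cache, owned by the caller rather than the decoder
raw = ('{"data": [{"f1": 1, "f2": "a", "f99": [1]},'
       ' {"f1": 2, "f2": "b", "f99": {"x": 0}}]}')
json.loads(raw, object_hook=make_hook(flush, batch))

# Parsing is done: insert whatever is still sitting in the cache.
if batch:
    flush(batch)
    del batch[:]
```

Since the caller owns both `batch` and `flush`, the leftover records can be flushed after `json.loads` returns, and the hook itself never needs to know about the cursor or the query.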

Python MySQL Connector executemany with Extra Values
