Question

I have a fairly large (2.7 MB) XML file with the following structure:

<?xml version="1.0"?>

<Destinations>

  <Destination>
    <DestinationId>W4R1FG</DestinationId>
    <Country>Pakistan</Country>
    <City>Karachi</City>
    <State>Sindh</State>
  </Destination>

  <Destination>
    <DestinationId>D2C2FV</DestinationId>
    <Country>Turkey</Country>
    <City>Istanbul</City>
    <State>Istanbul</State>
  </Destination>

  <Destination>
    <DestinationId>5TFV3E</DestinationId>
    <Country>Canada</Country>
    <City>Toronto</City>
    <State>Ontario</State>
  </Destination>  

  ... ... ...

</Destinations>

And a MySQL table `destinations` like this:

+---+--------------+----------+---------+----------+
|id |DestinationId |Country   |City     |State     |
+---+--------------+----------+---------+----------+
|1  |W4R1FG        |Pakistan  |Karachi  |Sindh     |
+---+--------------+----------+---------+----------+
|2  |D2C2FV        |Turkey    |Istanbul |Istanbul  |
+---+--------------+----------+---------+----------+
|3  |5TFV3E        |Canada    |Toronto  |Ontario   |
+---+--------------+----------+---------+----------+
|.  |......        |......    |.......  |.......   |
+---+--------------+----------+---------+----------+

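Now I want to process my XML and check each destination record against the MySQL table. I only need to compare the DestinationId of each record and check whether it already exists in my database table. If it exists, leave the record alone and move on; if it does not, run an INSERT query to add it to the table.

I first tried to do this with a PHP foreach loop, but because the data set is so large it caused serious performance and speed problems. I then came up with a MySQL stored-procedure approach like this: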

DELIMITER $$

USE `destinations`$$

DROP PROCEDURE IF EXISTS `p_import_destinations`$$

CREATE DEFINER=`root`@`localhost` PROCEDURE `p_import_destinations`(
    p_xml                     LONGTEXT  -- TEXT holds at most 64 KB; the file is about 2.7 MB
)
BEGIN
    DECLARE v_row_index INT UNSIGNED DEFAULT 0;
    DECLARE v_row_count INT UNSIGNED;
    DECLARE v_xpath_row VARCHAR(255);

    -- calculate the number of row elements.
    SET v_row_count := extractValue(p_xml,'count(/Destinations/Destination)');

    -- loop through all the row elements
    WHILE v_row_index < v_row_count DO        
        SET v_row_index := v_row_index + 1;
        SET v_xpath_row := CONCAT('/Destinations/Destination[',v_row_index,']');

        -- IGNORE skips rows that would violate a unique key
        -- (this assumes a UNIQUE index on DestinationId).
        INSERT IGNORE INTO destinations VALUES (
            NULL,
            extractValue(p_xml,CONCAT(v_xpath_row, '/child::DestinationId')),
            extractValue(p_xml,CONCAT(v_xpath_row, '/child::Country')),
            extractValue(p_xml,CONCAT(v_xpath_row, '/child::City')),
            extractValue(p_xml,CONCAT(v_xpath_row, '/child::State'))
        );

    END WHILE;

END$$  

DELIMITER ;

Query to call this procedure:

SET @xml := LOAD_FILE('C:/Users/Muhammad Ali/Desktop/dest.xml'); 
CALL p_import_destinations(@xml);

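This works perfectly, but I am still unsure about the scalability, performance and speed of this approach. The IGNORE clause used in this procedure skips duplicate records but still consumes auto-increment key values. For example, when it checks the row that would get id 3306 and the record is a duplicate, it does not insert it into the table (which is good) but it still uses up that auto-increment value, so the next NON-DUPLICATE record is inserted at 3307. That does not look good.

Any other approach that meets this requirement would be much appreciated. Please also tell me whether I can carry on with this solution and, if not, why not.

Please keep in mind that I am dealing with a large amount of data.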

Answer 1

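"This works perfectly, but I am still unsure about the scalability, performance and speed of this approach."

Measure the speed and test how well it scales. Then you will know. Ask again if you find a problem that would actually hurt you in your scenario, but make the performance or scalability question more concrete. Most likely that part has already been covered in a Q&A, if not here on Stack Overflow then on the DBA site: https://dba.stackexchange.com/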
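
One rough way to take that measurement is sketched below. It assumes MySQL 5.6.4 or later (for fractional seconds in NOW(6)) and reuses the procedure and file path from the question; the user variable name @t0 is only illustrative.

```sql
-- Load the file first so that reading it from disk is not part of the timed section.
SET @xml := LOAD_FILE('C:/Users/Muhammad Ali/Desktop/dest.xml');

-- Wall-clock start as seconds since the epoch, with microsecond precision.
SET @t0 := UNIX_TIMESTAMP(NOW(6));

CALL p_import_destinations(@xml);

-- Elapsed seconds for the whole import.
SELECT UNIX_TIMESTAMP(NOW(6)) - @t0 AS elapsed_seconds;
```

Running this against files of increasing size gives a first impression of how the run time grows with the number of Destination elements.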

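"The IGNORE clause used in this procedure skips duplicate records but still consumes auto-increment key values."

The same applies here. If those gaps are a problem for you, that normally points to a flaw in your database design, because the gaps are normally meaningless (compare: How to fill in the "holes" in auto-incremenet fields?).

That does not mean other people have not run into this as well. You can find a lot of material on it, including "tricks" for preventing it on specific versions of your database server. But honestly, I would not care about the gaps. The contract is that the identity column holds a unique value. That is all.

Either way, whether the concern is performance or the IDs: why not split the processing into separate steps? First import the XML into an import table, then delete from that import table every row you do not want to import, and finally insert into the destination table as needed.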
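
The split described above could look roughly like the following sketch. It swaps in MySQL's LOAD XML statement for the load step, which is not something this answer prescribes, and the staging table name `destinations_import` is hypothetical. It also assumes the server is allowed to read the file (FILE privilege, and a `secure_file_priv` setting that permits the path) and that the XML child element names match the column names, as they do in the question.

```sql
-- Staging table (hypothetical name) holding the raw XML rows.
CREATE TABLE destinations_import (
  DestinationId VARCHAR(32),
  Country       VARCHAR(255),
  City          VARCHAR(255),
  State         VARCHAR(255)
);

-- Step 1: bulk-load the file. Each <Destination> element becomes one row;
-- its child elements are matched to columns by name.
LOAD XML INFILE 'C:/Users/Muhammad Ali/Desktop/dest.xml'
INTO TABLE destinations_import
ROWS IDENTIFIED BY '<Destination>';

-- Step 2: insert only the rows whose DestinationId is not yet in the target
-- table (an anti-join, so existing records are never touched).
INSERT INTO destinations (DestinationId, Country, City, State)
SELECT i.DestinationId, i.Country, i.City, i.State
FROM destinations_import AS i
LEFT JOIN destinations AS d ON d.DestinationId = i.DestinationId
WHERE d.DestinationId IS NULL;

-- Step 3: clean up.
DROP TABLE destinations_import;
```

Note that the anti-join only filters against rows already in `destinations`; if the XML file itself can contain the same DestinationId twice, the staging table would need to be de-duplicated before step 2.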

Answer 2

I solved this with a different logic, described below.

DELIMITER $$

USE `test`$$

DROP PROCEDURE IF EXISTS `import_destinations_xml`$$

CREATE DEFINER=`root`@`localhost` PROCEDURE `import_destinations_xml`(
    path VARCHAR(255), 
    node VARCHAR(255)
)

BEGIN
    DECLARE xml_content LONGTEXT;  -- TEXT holds at most 64 KB; the file is about 2.7 MB
    DECLARE v_row_index INT UNSIGNED DEFAULT 0;   
    DECLARE v_row_count INT UNSIGNED;  
    DECLARE v_xpath_row VARCHAR(255); 

    -- set xml content.
    SET xml_content = LOAD_FILE(path);

    -- calculate the number of row elements.   
    SET v_row_count  = extractValue(xml_content, CONCAT('count(', node, ')')); 

    -- create a temporary destinations table
    DROP TABLE IF EXISTS `destinations_temp`;
    CREATE TABLE `destinations_temp` (
      `id` INT(11) NOT NULL AUTO_INCREMENT,
      `DestinationId` VARCHAR(32) DEFAULT NULL,
      `Country` VARCHAR(255) DEFAULT NULL,
      `City` VARCHAR(255) DEFAULT NULL,
      `State` VARCHAR(255) DEFAULT NULL,
    PRIMARY KEY (`id`)
    ) ENGINE=INNODB AUTO_INCREMENT=1 DEFAULT CHARSET=latin1;  

    -- loop through all the row elements    
    WHILE v_row_index < v_row_count DO                
        SET v_row_index = v_row_index + 1;        
        SET v_xpath_row = CONCAT(node, '[', v_row_index, ']');
        INSERT INTO destinations_temp VALUES (
            NULL,
            extractValue(xml_content, CONCAT(v_xpath_row, '/child::DestinationId')),
            extractValue(xml_content, CONCAT(v_xpath_row, '/child::Country')),
            extractValue(xml_content, CONCAT(v_xpath_row, '/child::City')),
            extractValue(xml_content, CONCAT(v_xpath_row, '/child::State'))
        );
    END WHILE;

    -- delete existing records from temporary destinations table
    DELETE FROM destinations_temp WHERE DestinationId IN (SELECT DestinationId FROM destinations);

    -- insert remaining (unmatched) records from temporary destinations table to destinations table
    INSERT INTO destinations (DestinationId, Country, City, State) 
    SELECT DestinationId, Country, City, State 
    FROM destinations_temp;

    -- creating a log file    
    SELECT  *
    INTO OUTFILE 'C:/Users/Muhammad Ali/Desktop/Destination_Import_Procedure/log/destinations_log.csv'
    FIELDS TERMINATED BY ','
    LINES TERMINATED BY '\r\n'
    FROM `destinations_temp`;

    -- removing temporary destinations table
    DROP TABLE destinations_temp;

END$$

DELIMITER ;

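Query to call this procedure (forward slashes are used in the path because a backslash is an escape character inside a MySQL string literal):

CALL import_destinations_xml('C:/Users/Muhammad Ali/Desktop/Destination_Import_Procedure/dest.xml', '/Destinations/Destination');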

Import new XML data into a MySQL table without affecting existing records
