Question

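I have a huge data set (~800k rows) that I need to import from Excel into Access. However, I can ignore rows that have a particular column value, and those rows make up about 90% of the data set. So in practice I only need to import about 10% of the rows.

In the past I have imported Excel files row by row in the following way (pseudocode):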

For i = 1 To EOF
    sql = "Insert Into [Table] (Column1, Column2) VALUES ('" & _
    xlSheet.Cells(i, 1).Value & " ', '" & _
    xlSheet.Cells(i, 2).Value & "');"       
Next i
DoCmd.RunSQL sql

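With ~800k rows this takes far too long, because a query is built and run for every single row.

Given that I can also ignore 90% of the rows, what is the fastest way to import the data set from Excel into Access?

I was thinking of creating a temporary Excel file with a filter applied and then importing only the filtered file.

But is there a better/faster way than that? Also, what is the fastest way to import from Excel through Access VBA?

Thanks in advance.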

Answer 1

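Consider running a special Access query for the import. Add the SQL below to an Access query window, or run it as an SQL query over a DAO/ADO connection. Include any WHERE clauses you need to drop the unwanted rows. A WHERE clause needs column names to refer to, and the connection string is currently set to HDR=No, in which case the columns are exposed as F1, F2, and so on; set HDR=Yes to use the names in the header row instead: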

INSERT INTO [Table] (Column1, Column2)
SELECT *
FROM [Excel 12.0 Xml;HDR=No;Database=C:\Path\To\Workbook.xlsx].[SHEET1$];

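Alternatively, if you need a temporary staging table before the final table (to remove the 90% of rows), run a make-table query, but note that this query replaces the table if it already exists: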

SELECT * INTO [NewTable]
FROM [Excel 12.0 Xml;HDR=No;Database=C:\Path\To\Workbook.xlsx].[SHEET1$];

Answer 2

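I'm a bit late to the party, but I stumbled on this question while looking for information on a similar problem. I thought I'd share my solution in case it helps someone else, or the OP if he/she is still working on it. Here is my problem and what I did: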

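I have an established Access database with roughly the same number of rows as the OP's (6 columns, about 850k rows). About once a week we receive an .xlsx file from a partner company, containing a single sheet whose data has the same structure as the database.

This file contains the entire database plus updates (new records and changes to old records, no deletions). The first column holds a unique identifier for each row. When we receive the file, the Access db is updated through queries similar to those Parfait suggested, but since the file holds all 850k+ records, the comparison and update take 10 to 15 minutes or more, depending on what else we have going on.

Since loading only the changes into the current Access database would be faster, I needed to generate a delta file (preferably a .txt that can be opened with Excel and saved as .xlsx if needed). I assume this is similar to what the OP was looking for. To do this, I wrote a small application in C++ that compares last week's file with this week's. The data itself is a mix of character and numeric data, which I'll simply call string1 through string6 here. It looks like this: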

Col1       Col2       Col3       Col4       Col5       Col6
string1    string2    string3    string4    string5    string6
.......
''''Through 850k rows''''

After saving both .xlsx files as tab-delimited .txt files, they look like this:

Col1\tCol2\tCol3\tCol4\tCol5\tCol6\n
string1\tstring2\tstring3\tstring4\tstring5\tstring6\n
....
//Through 850k rows//
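
A row in this tab-delimited layout can be split into its key (Col1) and the remaining columns with a single search for the first tab. The sketch below is only an illustration of that idea; the `Row` struct and the `splitRow` function are hypothetical names and are not part of the program shown later in this answer.

```cpp
#include <cassert>   // only needed when checking the sketch with assert
#include <string>

// Hypothetical helper type: Col1 as the key, Col2..Col6 kept as one string.
struct Row {
    std::string key;
    std::string rest;
};

// Split one tab-delimited line (without its trailing newline) at the first tab.
// A line that contains no tab yields the whole line as key and an empty rest.
Row splitRow(const std::string& line)
{
    const std::string::size_type tab = line.find('\t');
    if (tab == std::string::npos)
        return Row{line, ""};
    return Row{line.substr(0, tab), line.substr(tab + 1)};
}
```

Keeping Col2 to Col6 as one string means two rows can be compared with a single string comparison, which is all the delta file needs.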

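Now the fun part! I took the old .txt file and stored it in a hash table (using the C++ unordered_map from the standard library). Then, with an input file stream on the new .txt file, I used Col1 of the new file as the key into the hash table and wrote all differences to two separate files. One holds new rows that can be appended to the database with a query, and the other holds rows that can be used to update data that has changed.

I've heard that it is possible to build hash tables more efficient than unordered_map, but for now this works well, so I'll stick with it. Here is my code.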

#include <iostream>     
#include <fstream>      
#include <string>       
#include <iterator>
#include <unordered_map>


int main()
{
    using namespace std;

    //variables
    const string myInFile1{"OldFile.txt"};
    const string myInFile2{"NewFile.txt"};
    string mappedData;
    string key;

    //hash table objects
    unordered_map<string, string> hashMap;
    unordered_map<string, string>::iterator cursor;

    //input files
    ifstream fin1;
    ifstream fin2;
    fin1.open(myInFile1);
    fin2.open(myInFile2);

    //output files
    ofstream fout1;
    ofstream fout2;
    fout1.open("For Updated.txt");  //updating old records 
    fout2.open("For Upload.txt");   //uploading new records

    //This loop takes the original input file (i.e., what is in the database already)
    //and hashes the entire file using the Col1 data as a key. On my system this takes
    //approximately 2 seconds for 850k+ rows with 6 columns.
    //Reading inside the loop condition stops cleanly at end of file, so no
    //empty key is stored after the last row.
    while(getline(fin1, key, '\t') && getline(fin1, mappedData, '\n'))
    {
        hashMap[key] = mappedData;         //store Col2-Col6 under the Col1 key
    }
    fin1.close();

    //output file headings
    fout1 << "Col1\t" << "Col2\t" << "Col3\t" << "Col4\t" << "Col5\t" << "Col6\n";
    fout2 << "Col1\t" << "Col2\t" << "Col3\t" << "Col4\t" << "Col5\t" << "Col6\n";

    //This loop reads the second input file line by line: first up to the first tab
    //delimiter, stored as "key", then up to the newline character, stored as
    //"mappedData", and then uses key to search the hash table.
    //If the key is not found in the hash table, the row is a new record and goes to
    //the upload output file. If it is found, the mappedData from the file is compared
    //with that of the hash table and, if different, the row goes to the update output
    //file. Reading inside the loop condition ends the loop at end of file.
    //YMMV on the time here depending on how many records are added or updated
    //(1000 records takes about another 5 seconds on my system)
    while(getline(fin2, key, '\t') && getline(fin2, mappedData, '\n'))
    {
        cursor = hashMap.find(key);         //look the key up in the hash table

        if(cursor != hashMap.end())         //key already exists in the database
        {
            if(cursor->second != mappedData) //Col2-Col6 changed: record to update
            {
                fout1 << key << "\t" << mappedData << "\n";
            }
        }
        else                                //key not found: new record to upload
        {
            fout2 << key << "\t" << mappedData << "\n";
        }
    }


    fin2.close();
    fout1.close();
    fout2.close();
    return 0;
}
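
The comparison itself can be tried without any files by running it over in-memory streams. The following is a minimal sketch under stated assumptions, not the exact program above: the function name `diffTables` and the `Diff` struct are hypothetical, `std::getline` is used as the loop condition so that reading stops at end of input, and `reserve` is called with a caller-supplied row estimate so the table does not have to rehash while it is being filled.

```cpp
#include <cassert>   // only needed when checking the sketch with assert
#include <cstddef>
#include <istream>
#include <sstream>   // std::istringstream, handy for trying the function out
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical result type: rows to insert and rows to update.
struct Diff {
    std::vector<std::string> added;    // key not present in the old data
    std::vector<std::string> changed;  // key present, other columns differ
};

// Compare two tab-delimited inputs keyed on their first column.
// expectedRows is an optional hint (an assumption of this sketch) used to
// pre-size the hash table.
Diff diffTables(std::istream& oldData, std::istream& newData,
                std::size_t expectedRows = 0)
{
    std::unordered_map<std::string, std::string> table;
    table.reserve(expectedRows);

    std::string key;
    std::string rest;

    // Load the old data: Col1 is the key, the remaining columns are the value.
    while (std::getline(oldData, key, '\t') && std::getline(oldData, rest, '\n'))
        table[key] = rest;

    // Classify every row of the new data against the table.
    Diff diff;
    while (std::getline(newData, key, '\t') && std::getline(newData, rest, '\n')) {
        const auto it = table.find(key);
        if (it == table.end())
            diff.added.push_back(key + '\t' + rest);
        else if (it->second != rest)
            diff.changed.push_back(key + '\t' + rest);
    }
    return diff;
}
```

For example, with old data `Col1\tCol2\nA\t1\nB\t2\n` and new data `Col1\tCol2\nA\t1\nB\t3\nC\t4\n` held in two `std::istringstream` objects, the function reports one changed row (`B\t3`) and one added row (`C\t4`). The identical header rows produce no output.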

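I'm working on a few things to turn this into an easy-to-use executable (for example, reading the XML structure inside the .xlsx zip archive directly, or using an ODBC connection), but for now I'm just testing it to make sure the output is correct. Of course, the output files then have to be loaded into the Access database using queries similar to those Parfait suggested. Also, I'm not sure whether Excel or Access VBA has a library for building hash tables, but if one would save time when accessing the Excel data, it might be worth exploring further. Any criticism or suggestions are welcome.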

Answer 3

A small change to your code will do the filtering for you:

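Dim i As Long
Dim strTest As String
' lastRow stands in for EOF in the pseudocode: the index of the last used row
For i = 1 To lastRow
    strTest = xlSheet.Cells(i, 1).Value
    ' Skip rows whose first column is empty; swap in your own condition here
    If Nz(strTest) <> "" Then
        sql = "Insert Into [Table] (Column1, Column2) VALUES ('" & _
              strTest & "', '" & _
              xlSheet.Cells(i, 2).Value & "');"
        DoCmd.RunSQL sql
    End If
Next i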

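I assume the RunSQL outside the loop is just an error in your pseudocode. This tests whether the cell in the first column is empty, but you can substitute whatever condition fits your situation.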

Importing a huge data set from Excel into Access via VBA
