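Reading a complete table from Excel with Open XML... faster

Time: 2015-04-08 20:17:48

Tags: c# excel openxml openxml-sdk

Warning: long post because of the examples and results

There are a few threads here on how to read rows of an Open XML spreadsheet that have empty cells between columns. I took some of my approach from this one: reading Excel Open XML is ignoring blank cells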

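I am able to read a table from xlsx just fine, but it is 10 times slower than reading from CSV, while the Open XML structure should (?) produce better results.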

Here is what I got from testing the code base:

foreach (Row r in sheetData.Descendants<Row>())
{
    sw.Start();

    //find a row marked as "header" and get list of columns that define width of table
    if (!headerRowFound)
    {
        headerRowFound = CheckOXMLHeaderRow(r, workbookPart, out headerReferences);
        if (!headerRowFound)
            continue;
    }

    rowKey++;
    //////////////////////////////////////////////////////////////////////
    ///////////////////here we are going to do work//////////////////////
    ////////////////////////////////////////////////////////////////////

    AddRow(rowKey, cols);
    sw.Stop();
    Debug.WriteLine("XLSX Row added in \t" + sw.ElapsedTicks.ToString() + "\tticks");
    sw.Reset();
}

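In my data, a row is 68 cells, of which only 5-10 are filled in.

0. For comparison, going through a CSV row takes about 300 ticks (lightning fast). 5,000 rows added in 3 ms.

1. The code above, doing nothing but walking the row enumerator, takes 1-4 ticks per row.

2. This code simply grabs all the cells in sequence and stores them in a row (the column order is messed up because of the nature of OXML, which omits empty cells)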

Hashtable cols = new Hashtable();
foreach (Cell c in r.Descendants<Cell>())
{
        colKey++;
        cols.Add(colKey, c); 
}
//this takes about 8-10 times longer - 10-30 ticks , still lightning fast

3. If we know where to look based on the column (header) name and the row number, we can do this

Hashtable cols = new Hashtable();
foreach (string column in headerReferences.Values)
{
       colKey++;
       cols.Add(colKey, GetCellValue(workbookPart, worksheetPart, column + r.RowIndex.ToString()));
}

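This is one of the MSDN examples, and it takes 500,000 ticks per row. It took several minutes to parse a 5,000-row spreadsheet. Unacceptable. It looks up every cell in the row, whether it exists or not.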
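A likely reason for that cost: every GetCellValue call runs Descendants<Cell>() over the whole worksheet, so each of the 68 lookups per row is a full scan of the sheet. A minimal sketch of the alternative, assuming it is acceptable to walk the sheet once and keep an index in memory. CellIndex and Build are hypothetical names, not part of the post or of the Open XML SDK.

```csharp
using System.Collections.Generic;
using DocumentFormat.OpenXml.Spreadsheet;

public static class CellIndex
{
    // Hypothetical helper: one pass over the sheet, keyed by cell
    // reference ("A1", "C7", ...), so later lookups are dictionary hits.
    public static Dictionary<string, Cell> Build(Worksheet worksheet)
    {
        var index = new Dictionary<string, Cell>();
        foreach (Cell c in worksheet.Descendants<Cell>())
        {
            // The r attribute is optional in the file format; skip cells without it.
            if (c.CellReference != null && c.CellReference.Value != null)
            {
                index[c.CellReference.Value] = c;
            }
        }
        return index;
    }
}
```

With such an index, a missing address simply fails TryGetValue, which corresponds to the empty-string case in GetCellValue. The trade-off is holding a reference to every cell in a dictionary for the lifetime of the load.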

4. I decided to scale back and try retrieving the values from all incoming cells into a Hashtable

Hashtable cols = new Hashtable();
foreach (Cell c in r.Descendants<Cell>())
{
        colKey++;
        cols.Add(colKey, GetValueFromCell(c, workbookPart));
}

Now it is 500-1,500 ticks per row. Lightning fast, if we just store the values without any order (not a solution yet).

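5. To make sure I preserve the column order, I create an empty clone of the header row for each new row, and after parsing the EXISTING cells I decide where to put them based on a Hashtable lookup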

Hashtable cols = (Hashtable)emptyNewRow.Clone();                        
foreach (Cell c in r.Descendants<Cell>())
{
    colKey = headerReferences[GetColumnName(c.CellReference)]; //what # column is this?
    cols[colKey] = GetValueFromCell(c, workbookPart); //put value in that column
}

The end result is 9,000-20,000 ticks per row. 30 seconds for a 5,000-row spreadsheet. Workable but not ideal.
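One per-cell cost worth ruling out is the shared string lookup. GetSharedStringItemById is not shown in the post, so this is an assumption: if it enumerates the shared string table on every call (for example with ElementAt), each string cell pays for a walk through the table. A minimal sketch that reads the table once into an array; SharedStringCache and Load are hypothetical names.

```csharp
using System.Linq;
using DocumentFormat.OpenXml.Spreadsheet;

public static class SharedStringCache
{
    // Hypothetical helper: materialize the shared string table once so a
    // cell's numeric id becomes a plain array index.
    public static string[] Load(SharedStringTable table)
    {
        return table.Elements<SharedStringItem>()
                    .Select(item => item.InnerText)
                    .ToArray();
    }
}
```

The array would be built before the row loop and indexed with the id parsed from the cell, in place of the per-cell item lookup.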

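This is where I stopped. Any ideas on how to make it faster? How do others manage to load large xlsx spreadsheets quickly, when the best I can do is 30 seconds for 5k rows?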

Switching to Dictionary did nothing for me, not even a 1% improvement. In any case, I need the results in Hashtables for a legacy retrofit.

Appendix: referenced methods

public static string GetColumnName(string cellReference)
        {
            // Match the column name portion of the cell name.
            Regex regex = new Regex("[A-Za-z]+");
            Match match = regex.Match(cellReference);

            return match.Value;
        }
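GetColumnName builds a new Regex for every cell, which adds up across 5,000 rows. A minimal sketch of a regex-free variant, plus a conversion from column letters to a zero-based position that could replace a lookup by column name. It assumes references start with the column letters (as in "AB12"), unlike the regex, which finds the first run of letters anywhere in the string. ColumnRef and ToZeroBasedIndex are hypothetical names.

```csharp
public static class ColumnRef
{
    // Returns the leading ASCII letters of a reference, e.g. "AB12" -> "AB".
    public static string GetColumnName(string cellReference)
    {
        int i = 0;
        while (i < cellReference.Length && IsAsciiLetter(cellReference[i]))
        {
            i++;
        }
        return cellReference.Substring(0, i);
    }

    // Converts column letters to a zero-based position: "A" -> 0, "Z" -> 25, "AA" -> 26.
    // Column names are base 26 without a zero digit, hence the +1 per letter and the final -1.
    public static int ToZeroBasedIndex(string columnName)
    {
        int n = 0;
        foreach (char ch in columnName)
        {
            n = n * 26 + (char.ToUpperInvariant(ch) - 'A' + 1);
        }
        return n - 1;
    }

    private static bool IsAsciiLetter(char ch)
    {
        return (ch >= 'A' && ch <= 'Z') || (ch >= 'a' && ch <= 'z');
    }
}
```

If the header row starts in column A, the zero-based position can index straight into a preallocated array of 68 slots, which keeps the column order without a per-cell hash lookup.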

public static string GetValueFromCell(Cell cell, WorkbookPart workbookPart)
        {
            int id;
            string cellValue = cell.InnerText;

            if (cellValue.Trim().Length > 0)
            {
                if (cell.DataType != null)
                {
                    switch (cell.DataType.Value)
                    {
                        case CellValues.SharedString:

                            Int32.TryParse(cellValue, out id);
                            SharedStringItem item = GetSharedStringItemById(workbookPart, id);
                            if (item.Text != null)
                            {
                                cellValue = item.Text.Text;
                            }
                            else if (item.InnerText != null)
                            {
                                cellValue = item.InnerText;
                            }
                            else if (item.InnerXml != null)
                            {
                                cellValue = item.InnerXml;
                            }
                            break;

                        case CellValues.Boolean:
                            switch (cellValue)
                            {
                                case "0":
                                    cellValue = "FALSE";
                                    break;
                                default:
                                    cellValue = "TRUE";
                                    break;
                            }
                            break;
                    }
                }

                else
                {
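                    int excelDate;
                    // StyleIndex is optional; a cell without one has no date format, so skip it
                    // instead of dereferencing null.
                    if (cell.StyleIndex != null && Int32.TryParse(cellValue, out excelDate))
                    {

                        var styleIndex = (int)cell.StyleIndex.Value;

                        var cellFormats = workbookPart.WorkbookStylesPart.Stylesheet.CellFormats;
                        var numberingFormats = workbookPart.WorkbookStylesPart.Stylesheet.NumberingFormats;
                        var cellFormat = (CellFormat)cellFormats.ElementAt(styleIndex);

                        // NumberingFormats is absent when the workbook defines no custom formats.
                        if (cellFormat.NumberFormatId != null && numberingFormats != null)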
                        {

                            var numberFormatId = cellFormat.NumberFormatId.Value;
                            var numberingFormat = numberingFormats.Cast<NumberingFormat>().SingleOrDefault(f => f.NumberFormatId.Value == numberFormatId);

                            if (numberingFormat != null && numberingFormat.FormatCode.Value.Contains("/yy")) //TODO here i should think of locales
                            {
                                DateTime dt = DateTime.FromOADate(excelDate);
                                cellValue = dt.ToString("MM/dd/yyyy");
                            }
                        }
                    }
                }
            }
            return cellValue;
        }
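For every integer-looking cell, GetValueFromCell walks the style tables again (ElementAt on the cell formats, then a scan of the numbering formats). Since the stylesheet does not change while reading, the answer can be computed once per workbook. A minimal sketch, assuming the same "/yy" test as above is what identifies a date format; DateStyles and FindDateStyleIndexes are hypothetical names.

```csharp
using System.Collections.Generic;
using DocumentFormat.OpenXml.Spreadsheet;

public static class DateStyles
{
    // Hypothetical helper: returns the set of cell style indexes whose
    // custom number format code contains "/yy".
    public static HashSet<uint> FindDateStyleIndexes(Stylesheet stylesheet)
    {
        // Step 1: ids of the custom number formats that look like dates.
        var dateFormatIds = new HashSet<uint>();
        if (stylesheet.NumberingFormats != null)
        {
            foreach (var nf in stylesheet.NumberingFormats.Elements<NumberingFormat>())
            {
                if (nf.NumberFormatId != null && nf.FormatCode != null
                    && nf.FormatCode.Value.Contains("/yy"))
                {
                    dateFormatIds.Add(nf.NumberFormatId.Value);
                }
            }
        }

        // Step 2: positions in CellFormats (what a cell's StyleIndex points at)
        // that use one of those formats.
        var result = new HashSet<uint>();
        if (stylesheet.CellFormats == null)
        {
            return result;
        }
        uint index = 0;
        foreach (var cf in stylesheet.CellFormats.Elements<CellFormat>())
        {
            if (cf.NumberFormatId != null && dateFormatIds.Contains(cf.NumberFormatId.Value))
            {
                result.Add(index);
            }
            index++;
        }
        return result;
    }
}
```

Inside the row loop the date check then reduces to a null test on StyleIndex and one Contains call on the set. Like the original, this ignores built-in date formats, which have no entry in the numbering formats list.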

public static string GetCellValue(WorkbookPart wbPart, WorksheetPart wsPart, string addressName)
        {
            string value = String.Empty; //code from microsoft prefers null, but null is tough to work with

            // Use its Worksheet property to get a reference to the cell 
            // whose address matches the address you supplied.
            Cell theCell = wsPart.Worksheet.Descendants<Cell>().
              Where(c => c.CellReference == addressName).FirstOrDefault();

            // If the cell does not exist, return an empty string.
            if (theCell != null)
            {
                value = theCell.InnerText;

                // If the cell represents an integer number, you are done. 
                // For dates, this code returns the serialized value that 
                // represents the date. The code handles strings and 
                // Booleans individually. For shared strings, the code 
                // looks up the corresponding value in the shared string 
                // table. For Booleans, the code converts the value into 
                // the words TRUE or FALSE.
                if (theCell.DataType != null)
                {
                    switch (theCell.DataType.Value)
                    {
                        case CellValues.SharedString:

                            // For shared strings, look up the value in the shared strings table.
                            var stringTable = wbPart.GetPartsOfType<SharedStringTablePart>().FirstOrDefault();

                            // If the shared string table is missing, something is wrong. Return the index that is in the cell. 
                            //Otherwise, look up the correct text in the table.
                            if (stringTable != null)
                            {
                                value = stringTable.SharedStringTable.ElementAt(int.Parse(value)).InnerText;
                            }
                            break;

                        case CellValues.Boolean:
                            switch (value)
                            {
                                case "0":
                                    value = "FALSE";
                                    break;
                                default:
                                    value = "TRUE";
                                    break;
                            }
                            break;
                    }
                }
            }
            return value;
        }

0 answers:

No answers