警告:由于示例和结果而导致的长篇
这里有一些关于如何在列之间读取包含空单元格的Open XML电子表格行的线程。我从这里得到了一些答案reading Excel Open XML is ignoring blank cells
我能够很好地从xlsx读取一个表,但它比从CSV读取的速度快10倍,而开放的XML结构应该(?)产生更好的结果。
这是我测试代码库所得到的:
foreach (Row r in sheetData.Descendants<Row>())
{
sw.Start();
//find a row marked as "header" and get list of columns that define width of table
if (!headerRowFound)
{
headerRowFound = CheckOXMLHeaderRow(r, workbookPart, out headerReferences);
if (!headerRowFound)
continue;
}
rowKey++;
//////////////////////////////////////////////////////////////////////
///////////////////here we are going to do work//////////////////////
////////////////////////////////////////////////////////////////////
AddRow(rowKey, cols);
sw.Stop();
Debug.WriteLine("XLSX Row added in \t" + sw.ElapsedTicks.ToString() + "\tticks");
sw.Reset();
}
在我的数据中,一行是68个单元格,其中只有5-10个填写
0。为了进行比较,通过CSV行需要大约300个滴答(闪电般快速)。 5000行添加3ms
1。代码仅通过1-4个刻度处理行枚举器
2. 这段代码只是依次抓取所有单元格并将它们存储在一行中(由于OXML性质,列顺序搞砸了)
Hashtable cols = new Hashtable();
foreach (Cell c in r.Descendants<Cell>())
{
colKey++;
cols.Add(colKey, c);
}
//this takes about 8-10 times longer - 10-30 ticks , still lightning fast
3. 如果我们根据列(标题)名称和行号知道在哪里查找,我们可以这样做
Hashtable cols = new Hashtable();
foreach (string column in headerReferences.Values)
{
colKey++;
cols.Add(colKey, GetCellValue(workbookPart, worksheetPart, column + r.RowIndex.ToString()));
}
这是MSDN示例之一,它每行有500,000个滴答声。花了几分钟来解析5000行电子表格。不能接受的。 以下是连续的每个单元格,无论是否存在
4. 我决定缩减并尝试从所有传入的单元格中检索值到HashTable
Hashtable cols = new Hashtable();
foreach (Cell c in r.Descendants<Cell>())
{
colKey++;
cols.Add(colKey, GetValueFromCell(c, workbookPart));
}
现在每行500-1,500个刻度。如果我们只是存储没有任何顺序的值(还没有解决方案),那么闪电般快速。
5. 为了确保我保留列的顺序,我为每个新行创建标题行的空克隆,在我解析EXISTING单元格后,我根据Hashtable决定将它们放在何处检索
Hashtable cols = (Hashtable)emptyNewRow.Clone();
foreach (Cell c in r.Descendants<Cell>())
{
colKey = headerReferences[GetColumnName(c.CellReference)]; //what # column is this?
cols[colKey] = GetValueFromCell(c, workbookPart); //put value in that column
}
最终结果是每行9,000-20,000个刻度。 5,000个电子表格30秒。可行但不理想。
这是我停下来的地方。任何想法如何让它更快?如何能够快速加载大量的xlsx电子表格,以及我能做到的最好是5k行30秒?
字典对我没有任何作用,甚至没有1%的改进。无论如何,我需要在Hashtables中获得遗留改造的结果
附录:参考方法
public static string GetColumnName(string cellReference)
{
// Match the column name portion of the cell name.
Regex regex = new Regex("[A-Za-z]+");
Match match = regex.Match(cellReference);
return match.Value;
}
public static string GetValueFromCell(Cell cell, WorkbookPart workbookPart)
{
int id;
string cellValue = cell.InnerText;
if (cellValue.Trim().Length > 0)
{
if (cell.DataType != null)
{
switch (cell.DataType.Value)
{
case CellValues.SharedString:
Int32.TryParse(cellValue, out id);
SharedStringItem item = GetSharedStringItemById(workbookPart, id);
if (item.Text != null)
{
cellValue = item.Text.Text;
}
else if (item.InnerText != null)
{
cellValue = item.InnerText;
}
else if (item.InnerXml != null)
{
cellValue = item.InnerXml;
}
break;
case CellValues.Boolean:
switch (cellValue)
{
case "0":
cellValue = "FALSE";
break;
default:
cellValue = "TRUE";
break;
}
break;
}
}
else
{
int excelDate;
if (Int32.TryParse(cellValue, out excelDate))
{
var styleIndex = (int)cell.StyleIndex.Value;
var cellFormats = workbookPart.WorkbookStylesPart.Stylesheet.CellFormats;
var numberingFormats = workbookPart.WorkbookStylesPart.Stylesheet.NumberingFormats;
var cellFormat = (CellFormat)cellFormats.ElementAt(styleIndex);
if (cellFormat.NumberFormatId != null)
{
var numberFormatId = cellFormat.NumberFormatId.Value;
var numberingFormat = numberingFormats.Cast<NumberingFormat>().SingleOrDefault(f => f.NumberFormatId.Value == numberFormatId);
if (numberingFormat != null && numberingFormat.FormatCode.Value.Contains("/yy")) //TODO here i should think of locales
{
DateTime dt = DateTime.FromOADate(excelDate);
cellValue = dt.ToString("MM/dd/yyyy");
}
}
}
}
}
return cellValue;
}
public static string GetCellValue(WorkbookPart wbPart, WorksheetPart wsPart, string addressName)
{
string value = String.Empty; //code from microsoft prefers null, but null is tough to work with
// Use its Worksheet property to get a reference to the cell
// whose address matches the address you supplied.
Cell theCell = wsPart.Worksheet.Descendants<Cell>().
Where(c => c.CellReference == addressName).FirstOrDefault();
// If the cell does not exist, return an empty string.
if (theCell != null)
{
value = theCell.InnerText;
// If the cell represents an integer number, you are done.
// For dates, this code returns the serialized value that
// represents the date. The code handles strings and
// Booleans individually. For shared strings, the code
// looks up the corresponding value in the shared string
// table. For Booleans, the code converts the value into
// the words TRUE or FALSE.
if (theCell.DataType != null)
{
switch (theCell.DataType.Value)
{
case CellValues.SharedString:
// For shared strings, look up the value in the shared strings table.
var stringTable = wbPart.GetPartsOfType<SharedStringTablePart>().FirstOrDefault();
// If the shared string table is missing, something is wrong. Return the index that is in the cell.
//Otherwise, look up the correct text in the table.
if (stringTable != null)
{
value = stringTable.SharedStringTable.ElementAt(int.Parse(value)).InnerText;
}
break;
case CellValues.Boolean:
switch (value)
{
case "0":
value = "FALSE";
break;
default:
value = "TRUE";
break;
}
break;
}
}
}
return value;
}