我有很多包含数据的excel文件,它包含空行和空列。 如下所示
我正在尝试使用interop从excel中删除Empty行和列。 我创建一个简单的winform应用程序并使用以下代码,它工作正常。
Dim lstFiles As New List(Of String)
lstFiles.AddRange(IO.Directory.GetFiles(m_strFolderPath, "*.xls", IO.SearchOption.AllDirectories))
Dim m_XlApp = New Excel.Application
Dim m_xlWrkbs As Excel.Workbooks = m_XlApp.Workbooks
Dim m_xlWrkb As Excel.Workbook
For Each strFile As String In lstFiles
m_xlWrkb = m_xlWrkbs.Open(strFile)
Dim m_XlWrkSheet As Excel.Worksheet = m_xlWrkb.Worksheets(1)
Dim intRow As Integer = 1
While intRow <= m_XlWrkSheet.UsedRange.Rows.Count
If m_XlApp.WorksheetFunction.CountA(m_XlWrkSheet.Cells(intRow, 1).EntireRow) = 0 Then
m_XlWrkSheet.Cells(intRow, 1).EntireRow.Delete(Excel.XlDeleteShiftDirection.xlShiftUp)
Else
intRow += 1
End If
End While
Dim intCol As Integer = 1
While intCol <= m_XlWrkSheet.UsedRange.Columns.Count
If m_XlApp.WorksheetFunction.CountA(m_XlWrkSheet.Cells(1, intCol).EntireColumn) = 0 Then
m_XlWrkSheet.Cells(1, intCol).EntireColumn.Delete(Excel.XlDeleteShiftDirection.xlShiftToLeft)
Else
intCol += 1
End If
End While
Next
m_xlWrkb.Save()
m_xlWrkb.Close(SaveChanges:=True)
Marshal.ReleaseComObject(m_xlWrkb)
Marshal.ReleaseComObject(m_xlWrkbs)
m_XlApp.Quit()
Marshal.ReleaseComObject(m_XlApp)
但是在清理大型Excel文件时需要花费很多时间。 有关优化此代码的任何建议吗?或者其他方式更快地清理这个excel文件?是否有一个函数可以一次删除空行?
如果答案使用C#
,我没有问题修改
我上传了一个示例文件Sample File。但并非所有文件都具有相同的结构。
答案 0 :(得分:18)
我发现如果工作表很大,循环遍历excel工作表可能需要一些时间。所以我的解决方案试图避免工作表中的任何循环。为了避免在工作表中循环,我从usedRange
返回的单元格中创建了一个二维对象数组:
Excel.Range targetCells = worksheet.UsedRange;
object[,] allValues = (object[,])targetCells.Cells.Value;
这是我循环访问的数组,以获取空行和列的索引。我制作2个int列表,一个保留行索引以删除另一个保持列索引删除。
List<int> emptyRows = GetEmptyRows(allValues, totalRows, totalCols);
List<int> emptyCols = GetEmptyCols(allValues, totalRows, totalCols);
这些列表将从高到低排序,以简化从下到上删除行并从右到左删除列的过程。然后只需循环遍历每个列表并删除相应的行/列。
DeleteRows(emptyRows, worksheet);
DeleteCols(emptyCols, worksheet);
最后删除所有空行和列后,我将文件另存为新文件名。
希望这有帮助。
修改强>
解决了UsedRange问题,如果工作表顶部有空行,那么现在将删除这些行。此外,这将删除起始数据左侧的所有空列。即使在数据启动之前存在空行或列,这也允许索引正常工作。 这是通过获取UsedRange中第一个单元格的地址来完成的,这将是“$ A $ 1:$ D $ 4”形式的地址。如果要保留并且不删除左侧的空行和左侧的空列,这将允许使用偏移量。在这种情况下,我只是删除它们。要从顶部删除要删除的行数,可以通过第一个“$ A $ 4”地址计算,其中“4”是第一个数据出现的行。所以我们需要删除前3行。列地址的格式为“A”,“AB”或甚至“AAD”,这需要一些翻译,并且感谢How to convert a column number (eg. 127) into an excel column (eg. AA)我能够确定左侧需要删除多少列。
class Program {
static void Main(string[] args) {
Excel.Application excel = new Excel.Application();
string originalPath = @"H:\ExcelTestFolder\Book1_Test.xls";
Excel.Workbook workbook = excel.Workbooks.Open(originalPath);
Excel.Worksheet worksheet = workbook.Worksheets["Sheet1"];
Excel.Range usedRange = worksheet.UsedRange;
RemoveEmptyTopRowsAndLeftCols(worksheet, usedRange);
DeleteEmptyRowsCols(worksheet);
string newPath = @"H:\ExcelTestFolder\Book1_Test_Removed.xls";
workbook.SaveAs(newPath, Excel.XlSaveAsAccessMode.xlNoChange);
workbook.Close();
excel.Quit();
System.Runtime.InteropServices.Marshal.ReleaseComObject(workbook);
System.Runtime.InteropServices.Marshal.ReleaseComObject(excel);
Console.WriteLine("Finished removing empty rows and columns - Press any key to exit");
Console.ReadKey();
}
private static void DeleteEmptyRowsCols(Excel.Worksheet worksheet) {
Excel.Range targetCells = worksheet.UsedRange;
object[,] allValues = (object[,])targetCells.Cells.Value;
int totalRows = targetCells.Rows.Count;
int totalCols = targetCells.Columns.Count;
List<int> emptyRows = GetEmptyRows(allValues, totalRows, totalCols);
List<int> emptyCols = GetEmptyCols(allValues, totalRows, totalCols);
// now we have a list of the empty rows and columns we need to delete
DeleteRows(emptyRows, worksheet);
DeleteCols(emptyCols, worksheet);
}
private static void DeleteRows(List<int> rowsToDelete, Excel.Worksheet worksheet) {
// the rows are sorted high to low - so index's wont shift
foreach (int rowIndex in rowsToDelete) {
worksheet.Rows[rowIndex].Delete();
}
}
private static void DeleteCols(List<int> colsToDelete, Excel.Worksheet worksheet) {
// the cols are sorted high to low - so index's wont shift
foreach (int colIndex in colsToDelete) {
worksheet.Columns[colIndex].Delete();
}
}
private static List<int> GetEmptyRows(object[,] allValues, int totalRows, int totalCols) {
List<int> emptyRows = new List<int>();
for (int i = 1; i < totalRows; i++) {
if (IsRowEmpty(allValues, i, totalCols)) {
emptyRows.Add(i);
}
}
// sort the list from high to low
return emptyRows.OrderByDescending(x => x).ToList();
}
private static List<int> GetEmptyCols(object[,] allValues, int totalRows, int totalCols) {
List<int> emptyCols = new List<int>();
for (int i = 1; i < totalCols; i++) {
if (IsColumnEmpty(allValues, i, totalRows)) {
emptyCols.Add(i);
}
}
// sort the list from high to low
return emptyCols.OrderByDescending(x => x).ToList();
}
private static bool IsColumnEmpty(object[,] allValues, int colIndex, int totalRows) {
for (int i = 1; i < totalRows; i++) {
if (allValues[i, colIndex] != null) {
return false;
}
}
return true;
}
private static bool IsRowEmpty(object[,] allValues, int rowIndex, int totalCols) {
for (int i = 1; i < totalCols; i++) {
if (allValues[rowIndex, i] != null) {
return false;
}
}
return true;
}
private static void RemoveEmptyTopRowsAndLeftCols(Excel.Worksheet worksheet, Excel.Range usedRange) {
string addressString = usedRange.Address.ToString();
int rowsToDelete = GetNumberOfTopRowsToDelete(addressString);
DeleteTopEmptyRows(worksheet, rowsToDelete);
int colsToDelete = GetNumberOfLeftColsToDelte(addressString);
DeleteLeftEmptyColumns(worksheet, colsToDelete);
}
private static void DeleteTopEmptyRows(Excel.Worksheet worksheet, int startRow) {
for (int i = 0; i < startRow - 1; i++) {
worksheet.Rows[1].Delete();
}
}
private static void DeleteLeftEmptyColumns(Excel.Worksheet worksheet, int colCount) {
for (int i = 0; i < colCount - 1; i++) {
worksheet.Columns[1].Delete();
}
}
private static int GetNumberOfTopRowsToDelete(string address) {
string[] splitArray = address.Split(':');
string firstIndex = splitArray[0];
splitArray = firstIndex.Split('$');
string value = splitArray[2];
int returnValue = -1;
if ((int.TryParse(value, out returnValue)) && (returnValue >= 0))
return returnValue;
return returnValue;
}
private static int GetNumberOfLeftColsToDelte(string address) {
string[] splitArray = address.Split(':');
string firstindex = splitArray[0];
splitArray = firstindex.Split('$');
string value = splitArray[1];
return ParseColHeaderToIndex(value);
}
private static int ParseColHeaderToIndex(string colAdress) {
int[] digits = new int[colAdress.Length];
for (int i = 0; i < colAdress.Length; ++i) {
digits[i] = Convert.ToInt32(colAdress[i]) - 64;
}
int mul = 1; int res = 0;
for (int pos = digits.Length - 1; pos >= 0; --pos) {
res += digits[pos] * mul;
mul *= 26;
}
return res;
}
}
编辑2:为了进行测试,我创建了一个循环通过工作表的方法,并将其与通过对象数组循环的代码进行比较。它显示出显着的差异。
循环通过工作表并删除空行和列的方法。
enum RowOrCol { Row, Column };
private static void ConventionalRemoveEmptyRowsCols(Excel.Worksheet worksheet) {
Excel.Range usedRange = worksheet.UsedRange;
int totalRows = usedRange.Rows.Count;
int totalCols = usedRange.Columns.Count;
RemoveEmpty(usedRange, RowOrCol.Row);
RemoveEmpty(usedRange, RowOrCol.Column);
}
private static void RemoveEmpty(Excel.Range usedRange, RowOrCol rowOrCol) {
int count;
Excel.Range curRange;
if (rowOrCol == RowOrCol.Column)
count = usedRange.Columns.Count;
else
count = usedRange.Rows.Count;
for (int i = count; i > 0; i--) {
bool isEmpty = true;
if (rowOrCol == RowOrCol.Column)
curRange = usedRange.Columns[i];
else
curRange = usedRange.Rows[i];
foreach (Excel.Range cell in curRange.Cells) {
if (cell.Value != null) {
isEmpty = false;
break; // we can exit this loop since the range is not empty
}
else {
// Cell value is null contiue checking
}
} // end loop thru each cell in this range (row or column)
if (isEmpty) {
curRange.Delete();
}
}
}
然后是Main用于测试/计时两种方法。
enum RowOrCol { Row, Column };
static void Main(string[] args)
{
Excel.Application excel = new Excel.Application();
string originalPath = @"H:\ExcelTestFolder\Book1_Test.xls";
Excel.Workbook workbook = excel.Workbooks.Open(originalPath);
Excel.Worksheet worksheet = workbook.Worksheets["Sheet1"];
Excel.Range usedRange = worksheet.UsedRange;
// Start test for looping thru each excel worksheet
Stopwatch sw = new Stopwatch();
Console.WriteLine("Start stopwatch to loop thru WORKSHEET...");
sw.Start();
ConventionalRemoveEmptyRowsCols(worksheet);
sw.Stop();
Console.WriteLine("It took a total of: " + sw.Elapsed.Milliseconds + " Miliseconds to remove empty rows and columns...");
string newPath = @"H:\ExcelTestFolder\Book1_Test_RemovedLoopThruWorksheet.xls";
workbook.SaveAs(newPath, Excel.XlSaveAsAccessMode.xlNoChange);
workbook.Close();
Console.WriteLine("");
// Start test for looping thru object array
workbook = excel.Workbooks.Open(originalPath);
worksheet = workbook.Worksheets["Sheet1"];
usedRange = worksheet.UsedRange;
Console.WriteLine("Start stopwatch to loop thru object array...");
sw = new Stopwatch();
sw.Start();
DeleteEmptyRowsCols(worksheet);
sw.Stop();
// display results from second test
Console.WriteLine("It took a total of: " + sw.Elapsed.Milliseconds + " Miliseconds to remove empty rows and columns...");
string newPath2 = @"H:\ExcelTestFolder\Book1_Test_RemovedLoopThruArray.xls";
workbook.SaveAs(newPath2, Excel.XlSaveAsAccessMode.xlNoChange);
workbook.Close();
excel.Quit();
System.Runtime.InteropServices.Marshal.ReleaseComObject(workbook);
System.Runtime.InteropServices.Marshal.ReleaseComObject(excel);
Console.WriteLine("");
Console.WriteLine("Finished testing methods - Press any key to exit");
Console.ReadKey();
}
编辑3 根据OP请求... 我更新并更改了代码以匹配OP代码。有了这个我发现了一些有趣的结果。见下文。
我更改了代码以匹配您正在使用的函数,即... EntireRow和CountA。下面的代码我发现它的表现非常糟糕。运行一些测试我发现下面的代码执行时间超过800毫秒。然而,一个微妙的变化带来了巨大的变化。
在线:
while (rowIndex <= worksheet.UsedRange.Rows.Count)
这会减慢很多事情。如果你为UsedRang创建一个范围变量而不是保持regrabbibg它与while循环的每次迭代将产生巨大的差异。所以...当我将while循环更改为...时
Excel.Range usedRange = worksheet.UsedRange;
int rowIndex = 1;
while (rowIndex <= usedRange.Rows.Count)
and
while (colIndex <= usedRange.Columns.Count)
这与我的对象阵列解决方案非常接近。我没有发布结果,因为您可以使用下面的代码并更改while循环以在每次迭代时获取UsedRange或使用变量usedRange来测试它。
private static void RemoveEmptyRowsCols3(Excel.Worksheet worksheet) {
//Excel.Range usedRange = worksheet.UsedRange; // <- using this variable makes the while loop much faster
int rowIndex = 1;
// delete empty rows
//while (rowIndex <= usedRange.Rows.Count) // <- changing this one line makes a huge difference - not grabbibg the UsedRange with each iteration...
while (rowIndex <= worksheet.UsedRange.Rows.Count) {
if (excel.WorksheetFunction.CountA(worksheet.Cells[rowIndex, 1].EntireRow) == 0) {
worksheet.Cells[rowIndex, 1].EntireRow.Delete(Excel.XlDeleteShiftDirection.xlShiftUp);
}
else {
rowIndex++;
}
}
// delete empty columns
int colIndex = 1;
// while (colIndex <= usedRange.Columns.Count) // <- change here also
while (colIndex <= worksheet.UsedRange.Columns.Count) {
if (excel.WorksheetFunction.CountA(worksheet.Cells[1, colIndex].EntireColumn) == 0) {
worksheet.Cells[1, colIndex].EntireColumn.Delete(Excel.XlDeleteShiftDirection.xlShiftToLeft);
}
else {
colIndex++;
}
}
}
按@Hadi更新
如果excel在上次使用后包含额外的空白行和列,则可以更改DeleteCols
和DeleteRows
函数以获得更好的性能:
private static void DeleteRows(List<int> rowsToDelete, Microsoft.Office.Interop.Excel.Worksheet worksheet)
{
// the rows are sorted high to low - so index's wont shift
List<int> NonEmptyRows = Enumerable.Range(1, rowsToDelete.Max()).ToList().Except(rowsToDelete).ToList();
if (NonEmptyRows.Max() < rowsToDelete.Max())
{
// there are empty rows after the last non empty row
Microsoft.Office.Interop.Excel.Range cell1 = worksheet.Cells[NonEmptyRows.Max() + 1,1];
Microsoft.Office.Interop.Excel.Range cell2 = worksheet.Cells[rowsToDelete.Max(), 1];
//Delete all empty rows after the last used row
worksheet.Range[cell1, cell2].EntireRow.Delete(Microsoft.Office.Interop.Excel.XlDeleteShiftDirection.xlShiftUp);
} //else last non empty row = worksheet.Rows.Count
foreach (int rowIndex in rowsToDelete.Where(x => x < NonEmptyRows.Max()))
{
worksheet.Rows[rowIndex].Delete();
}
}
private static void DeleteCols(List<int> colsToDelete, Microsoft.Office.Interop.Excel.Worksheet worksheet)
{
// the cols are sorted high to low - so index's wont shift
//Get non Empty Cols
List<int> NonEmptyCols = Enumerable.Range(1, colsToDelete.Max()).ToList().Except(colsToDelete).ToList();
if (NonEmptyCols.Max() < colsToDelete.Max())
{
// there are empty rows after the last non empty row
Microsoft.Office.Interop.Excel.Range cell1 = worksheet.Cells[1,NonEmptyCols.Max() + 1];
Microsoft.Office.Interop.Excel.Range cell2 = worksheet.Cells[1,NonEmptyCols.Max()];
//Delete all empty rows after the last used row
worksheet.Range[cell1, cell2].EntireColumn.Delete(Microsoft.Office.Interop.Excel.XlDeleteShiftDirection.xlShiftToLeft);
} //else last non empty column = worksheet.Columns.Count
foreach (int colIndex in colsToDelete.Where(x => x < NonEmptyCols.Max()))
{
worksheet.Columns[colIndex].Delete();
}
}
在Get Last non empty column and row index from excel using Interop
检查我的回答答案 1 :(得分:5)
也许需要考虑的事情:
Sub usedRangeDeleteRowsCols()
Dim LastRow, LastCol, i As Long
LastRow = Cells.Find(What:="*", SearchDirection:=xlPrevious, SearchOrder:=xlByRows).Row
LastCol = Cells.Find(What:="*", SearchDirection:=xlPrevious, SearchOrder:=xlByColumns).Column
For i = LastRow To 1 Step -1
If WorksheetFunction.CountA(Range(Cells(i, 1), Cells(i, LastCol))) = 0 Then
Cells(i, 1).EntireRow.Delete
End If
Next
For i = LastCol To 1 Step -1
If WorksheetFunction.CountA(Range(Cells(1, i), Cells(LastRow, i))) = 0 Then
Cells(1, i).EntireColumn.Delete
End If
Next
End Sub
我认为与原始代码中的等效函数相比,有两种效率。首先,我们不是使用Excel的不可靠的UsedRange属性,而是找到最后一个值,只扫描真实使用范围内的行和列。
其次,工作表计数功能仅在真正使用的范围内起作用 - 例如,当搜索空白行时,我们只查看已使用列的范围(而不是.EntireRow
)。
For
循环向后工作,例如,每次删除行时,后续数据的行地址都会发生变化。向后工作意味着“要处理的数据”的行地址不会改变。
答案 2 :(得分:2)
在我看来,最耗时的部分可能是枚举和查找空行和列。
修改强>
怎么样:
m_XlWrkSheet.Columns("A:A").SpecialCells(xlCellTypeBlanks).EntireRow.Delete
m_XlWrkSheet.Rows("1:1").SpecialCells(xlCellTypeBlanks).EntireColumn.Delete
测试样本数据结果看起来不错,性能更好(从VBA测试但差异很大)。
<强>更新强>
使用14k行(由样本数据制作)原始代码~30秒对样本Excel进行测试,此版本<1s
答案 3 :(得分:2)
我所知道的最简单的方法是隐藏非空白单元格并删除可见的单元格:
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h> //compile and link with -pthread
#include <semaphore.h>
#define BUFFER_SIZE 10
int buffer[BUFFER_SIZE];
int in, out;
int num;
semaphore s = 1, n =0, e = BUFFER_SIZE;
void *produce( void *ptr ) {
int item;
while(1){
usleep(rand() % 1000); //sleep
printf("Producer wants to produce.\n");
semWait(e);
semWait(s);
//critical section below
item = rand() % 10;
num++;
buffer[in] = item;
in = (in + 1) % BUFFER_SIZE;
//critical section above
semSignal(s);
semSignal(n);
printf("Producer entered %d. Buffer Size = %d\n", item, num);
}
}
void *consume( void *ptr ) {
int item;
while(1){
usleep(rand() % 1000); //sleep
printf("Consumer wants to consume.\n");
semWait(n);
semWait(s);
//critical section below
num--;
int item = buffer[out];
out = (out + 1) % BUFFER_SIZE;
//critical section above
semSignal(s);
semSignal(e);
printf("Consumer consumed %d. Buffer Size = %d\n", item, num);
}
}
int main() {
pthread_t consumer, producer;
out = 0; //index of the item to be consumed next
in = 0; //index of the item to be produced next
num = 0; //number of items in the buffer
srand(time(NULL));
pthread_create(&consumer, NULL, consume, NULL);
pthread_create(&producer, NULL, produce, NULL);
pthread_join(producer, NULL);
pthread_join(consumer, NULL);
pthread_exit(NULL);
}
更快的方法是不删除任何东西,而是移动(剪切+粘贴)非空白区域。
最快的Interop方式(没有打开文件的方法更复杂)是获取数组中的所有值,移动数组中的值,然后将值放回:
view(response)
答案 4 :(得分:0)
您可以打开与工作表的ADO连接,获取字段列表,发出仅包含已知字段的SQL语句,还可以排除已知字段中没有值的记录。