Importing multiple HTML files into Excel as separate worksheets

Time: 2018-08-15 14:15:35

Tags: python html excel python-3.x win32com

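I have a lot of HTML files that I need to open in, or import into, a single Excel workbook and then simply save the workbook. Each HTML file should end up on its own worksheet inside that workbook.

My existing code does not work: it crashes on the workbook.Open(html) line, and would most likely crash on the lines after it as well. Searching the web on this topic turned up nothing.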

import win32com.client as win32
import pathlib as path


def save_html_files_to_worksheets(read_directory):
    read_path = path.Path(read_directory)
    save_path = read_path.joinpath('Single_Workbook_Containing_HTML_Files.xlsx')

    excel_app = win32.gencache.EnsureDispatch('Excel.Application')
    workbook = excel_app.Workbooks.Add()  # create a new excel workbook

    indx = 1  # used to add new worksheets dependent on number of html files
    for html in read_path.glob('*.html'):  # loop through directory getting html files
        workbook.Open(html)  # open the html in the newly created workbook - this doesn't work though
        worksheet = workbook.Worksheets(indx)  # each iteration in loop add new worksheet
        worksheet.Name = 'Test' + str(indx)  # name added worksheets
        indx += 1
    workbook.SaveAs(str(save_path), 51)  # win32com requires string like path, 51 is xlsx extension
    excel_app.Application.Quit()


save_html_files_to_worksheets(r'C:\Users\<UserName>\Desktop\HTML_FOLDER')

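In case it helps, the following code does work, but it converts each HTML file into a separate Excel file. What I need is a single Excel file in which each HTML file sits on its own worksheet.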

import win32com.client as win32
import pathlib as path

def save_as_xlsx(read_directory):
    read_path = path.Path(read_directory)
    excel_app = win32.gencache.EnsureDispatch('Excel.Application')

    for html in read_path.glob('*.html'):
        save_path = read_path.joinpath(html.stem + '.xlsx')
        wb = excel_app.Workbooks.Open(html)
        wb.SaveAs(str(save_path), 51)
    excel_app.Application.Quit()


save_as_xlsx(r'C:\Users\<UserName>\Desktop\HTML_FOLDER')

Here is a link to a sample HTML file you can use; the data in the file is not real: HTML Download Link

3 Answers:

Answer 0: (score: 2)

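One solution is to open each HTML file in a temporary workbook, then copy the sheet from that workbook into the workbook that collects all of the sheets: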

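workbook = excel_app.Application.Workbooks.Add()
sheet = workbook.Sheets(1)  # the default sheet, removed at the end
for html_path in read_path.glob('*.html'):
    # pass the path as a string, as the question already does for SaveAs
    workbook_tmp = excel_app.Application.Workbooks.Open(str(html_path))
    workbook_tmp.Sheets(1).Copy(Before=sheet)
    workbook_tmp.Close(SaveChanges=False)
# Remove the redundant 'Sheet1'
excel_app.Application.DisplayAlerts = False  # suppress the delete confirmation dialog
sheet.Delete()
excel_app.Application.DisplayAlerts = True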
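
The copied sheets keep whatever names Excel gave them on import. If you would rather name each sheet after its source file, keep in mind that Excel limits sheet names to 31 characters, rejects the characters : \ / ? * [ ], and treats names as case-insensitive when checking for duplicates. Below is a minimal sketch of a helper that derives an acceptable name from a file stem. The name safe_sheet_name and the '_1', '_2' suffix scheme are my own assumptions for illustration, not part of win32com or Excel.

```python
import re

# Characters Excel does not allow in a worksheet name.
_INVALID_SHEET_CHARS = re.compile(r'[:\\/?*\[\]]')


def safe_sheet_name(stem, used):
    """Return a worksheet name based on `stem` that Excel should accept.

    `used` is a set of lower-cased names already taken in the workbook;
    it is updated in place (hypothetical helper, for illustration only).
    """
    # Replace forbidden characters, cap the length at 31 and drop
    # leading/trailing apostrophes, which Excel also rejects.
    base = _INVALID_SHEET_CHARS.sub('_', stem)[:31].strip("'") or 'Sheet'
    name = base
    counter = 1
    # Append _1, _2, ... until the name is unique, staying within 31 chars.
    while name.lower() in used:
        suffix = '_{}'.format(counter)
        name = base[:31 - len(suffix)] + suffix
        counter += 1
    used.add(name.lower())
    return name
```

Inside the loop, right after the Copy call, the new sheet sits immediately before the default sheet, so something like workbook.Sheets(sheet.Index - 1).Name = safe_sheet_name(stem_of_the_html_file, used) should rename it, with used = set() created before the loop. I have not tested that part against a live Excel instance.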

Answer 1: (score: 0)

I believe pandas will make your job a lot easier.

pip install pandas

Here is an example of how to get multiple tables from a Wikipedia HTML page, load them into pandas DataFrames, and save them to disk.

import pandas as pd
url = "https://en.wikipedia.org/wiki/List_of_American_films_of_2017"
wikitables = pd.read_html(url, header=0, attrs={"class":"wikitable"})
for idx,df in enumerate(wikitables):
    df.to_csv('{}.csv'.format(idx),index=False)

For your use case, something like the following should work:

import pathlib as path
import pandas as pd


def save_as_xlsx(read_directory):
    read_path = path.Path(read_directory)
    for html in read_path.glob('*.html'):
        # read_html returns one DataFrame per <table> found in the file
        dfs_from_html = pd.read_html(str(html), header=0)
        for idx, df in enumerate(dfs_from_html):
            # include the HTML file's stem so that output files from
            # different HTML files do not overwrite each other
            save_path = read_path.joinpath('{}_{}.xlsx'.format(html.stem, idx))
            df.to_excel(str(save_path), index=False)

Note: make sure to set the correct HTML attributes in the pd.read_html function.
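
The snippet in this answer writes separate .xlsx files, whereas the question asks for one workbook with one worksheet per HTML file. A minimal sketch of that variant using pd.ExcelWriter follows. The function name html_files_to_one_workbook and the default file name combined.xlsx are hypothetical, and the sketch assumes an Excel writer engine (such as openpyxl) and an HTML parser (such as lxml) are installed. Tables found in the same HTML file are stacked on that file's sheet with a blank row between them, which is just one possible layout.

```python
from pathlib import Path

import pandas as pd


def html_files_to_one_workbook(read_directory, workbook_name="combined.xlsx"):
    """Write every *.html file in a folder to its own sheet of one workbook.

    Hypothetical helper for illustration; returns the path of the workbook.
    """
    read_path = Path(read_directory)
    save_path = read_path / workbook_name
    html_files = sorted(read_path.glob("*.html"))
    if not html_files:
        raise FileNotFoundError("no .html files in {}".format(read_path))

    with pd.ExcelWriter(save_path) as writer:
        for html_path in html_files:
            # Excel caps sheet names at 31 characters; stems that collide
            # after truncation are not handled in this sketch.
            sheet_name = html_path.stem[:31]
            start_row = 0
            # read_html returns one DataFrame per <table> in the file and
            # raises ValueError if the file contains no tables.
            for df in pd.read_html(str(html_path), header=0):
                df.to_excel(writer, sheet_name=sheet_name,
                            startrow=start_row, index=False)
                # header row + data rows + one blank spacer row
                start_row += len(df) + 2
    return save_path
```

Unlike the win32com approach, this keeps only the table data, so any formatting Excel would have picked up from the HTML is lost.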

Answer 2: (score: -1)

How about this?

Sub From_XML_To_XL()
'UpdatebyKutoolsforExcel20151214
    Dim xWb As Workbook
    Dim xSWb As Workbook
    Dim xStrPath As String
    Dim xFileDialog As FileDialog
    Dim xFile As String
    Dim xCount As Long
    On Error GoTo ErrHandler
    Set xFileDialog = Application.FileDialog(msoFileDialogFolderPicker)
    xFileDialog.AllowMultiSelect = False
    xFileDialog.Title = "Select a folder [Kutools for Excel]"
    If xFileDialog.Show = -1 Then
        xStrPath = xFileDialog.SelectedItems(1)
    End If
    If xStrPath = "" Then Exit Sub
    Application.ScreenUpdating = False
    Set xSWb = ThisWorkbook
    xCount = 1
    xFile = Dir(xStrPath & "\*.xml")
    Do While xFile <> ""
        Set xWb = Workbooks.OpenXML(xStrPath & "\" & xFile)
        xWb.Sheets(1).UsedRange.Copy xSWb.Sheets(1).Cells(xCount, 1)
        xWb.Close False
        xCount = xSWb.Sheets(1).UsedRange.Rows.Count + 2
        xFile = Dir()
    Loop
    Application.ScreenUpdating = True
    xSWb.Save
    Exit Sub
ErrHandler:
    MsgBox "no files xml", , "Kutools for Excel"
End Sub