HTML到Word docx

时间:2018-11-15 19:17:24

标签: python beautifulsoup python-docx

我有BeautifulSoup附带的一些HTML格式的文本。我想通过docx运行命令将所有斜体(标签i),粗体(b)和链接(a href)转换为Word格式。 / p>

我可以做一个段落:

p = document.add_paragraph('text')

我可以将下一个序列添加为粗体/斜体:

p.add_run('bold').bold = True
p.add_run('italic.').italic = True

直觉上,我可以找到所有特定的标签(即soup.find_all('i')),然后观察索引,然后连接部分字符串...

...但是也许有更好,更优雅的方式?

我不希望将HTML页面转换为Word并保存它们的库或解决方案。我想要更多控制权。

我没有字典。这是代码,视觉错误(来自代码)和正确(理想)结果:

from docx import Document
import os
from bs4 import BeautifulSoup

html = '<a href="http://someurl.you">hi, I am link</a> this is some nice regular text. <i> oooh, but I am italic</i> ' \
        ' or I can be <b>bold</b> '\
        ' or even <i><b>bold and italic</b></i>'

def get_tags(text):
    soup = BeautifulSoup(text, "html.parser")
    tags = {}
    tags["i"] = soup.find_all("i")
    tags["b"] = soup.find_all("b")

    return tags


def make_test_word():
    document = Document()

    document.add_heading('Demo HTML', 0)

    soup = BeautifulSoup(html, "html.parser")

    p = document.add_paragraph(html)

    # p.add_run('bold').bold = True
    # p.add_run(' and some ')
    # p.add_run('italic.').italic = True

    file_name="demo_html.docx"
    document.save(file_name)
    os.startfile(file_name)


make_test_word()

enter image description here

2 个答案:

答案 0 :(得分:0)

我刚刚编写了一些代码,将文本从tkinter文本小部件转换为Word文档,包括用户可以添加的任何粗体标签。对于您来说,这不是一个完整的解决方案,但它可能会帮助您开始寻找可行的解决方案。我认为您将需要做一些正则表达式工作才能将超链接转移到Word文档中。堆叠的格式标记也可能会很棘手。我希望这会有所帮助:

from docx import Document
import re
import docx
from docx.shared import Pt
from docx.enum.dml import MSO_THEME_COLOR_INDEX

def add_hyperlink(paragraph, text, url):
    # This gets access to the document.xml.rels file and gets a new relation id value
    part = paragraph.part
    r_id = part.relate_to(url, docx.opc.constants.RELATIONSHIP_TYPE.HYPERLINK, is_external=True)

    # Create the w:hyperlink tag and add needed values
    hyperlink = docx.oxml.shared.OxmlElement('w:hyperlink')
    hyperlink.set(docx.oxml.shared.qn('r:id'), r_id, )

    # Create a w:r element and a new w:rPr element
    new_run = docx.oxml.shared.OxmlElement('w:r')
    rPr = docx.oxml.shared.OxmlElement('w:rPr')

    # Join all the xml elements together add add the required text to the w:r element
    new_run.append(rPr)
    new_run.text = text
    hyperlink.append(new_run)

    # Create a new Run object and add the hyperlink into it
    r = paragraph.add_run ()
    r._r.append (hyperlink)

    # A workaround for the lack of a hyperlink style (doesn't go purple after using the link)
    # Delete this if using a template that has the hyperlink style in it
    r.font.color.theme_color = MSO_THEME_COLOR_INDEX.HYPERLINK
    r.font.underline = True

    return hyperlink

html = '<H1>I want to</H1> <u>convert HTML <a href="http://www.google.com">to docx</a> in <b>bold and <i>bold italic</i></b>.</u>'

html = html.split('<')
html = [html[0]] + ['<'+l for l in html[1:]]
tags = []
doc = Document()
p = doc.add_paragraph()
for run in html:
    tag_change = re.match('(?:<)(.*?)(?:>)', run)
    if tag_change != None:
        tag_strip = tag_change.group(0)
        tag_change = tag_change.group(1)
        if tag_change.startswith('/'):
            if tag_change.startswith('/a'):
                tag_change = next(tag for tag in tags if tag.startswith('a '))
            tag_change = tag_change.strip('/')
            tags.remove(tag_change)
        else:
            tags.append(tag_change)
    else:
        tag_strip = ''
    hyperlink = [tag for tag in tags if tag.startswith('a ')]
    if run.startswith('<'):
        run = run.replace(tag_strip, '')
        if hyperlink:
            hyperlink = hyperlink[0]
            hyperlink = re.match('.*?(?:href=")(.*?)(?:").*?', hyperlink).group(1)
            add_hyperlink(p, run, hyperlink)
        else:
            runner = p.add_run(run)
            if 'b' in tags:
                runner.bold = True
            if 'u' in tags:
                runner.underline = True
            if 'i' in tags:
                runner.italic = True
            if 'H1' in tags:
                runner.font.size = Pt(24)
    else:
        p.add_run(run)
doc.save('test.docx')

我回来了,它使得解析出多个格式标签成为可能。这将保持列表中正在播放的格式标签的计数。在每个标签处,都会创建一个新的运行,并通过播放中的当前标签来设置运行格式。

$twentyfive = '55';
$fifty = '80';
$seventyfive = '95';

借助this question实现了超链接功能。我在这里的担心是,您将需要为要保留到docx的每个HTML标记手动编码。我想那可能是很多。我提供了一些您可能要考虑的标签示例。

答案 1 :(得分:0)

或者,您可以将 html 代码保存为字符串并执行:

from htmldocx import HtmlToDocx

new_parser = HtmlToDocx()
new_parser.parse_html_file("html_filename", "docx_filename")
#Files extensions not needed, but tolerated