Question

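I would like to extract the formatted text from a GtkTextView as HTML or Pango markup.

I have a small text editor with formatting like this, so the formatting elements are simple: <b>, <i>, etc.

Is there a way to get the formatted text out of a TextView?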

Answer 1

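You can use gtk_text_buffer_serialize(). However, the only built-in serializer is GTK's internal text buffer format, so if you want HTML or Pango markup you will have to write the serialization function yourself.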
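To give a feel for what such a hand-written serializer involves, here is a minimal sketch in plain Python. It assumes the formatting has already been read out of the buffer as (start, end, tag name) character ranges; `spans_to_markup` and the `MARKUP` table are hypothetical names used for illustration and are not part of GTK.

```python
from html import escape
from typing import Dict, List, Tuple

# Hypothetical mapping from application tag names to markup elements.
MARKUP: Dict[str, Tuple[str, str]] = {
    "bold": ("<b>", "</b>"),
    "italic": ("<i>", "</i>"),
    "underline": ("<u>", "</u>"),
}


def spans_to_markup(text: str, spans: List[Tuple[int, int, str]]) -> str:
    """Render text with (start, end, tag) ranges as simple markup.

    Every run of text with a constant set of tags is wrapped on its own,
    so overlapping ranges never produce badly nested elements.
    """
    # Positions at which the set of active tags can change.
    cuts = sorted({0, len(text), *(p for s, e, _ in spans for p in (s, e))})
    out = []
    for left, right in zip(cuts, cuts[1:]):
        # A tag is active when one of its ranges covers the whole run.
        active = [name for name in MARKUP
                  if any(s <= left and right <= e and tag == name
                         for s, e, tag in spans)]
        opening = "".join(MARKUP[name][0] for name in active)
        closing = "".join(MARKUP[name][1] for name in reversed(active))
        out.append(opening + escape(text[left:right], quote=False) + closing)
    return "".join(out)


example = spans_to_markup("bold and italic",
                          [(0, 8, "bold"), (5, 15, "italic")])
print(example)  # <b>bold </b><b><i>and</i></b><i> italic</i>
```

Closing and reopening elements at every boundary gives verbose output, but it keeps the sketch short and the result well formed.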

I wrote a GtkTextBuffer serializer for RTF a few years ago. I don't know whether it will help you or encourage you to write your own.

Answer 2

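I needed to convert the content of a Gtk TextBuffer with Pango rich-text formatting into HTML markup (the format in which the application stores its data), similar to what you are asking for.

I couldn't find a simple out-of-the-box way to do it, but I ended up writing my own converter from the Gtk serialized content to HTML.

It uses html.parser, which is part of the standard library, and since we already had BeautifulSoup4 as a dependency, it makes use of that as well.

First, we define a class derived from Gtk.TextBuffer that overrides the get_text method to return the content either as plain text or, when include_hidden_chars is set, as HTML: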

from typing import Optional

import gi
gi.require_version("Gtk", "3.0")
from gi.repository import Gtk

from pango_html import PangoToHtml  # the converter class shown further down


class PangoBuffer(Gtk.TextBuffer):

    def get_text(self,
                 start: Optional[Gtk.TextIter] = None,
                 end: Optional[Gtk.TextIter] = None,
                 include_hidden_chars: bool = False) -> str:
        """Get the buffer content.

        If `include_hidden_chars` is set, then the html markup content is
        returned. If False, then the text only is returned."""
        if start is None:
            start = self.get_start_iter()
        if end is None:
            end = self.get_end_iter()

        if include_hidden_chars is False:
            return super().get_text(start, end, include_hidden_chars=False)
        else:
            # Serialize with GTK's built-in format, then convert it to HTML.
            format_ = self.register_serialize_tagset()
            content = self.serialize(self, format_, start, end)
            return PangoToHtml().feed(content)

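The important part is in the else block. I had hoped to write my own serializer, but documentation is scarce. We therefore use the built-in serializer, which returns binary content.

This content is basically XML markup with an extra header and footer: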

# Truncated for legibility.

GTKTEXTBUFFERCONTENTS-0001\x00\x00\x07Z
<text_view_markup>
    <tags>
        <tag id="12" priority="12"> </tag>  # Tags can be empty
        <tag name="italic" priority="2">
            <attr name="style" type="PangoStyle" value="PANGO_STYLE_ITALIC" />
        </tag>
        <tag id="7" priority="7">
            <attr name="background-gdk" type="GdkColor" value="0:0:ffff" />
            <attr name="style" type="PangoStyle" value="PANGO_STYLE_ITALIC" />
            <attr name="weight" type="gint" value="700" />
        </tag>
    </tags>
    <text>
        <apply_tag name="italic">This is italic</apply_tag>
        <apply_tag id="1">. </apply_tag>
        <apply_tag id="2">This is italic</apply_tag>
        <apply_tag id="3">\n            </apply_tag>
        <apply_tag id="7">This is bold, italic, and has background colouring.</apply_tag>
    </text>
</text_view_markup>

From this we can see that the tags are not sorted, and that they can have either an id or a name.

Tags that carry an id are called anonymous tags, and are typically created by Pango when deserializing content.

Named tags are usually the ones defined in your own application:

buffer = Gtk.TextBuffer()  # create_tag is called on a buffer instance
tag_bold = buffer.create_tag("bold", weight=Pango.Weight.BOLD)
tag_italic = buffer.create_tag("italic", style=Pango.Style.ITALIC)
tag_underline = buffer.create_tag("underline", underline=Pango.Underline.SINGLE)
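The distinction between named and anonymous tags can be read straight out of the `<tags>` section. The following is a small sketch using only the standard library's xml.etree.ElementTree rather than the BeautifulSoup approach used in this answer; `describe_tags` is a hypothetical helper name and the XML is a shortened copy of the sample above.

```python
import xml.etree.ElementTree as ET

TAGS_XML = """
<tags>
  <tag id="12" priority="12"> </tag>
  <tag name="italic" priority="2">
    <attr name="style" type="PangoStyle" value="PANGO_STYLE_ITALIC" />
  </tag>
  <tag id="7" priority="7">
    <attr name="background-gdk" type="GdkColor" value="0:0:ffff" />
    <attr name="weight" type="gint" value="700" />
  </tag>
</tags>
"""


def describe_tags(xml_text: str) -> list:
    """Return (kind, key, priority, attributes) tuples sorted by priority."""
    described = []
    for tag in ET.fromstring(xml_text.strip()).iter("tag"):
        # A tag carries either a name (application tag) or an id (anonymous).
        if "name" in tag.attrib:
            kind, key = "named", tag.attrib["name"]
        else:
            kind, key = "anonymous", tag.attrib["id"]
        attributes = {a.get("name"): a.get("value") for a in tag.iter("attr")}
        described.append((kind, key, int(tag.get("priority")), attributes))
    return sorted(described, key=lambda item: item[2])


rows = describe_tags(TAGS_XML)
for row in rows:
    print(row)
```

Sorting by the priority attribute is only there to make the listing easier to read; the serialized order itself is arbitrary.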

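The header contains binary bytes (a checksum or length marker) that may not decode when calling bytes.decode, so the header must be stripped before decoding the rest into an XML string.

The PangoToHtml class then does the actual work: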
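As a small illustration of why the header has to go first, the sketch below feeds a made-up payload whose header bytes are not valid UTF-8. The `strip_header` helper and the payload are assumptions for the example; only the `<text_view_markup>` marker comes from the serialized format shown above.

```python
MARKER = b"<text_view_markup>"


def strip_header(raw: bytes) -> str:
    """Drop everything before the XML root element, then decode as UTF-8."""
    start = raw.find(MARKER)
    if start == -1:
        raise ValueError("not a serialized text buffer")
    return raw[start:].decode("utf-8")


# Made-up payload: the 0xe9 byte in the header is not valid UTF-8 here.
raw = (b"GTKTEXTBUFFERCONTENTS-0001\x00\x00\x01\xe9 "
       b"<text_view_markup><tags></tags><text>hi</text></text_view_markup>")

try:
    raw.decode("utf-8")
except UnicodeDecodeError as error:
    print("decoding the raw bytes fails:", error.reason)

print(strip_header(raw))
```

Checking the result of `find` explicitly avoids silently slicing from index -1 when the marker is missing.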

from html.parser import HTMLParser
from typing import Dict, List, Tuple

from bs4 import BeautifulSoup
from bs4.element import Tag
from gi.repository import Pango


class PangoToHtml(HTMLParser):
    """Decode a subset of Pango markup and serialize it as HTML.

    Only the Pango markup used within Gourmet is handled, although expanding it
    is not difficult.

    Due to the way that Pango attributes work, the HTML is not necessarily the
    simplest. For example italic tags may be closed early and reopened if other
    attributes, eg. bold, are inserted mid-way:

        <i> italic text </i><i><u>and underlined</u></i>

    This means that the HTML resulting from the conversion by this object may
    differ from the original that was fed to the caller.
    """
    def __init__(self):
        super().__init__()
        self.markup_text: str = ""  # the resulting content
        self.current_opening_tags: str = ""  # used during parsing
        self.current_closing_tags: List = []  # used during parsing

        # The key is the Pango id of a tag, and the value is a tuple of opening
        # and closing html tags for this id.
        self.tags: Dict[str, Tuple[str, str]] = {}

        # Optionally, links can be specified, in a {link text: target} format.
        self.links: Dict[str, str] = {}

        # If links are specified, it is possible to ignore them, as is done with
        # time links.
        self.ignore_links: bool = False

        # Used as heuristics for parsing links, when applicable.
        self.is_colored_and_underlined: bool = False

    tag2html: Dict[str, Tuple[str, str]] = {
        Pango.Style.ITALIC.value_name: ("<i>", "</i>"),  # Pango doesn't do <em>
        str(Pango.Weight.BOLD.real): ("<b>", "</b>"),
        Pango.Underline.SINGLE.value_name: ("<u>", "</u>"),
        "foreground-gdk": (r'<span foreground="{}">', "</span>"),
        "background-gdk": (r'<span background="{}">', "</span>")
    }

    @staticmethod
    def pango_to_html_hex(val: str) -> str:
        """Convert a Pango colour string to an HTML hex colour.

        Pango strings have the format 'ffff:ffff:ffff' (for white), with 16
        bits per channel. Each channel is scaled down to 8 bits and the
        results are joined into a single string: '#ffffff'.
        """
        red, green, blue = val.split(":")
        red = hex(255 * int(red, base=16) // 65535)[2:].zfill(2)
        green = hex(255 * int(green, base=16) // 65535)[2:].zfill(2)
        blue = hex(255 * int(blue, base=16) // 65535)[2:].zfill(2)
        return f"#{red}{green}{blue}"

    def feed(self, data: bytes) -> str:
        """Convert a buffer (its text and iterators) to an HTML string.

        Unlike an HTMLParser, the whole string must be passed at once, chunks
        are not supported.

        Optionally, a dictionary of links, in the format {text: target}, can be
        specified. Links will be inserted if some text in the markup will be
        coloured, underlined, and matching an entry in the dictionary.

        If `ignore_links` is set, along with the `links` dictionary, then links
        will be serialized as regular text, and the link targets will be lost.
        """
        # Remove the Pango header: it contains a length mark, which we don't
        # care about, but which does not necessarily decode as valid chars.
        header_end = data.find(b"<text_view_markup>")
        data = data[header_end:].decode()

        # Get the tags
        tags_begin = data.index("<tags>")
        tags_end = data.index("</tags>") + len("</tags>")
        tags = data[tags_begin:tags_end]
        data = data[tags_end:]

        # Get the textual content
        text_begin = data.index("<text>")
        text_end = data.index("</text>") + len("</text>")
        text = data[text_begin:text_end]

        # Convert the tags to html.
        # We know that only a subset of HTML is handled in Gourmet:
        # italics, bold, underlined, normal, and links (coloured & underlined)
        soup = BeautifulSoup(tags, features="lxml")
        tags = soup.find_all("tag")

        tags_list = {}
        for tag in tags:
            opening_tags = ""
            closing_tags = ""

            # The tag may have a name, for named tags, or else an id
            tag_name = tag.attrs.get('id')
            tag_name = tag.attrs.get('name', tag_name)

            attributes = [c for c in tag.contents if isinstance(c, Tag)]
            for attribute in attributes:
                vtype = attribute['type']
                value = attribute['value']
                name = attribute['name']

                if vtype == "GdkColor":  # Convert colours to html
                    if name in ['foreground-gdk', 'background-gdk']:
                        opening, closing = self.tag2html[name]
                        hex_color = self.pango_to_html_hex(value)
                        opening = opening.format(hex_color)
                    else:
                        continue  # no idea!
                else:
                    opening, closing = self.tag2html[value]

                opening_tags += opening
                closing_tags = closing + closing_tags   # closing tags are FILO

            tags_list[tag_name] = opening_tags, closing_tags

        self.tags = tags_list

        # Create a single output string that will be sequentially appended to
        # during feeding of text. It can then be returned once we've parsed
        # all of the text.
        self.markup_text = ""
        self.current_opening_tags = ""
        self.current_closing_tags = []  # Closing tags are FILO
        self.is_colored_and_underlined = False

        super().feed(text)

        return self.markup_text

    def handle_starttag(self, tag: str, attrs: List[Tuple[str, str]]) -> None:
        # The pango tags are either "apply_tag", or "text". We only really care
        # about the "apply_tag". There could be an assert, but we let the
        # parser quietly handle nonsense.
        if tag == "apply_tag":
            attrs = dict(attrs)
            tag_name = attrs.get('id')  # A tag may have a name, or else an id
            tag_name = attrs.get('name', tag_name)
            tags = self.tags.get(tag_name)

            if tags is not None:
                self.current_opening_tags, closing_tag = tags
                self.current_closing_tags.append(closing_tag)

    def handle_data(self, data: str) -> None:
        data = self.current_opening_tags + data
        self.markup_text += data

    def handle_endtag(self, tag: str) -> None:
        if self.current_closing_tags:  # Can be empty due to closing "text" tag
            self.markup_text += self.current_closing_tags.pop()
        self.current_opening_tags = ""

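According to the HTMLParser documentation, the class serves as the basis for parsing text files formatted in HTML (HyperText Markup Language) and XHTML. We know that we need to handle start tags, end tags, and the content between them.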
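To see which callbacks fire and in what order, here is a minimal sketch of an HTMLParser subclass that only records events. `EventLogger` is a hypothetical name; the input is a shortened `<text>` fragment in the style of the serialized content.

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Record every start tag, text chunk and end tag the parser reports."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag, dict(attrs)))

    def handle_data(self, data):
        self.events.append(("data", data))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))


logger = EventLogger()
logger.feed('<text>ddf<apply_tag name="bold">fd</apply_tag></text>')
logger.close()
for event in logger.events:
    print(event)
```

The parser does not need to know the `apply_tag` element: unknown tag names are reported like any other, which is what makes it usable for this XML-like content.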

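In the serialized content, tags are referenced by their name or id, so they have to be processed beforehand.

For that part I chose BeautifulSoup, as it offers an easy way to walk through the XML tags in a simple loop.

Could the whole thing be done with BeautifulSoup alone, or with the html library alone? Probably, but I needed support for various kinds of links, so the end result is somewhat different, and I needed the flexibility that HTMLParser offers.

Here is a basic unit test for PangoToHtml: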
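On the single-library question, the sketch below tries the conversion with only the standard library's xml.etree.ElementTree. It is an assumption-laden illustration and not the converter used in this answer: `serialized_to_html` and `ATTR_TO_HTML` are hypothetical names, colours and links are left out, and the content is assumed to be well-formed XML once the header is stripped.

```python
import xml.etree.ElementTree as ET
from html import escape

# Simplified, hypothetical mapping; colour attributes are left out.
ATTR_TO_HTML = {
    ("style", "PANGO_STYLE_ITALIC"): ("<i>", "</i>"),
    ("weight", "700"): ("<b>", "</b>"),
    ("underline", "PANGO_UNDERLINE_SINGLE"): ("<u>", "</u>"),
}


def serialized_to_html(raw: bytes) -> str:
    """Convert serialized buffer bytes to HTML using only ElementTree."""
    root = ET.fromstring(raw[raw.index(b"<text_view_markup>"):].decode())

    # Map each tag's name or id to its (opening, closing) HTML strings.
    html_tags = {}
    for tag in root.iter("tag"):
        key = tag.get("name", tag.get("id"))
        pairs = [ATTR_TO_HTML[(a.get("name"), a.get("value"))]
                 for a in tag.iter("attr")
                 if (a.get("name"), a.get("value")) in ATTR_TO_HTML]
        opening = "".join(p[0] for p in pairs)
        closing = "".join(p[1] for p in reversed(pairs))
        html_tags[key] = (opening, closing)

    def render(element) -> str:
        # Text directly inside the element, then each child and its tail.
        parts = [escape(element.text or "", quote=False)]
        for child in element:
            key = child.get("name", child.get("id"))
            opening, closing = html_tags.get(key, ("", ""))
            parts.append(opening + render(child) + closing)
            parts.append(escape(child.tail or "", quote=False))
        return "".join(parts)

    return render(root.find("text"))


sample = (
    b'GTKTEXTBUFFERCONTENTS-0001\x00\x00\x01i <text_view_markup>\n <tags>\n'
    b'  <tag name="italic" priority="1">\n'
    b'   <attr name="style" type="PangoStyle" value="PANGO_STYLE_ITALIC" />\n'
    b'  </tag>\n'
    b'  <tag name="bold" priority="0">\n'
    b'   <attr name="weight" type="gint" value="700" />\n'
    b'  </tag>\n </tags>\n'
    b'<text>ddf<apply_tag name="bold">fd<apply_tag name="italic">df'
    b'</apply_tag>fd</apply_tag>dff</text>\n</text_view_markup>\n'
)
print(serialized_to_html(sample))  # ddf<b>fd<i>df</i>fd</b>dff
```

One difference worth noting: this sketch re-escapes characters such as `<` and `&` in the text, and it has no hook for the link heuristics that motivated the HTMLParser design.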

from pango_html import PangoToHtml


def test_convert_colors_to_html():
    val = "0:0:0"
    ret = PangoToHtml.pango_to_html_hex(val)
    assert ret == "#000000"

    val = "ffff:0:0"
    ret = PangoToHtml.pango_to_html_hex(val)
    assert ret == "#ff0000"

    val = "0:ffff:0"
    ret = PangoToHtml.pango_to_html_hex(val)
    assert ret == "#00ff00"

    val = "0:0:ffff"
    ret = PangoToHtml.pango_to_html_hex(val)
    assert ret == "#0000ff"

    val = "ffff:ffff:ffff"
    ret = PangoToHtml.pango_to_html_hex(val)
    assert ret == "#ffffff"

    val = "0:00000000:ffff"  # add some arbitrary amounts of leading zeroes
    ret = PangoToHtml.pango_to_html_hex(val)
    assert ret == "#0000ff"

    val = "ff00:d700:0000"
    ret = PangoToHtml.pango_to_html_hex(val)
    assert ret == "#fed600"  # Gold

    val = "ffff:1414:9393"
    ret = PangoToHtml.pango_to_html_hex(val)
    assert ret == "#ff1493"  # Deep Pink

    val = "4747:5f5f:9494"
    ret = PangoToHtml.pango_to_html_hex(val)
    assert ret == "#475f94"  # Some Blue

    val = "00fd:ffdc:ff5c"
    ret = PangoToHtml.pango_to_html_hex(val)
    assert ret == "#00fefe"  # Some other blue


def test_pango_markup_to_html():
    # These are examples found throughout the application

    pango_markup = b'GTKTEXTBUFFERCONTENTS-0001\x00\x00\x07Z <text_view_markup>\n <tags>\n  <tag id="12" priority="12">\n  </tag>\n  <tag id="2" priority="2">\n   <attr name="style" type="PangoStyle" value="PANGO_STYLE_ITALIC" />\n  </tag>\n  <tag id="8" priority="8">\n  </tag>\n  <tag id="3" priority="3">\n  </tag>\n  <tag id="7" priority="7">\n   <attr name="background-gdk" type="GdkColor" value="0:0:ffff" />\n  </tag>\n  <tag id="4" priority="4">\n   <attr name="style" type="PangoStyle" value="PANGO_STYLE_ITALIC" />\n   <attr name="weight" type="gint" value="700" />\n  </tag>\n  <tag id="5" priority="5">\n   <attr name="style" type="PangoStyle" value="PANGO_STYLE_ITALIC" />\n   <attr name="weight" type="gint" value="700" />\n   <attr name="underline" type="PangoUnderline" value="PANGO_UNDERLINE_SINGLE" />\n  </tag>\n  <tag id="0" priority="0">\n   <attr name="weight" type="gint" value="700" />\n  </tag>\n  <tag id="1" priority="1">\n  </tag>\n  <tag id="6" priority="6">\n  </tag>\n  <tag id="9" priority="9">\n   <attr name="foreground-gdk" type="GdkColor" value="0:0:ffff" />\n  </tag>\n  <tag id="11" priority="11">\n   <attr name="background-gdk" type="GdkColor" value="0:0:ffff" />\n   <attr name="foreground-gdk" type="GdkColor" value="ffff:ffff:ffff" />\n  </tag>\n  <tag id="10" priority="10">\n  </tag>\n </tags>\n<text><apply_tag id="0">This is bold</apply_tag><apply_tag id="1">. </apply_tag><apply_tag id="2">This is italic</apply_tag><apply_tag id="3">\n            </apply_tag><apply_tag id="4">This is bold, italic, and </apply_tag><apply_tag id="5">underlined!</apply_tag><apply_tag id="6">\n            </apply_tag><apply_tag id="7">This is a test of bg color</apply_tag><apply_tag id="8">\n            </apply_tag><apply_tag id="9">This is a test of fg color</apply_tag><apply_tag id="10">\n            </apply_tag><apply_tag id="11">This is a test of fg and bg color</apply_tag><apply_tag id="12">\n           +</apply_tag></text>\n</text_view_markup>\n'  # noqa
    expected = '<b>This is bold</b>. <i>This is italic</i>\n            <i><b>This is bold, italic, and </b></i><i><b><u>underlined!</u></b></i>\n            <span background="#0000ff">This is a test of bg color</span>\n            <span foreground="#0000ff">This is a test of fg color</span>\n            <span background="#0000ff"><span foreground="#ffffff">This is a test of fg and bg color</span></span>\n           +'  # noqa

    ret = PangoToHtml().feed(pango_markup)
    assert ret == expected

    pango_markup = b'GTKTEXTBUFFERCONTENTS-0001\x00\x00\x01i <text_view_markup>\n <tags>\n  <tag name="italic" priority="1">\n   <attr name="style" type="PangoStyle" value="PANGO_STYLE_ITALIC" />\n  </tag>\n  <tag name="bold" priority="0">\n   <attr name="weight" type="gint" value="700" />\n  </tag>\n </tags>\n<text>ddf<apply_tag name="bold">fd<apply_tag name="italic">df</apply_tag>fd</apply_tag>dff</text>\n</text_view_markup>\n'  # noqa
    expected = 'ddf<b>fd<i>df</i>fd</b>dff'

    ret = PangoToHtml().feed(pango_markup)
    assert ret == expected
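The colour tests above show that "ff00:d700:0000" becomes "#fed600" rather than the familiar gold "#ffd700". That follows from the scaling arithmetic in pango_to_html_hex, which the sketch below isolates per channel and compares with a hypothetical alternative that simply keeps the high byte. Both function names are made up for this example.

```python
def scale_channel(value: str) -> str:
    """Scale one 16-bit hex channel (0..ffff) to two 8-bit hex digits.

    Same arithmetic as pango_to_html_hex: 255 * value // 65535.
    """
    return format(255 * int(value, 16) // 65535, "02x")


def high_byte(value: str) -> str:
    """Alternative (not used by the converter): keep only the high byte."""
    return format(int(value, 16) >> 8, "02x")


for channel in ("ff00", "d700", "ffff"):
    print(channel, scale_channel(channel), high_byte(channel))
```

For "ff00" the scaled value is 254 ("fe") because only the full "ffff" maps to 255, whereas the high byte is "ff"; which behaviour is preferable depends on how the colours were originally written into the buffer.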
