Question

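I have worked in dozens of languages but am new to Python.

This is my first (maybe second) question here, so be gentle...

I am trying to efficiently convert HTML-like markup text to wiki format (specifically, Linux Tomboy/GNote notes to Zim) and have gotten stuck on converting lists.

For a 2-level unordered list like this...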

- First level
  - Second level

Tomboy/GNote uses something like...

<list><list-item>First level<list><list-item>Second level</list-item></list></list-item></list>

However, the Zim personal wiki wants that to be...

* First level
  * Second level

...with leading tabs.

I have explored the regex module functions re.sub(), re.match(), re.search(), etc., and found the cool Python ability to code repeated text as...

 count * "text"
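
As a quick illustration of that repetition operator, here is a minimal sketch. The names `level`, `indent`, and `line` are placeholders chosen for this example only.

```python
level = 2
indent = level * "\t"             # two tab characters
line = indent + "* Second level"  # a bullet indented by two tabs
print(repr(line))                 # repr() makes the tabs visible
```

Multiplying by 0 yields an empty string, so a top-level item would get no indentation at all.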

So it looks like there should be a way to do something like...

 newnote = re.sub("<list>", LEVEL * "\t", oldnote)

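where LEVEL is the ordinal (occurrence) of <list> in the note. It would thus be 0 for the first <list> encountered, 1 for the second, and so on.

LEVEL would then be decremented each time </list> is encountered.

<list-item> tags are converted to the asterisk for the bullet (preceded by a newline where appropriate), and </list-item> tags are dropped.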

Finally... the question...

How do I get the value of LEVEL and use it as a multiplier for the tabs?

Answer 1

You should really use an XML parser for this, but to answer your question:

import re

def next_tag(s, tag):
    i = -1
    while True:
        try:
            i = s.index(tag, i+1)
        except ValueError:
            return
        yield i

a = "<list><list-item>First level<list><list-item>Second level</list-item></list></list-item></list>"

a = a.replace("<list-item>", "* ")

for LEVEL, ind in enumerate(next_tag(a, "<list>")):
    # Replace only the first remaining <list>; LEVEL tabs indent the bullet that follows
    a = re.sub("<list>", "\n" + LEVEL * "\t", a, count=1)

a = a.replace("</list-item>", "")
a = a.replace("</list>", "")

print(a)

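This works for your example, and only your example. Use an XML parser. You can use xml.dom.minidom (it is included with Python, at least 2.7, so there is nothing to download):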

import xml.dom.minidom

def parseList(el, lvl=0):
    txt = ""
    indent = "\t" * (lvl)
    for item in el.childNodes:
        # These are the <list-item>s: They can have text and nested <list> tag
        for subitem in item.childNodes:
            if subitem.nodeType == xml.dom.minidom.Node.TEXT_NODE:
                # This is the text before the next <list> tag
                txt += "\n" + indent + "* " + subitem.nodeValue
            else:
                # This is the next list tag, its indent level is incremented
                txt += parseList(subitem, lvl=lvl+1)
    return txt

def parseXML(s):
    doc = xml.dom.minidom.parseString(s)
    return parseList(doc.firstChild)

a = "<list><list-item>First level<list><list-item>Second level</list-item><list-item>Second level 2<list><list-item>Third level</list-item></list></list-item></list></list-item></list>"
print(parseXML(a))

Output:

* First level
    * Second level
    * Second level 2
        * Third level
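
For completeness, the LEVEL counter described in the question (up at each <list>, down at each </list>) can also be sketched with re.sub by passing a function as the replacement. This is only an illustration, not a replacement for a real parser: the function name `tomboy_list_to_zim` and the helper `replace_tag` are made up for this example, and it assumes well-formed input whose tags carry no attributes.

```python
import re

def tomboy_list_to_zim(note):
    """Turn nested <list>/<list-item> markup into tab-indented '* ' bullets."""
    depth = -1  # becomes 0 at the first <list>

    def replace_tag(match):
        nonlocal depth
        tag = match.group(0)
        if tag == "<list>":
            depth += 1          # entering a list: one level deeper
            return ""
        if tag == "</list>":
            depth -= 1          # leaving a list: back up one level
            return ""
        if tag == "<list-item>":
            # New bullet on its own line, indented by the current depth
            return "\n" + depth * "\t" + "* "
        return ""               # </list-item> is simply dropped

    # re.sub calls replace_tag once per matched tag, left to right
    result = re.sub(r"</?list(?:-item)?>", replace_tag, note)
    return result.lstrip("\n")

note = ("<list><list-item>First level<list><list-item>Second level"
        "</list-item></list></list-item></list>")
print(tomboy_list_to_zim(note))
```

Because the callback runs in document order, `depth` always reflects how many lists are currently open, which is the multiplier the question asks for. Malformed or attribute-bearing markup would still need the parser-based approach above.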

Answer 2

Use Beautiful Soup; it lets you iterate over tags even when they are custom ones. It is very practical for this kind of operation.

from BeautifulSoup import BeautifulSoup
tags = "<list><list-item>First level<list><list-item>Second level</list-item></list></list-item></list>"
soup = BeautifulSoup(tags)
print [[ item.text for item in list_tag('list-item')]  for list_tag in soup('list')]

Output : [[u'First level'], [u'Second level']]

I used a nested list comprehension, but you can use nested for loops instead:

for list_tag in soup('list'):
     for item in list_tag('list-item'):
         print item.text

I hope that helps.

In my example I used BeautifulSoup 3, but it should also work with BeautifulSoup 4; only the import changes.

from bs4 import BeautifulSoup

Convert an HTML list (<li>) to tabs (i.e. indentation)
