Question

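So I have a large chunk of XML (for example: https://www.goodreads.com/author/list/20598?format=xml&key=pVrw9BAFGMTuvfj4Y8VHQ), and I want to search it for every occurrence of the string <title>, then parse the text to get the actual title, temporarily assign it to a variable, and then append that variable to a list.

In other words, go through this XML and get a list of titles.

So my questions (I have seen plenty of similar things while searching, but nothing exactly the same):

1 - How do I walk through the entire text and stop at each occurrence of <title> to do what I describe here?

2 - How exactly should I parse out the title? That is, I want to capture the string that occurs between <title> and </title>.

Thanks in advance.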

Answer 1

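Assuming that by <title> you mean title tags, any halfway decent XML parser will do this easily: it notifies you whenever it finds a title tag, and you then extract the text inside that tag (the title you want).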
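
For illustration, here is a minimal sketch of the parser approach using Python's standard-library xml.etree.ElementTree. The sample document and its element names other than title are made up for the example; the real Goodreads response may be structured differently, and you would pass in the downloaded XML text instead.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; stands in for the XML text returned by the API.
sample_xml = """<response>
  <author>
    <books>
      <book><title>First Book</title></book>
      <book><title>Second Book</title></book>
    </books>
  </author>
</response>"""

# Parse the string into an element tree and get its root element
root = ET.fromstring(sample_xml)

# iter("title") walks the whole tree and yields every <title> element,
# however deeply it is nested
titles = [elem.text for elem in root.iter("title")]
print(titles)
```

Because the parser understands the document structure, a <title> that appears inside an XML comment is not reported as an element.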

Answer 2

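As everyone has mentioned, there are plenty of XML parsers. But if you want to do it yourself, here is a function that will work, except when the title element flags (I don't know what they are technically called) appear inside commented-out text or in some otherwise illegal part of the text.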

def extract_text_between_flags(inputText, flagBegin, flagEnd):
    # Instantiate an empty list to store the results
    excerpts = list()

    # Find the first occurrence of the begin flag
    indexBegin = inputText.find(flagBegin)
    # Until the begin flag is no longer found
    while indexBegin > -1:
        # From the current begin flag location, search forward to the first
        # occurrence of the end flag
        indexEnd = inputText.find(flagEnd, indexBegin + len(flagBegin))
        # If the end flag is not found, stop searching (find returns -1)
        if indexEnd == -1:
            break
        # Extract the relevant passage from the text and add it to the list
        excerpt = inputText[indexBegin + len(flagBegin):indexEnd]
        excerpts.append(excerpt)

        # Set the new search starting point as the next occurrence of the
        # begin flag, looking only past the end flag just consumed
        indexBegin = inputText.find(flagBegin, indexEnd + len(flagEnd))

    return excerpts

# myXMLString is assumed to hold the XML text as a str
titles = extract_text_between_flags(myXMLString, '<title>', '</title>')
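
If a more compact do-it-yourself version is preferred, the same search can probably be expressed with a regular expression. This is only a sketch: extract_titles and the sample string are hypothetical names introduced here, and it shares the limitation noted above (it will also match title tags inside commented-out text).

```python
import re

def extract_titles(xml_text):
    # Non-greedy (.*?) stops at the first closing tag after each opening tag;
    # DOTALL lets a title span more than one line
    return re.findall(r"<title>(.*?)</title>", xml_text, flags=re.DOTALL)

# Hypothetical sample input
sample = ("<book><title>First Book</title></book>"
          "<book><title>Second Book</title></book>")
print(extract_titles(sample))
```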

Python: search for a string, parse the text after that string, add it to a list
