Question

I can't figure out how to use xpath to get formatted text from markup that contains many <span>s and <p>s.

In this form:

<span>This</span>
<span> is</span>
<span> main</span>
<span> text</span>
<p><span>First</span>
   <span>p-tag</span>
</p>
<p> second p tag ....

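So there are two types of tags. <p> means that the text inside this tag is on a new line. The text itself is split into many substrings inside <span> tags.

The problem is that some of the text is not inside a <p>.

From the snippet above, I would like to get (for example, in a list):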

['This is main text','First p-tag','seco....]

This works only for text that is inside a <p>:

def get_popis_url(url):
    root = get_root(url)
    ps = root.xpath('//div[@class="description"]/p')
    for p in ps:
        text = p.xpath('string()').replace('&nbsp',' ').strip()
        print(text)

So the result for the html snippet above is:

First p-tag
second p tag

Do you have any ideas?

Answer 1

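I'm not sure which Python library you are using (xml.etree?). But as far as the XPath goes, try './/div[@class="description"]//*'. This selects all descendant elements of the div.

With xml.etree it looks like this (assuming the HTML source is given as a string):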

def get_popis_url(html_source):
    import xml.etree.ElementTree as ET
    root = ET.fromstring(html_source)
    ps = root.findall('.//div[@class="description"]//*')
    for p in ps:
        text = p.text
        if text:
            print(text.replace('&nbsp;',' ').strip())
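
Note that p.text only holds the text before an element's first child, so text that follows a closing tag (stored in .tail) is skipped, and the output is not grouped per <p>. Below is a minimal sketch of one way to build the list the question asks for, using only xml.etree. It rests on several assumptions: the markup is well-formed XML, the snippet sits directly inside the div with class "description" taken from the question's XPath, and the second <p> is closed (the original snippet is truncated there). The names normalize and text_blocks and the <root> wrapper are made up for illustration.

```python
import xml.etree.ElementTree as ET


def normalize(text):
    """Collapse runs of whitespace (including newlines) into single spaces."""
    return " ".join(text.split())


def text_blocks(container):
    """Return the text of `container` as a list, one item per <p>.

    Text outside any <p> is collected into its own item(s).
    """
    blocks = []
    loose = []  # text fragments seen outside any <p>

    def flush():
        joined = normalize("".join(loose))
        if joined:
            blocks.append(joined)
        del loose[:]

    loose.append(container.text or "")
    for child in container:
        if child.tag == "p":
            flush()  # a <p> starts a new item
            paragraph = normalize("".join(child.itertext()))
            if paragraph:
                blocks.append(paragraph)
        else:
            loose.append("".join(child.itertext()))
        # .tail is the text that follows the child's closing tag
        loose.append(child.tail or "")
    flush()
    return blocks


# Sample input modelled on the question's snippet (assumed structure).
html_source = """<root>
<div class="description">
<span>This</span>
<span> is</span>
<span> main</span>
<span> text</span>
<p><span>First</span>
   <span>p-tag</span>
</p>
<p> second p tag</p>
</div>
</root>"""

root = ET.fromstring(html_source)
div = root.find('.//div[@class="description"]')
print(text_blocks(div))
# prints ['This is main text', 'First p-tag', 'second p tag']
```

Real-world HTML with unclosed tags or entities such as &nbsp; will not parse with ET.fromstring. In that case a lenient HTML parser (for example lxml.html) would probably be needed, and the same walk over .text, itertext() and .tail should carry over.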

Xpath - get text separated by <p> tags
