Question

I don't want lxml to add anything to plain text; I leave those strings as they are on purpose. lxml wraps plain text in a <p> tag. Here each value may be HTML or plain text. I need lxml to process the HTML and leave the plain text untouched.

import lxml.html

mixed = ['<a>HTML</a>', 'plaintext', '<a>HTML</a>']
for text in mixed:
    html = lxml.html.fromstring(text)
    print(lxml.html.tostring(html))

Output:

b'<a>HTML</a>'
b'<p>plaintext</p>'
b'<a>HTML</a>'

What I need is:

b'<a>HTML</a>'
b'plaintext'
b'<a>HTML</a>'

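So I have a couple of questions.

How can I tell that a fragment is plain text with no HTML tags (so that I don't have to pass it to lxml)? Or
How can I stop lxml from adding the <p> tag to plain text?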
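
For the first question, one possible approach is to check for tags before handing anything to lxml. The sketch below is not from the original post: it uses the standard-library html.parser module for detection, and the names _TagDetector, contains_html and process_mixed are hypothetical helpers made up for illustration. It treats a string as HTML only if the parser reports at least one start tag, so unusual input (for example, text containing something that merely looks like a tag) may need extra handling.

```python
from html.parser import HTMLParser


class _TagDetector(HTMLParser):
    """Hypothetical helper: remembers whether any start tag was seen."""

    def __init__(self):
        super().__init__()
        self.saw_tag = False

    def handle_starttag(self, tag, attrs):
        # Called for every opening tag, including self-closing ones.
        self.saw_tag = True


def contains_html(text):
    """Return True if the parser finds at least one HTML start tag in text."""
    detector = _TagDetector()
    detector.feed(text)
    detector.close()
    return detector.saw_tag


def process_mixed(values, process_html):
    """Apply process_html only to values that contain tags.

    process_html is an assumed callback; in the question's setting it would
    wrap the lxml.html.fromstring / lxml.html.tostring calls.
    """
    return [process_html(v) if contains_html(v) else v for v in values]


if __name__ == "__main__":
    mixed = ['<a>HTML</a>', 'plaintext', '<a>HTML</a>']
    for value in mixed:
        print(repr(value), contains_html(value))
```

With this in place, plain strings never reach lxml, so there is nothing for it to wrap in a <p> tag.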

Answer 1

Try this library. It saved me from having to use the "re" module to process XML pages; for some silly reason Scrapy selectors were behaving erratically.

from w3lib.html import remove_tags

def parse(self, response):
    hxs = HtmlXPathSelector(response)
    follow = hxs.xpath('//loc').re('.*type=videos.*')
    follow = [remove_tags(x) for x in follow]
    # Note: remove_tags strips only the markup; whitespace such as \n is left in place

Prevent python lxml from adding <p> tag to plain text
