How can I use Scrapy to extract content from inside comment tags?
For example, how do I extract "Yellow" in the following example:
<div class="fruit">
<div class="infos">
<h2 class="Name">Banana</h2>
<span class="edible">Edible: Yes</span>
</div>
<!--
<p class="color">Yellow</p>
-->
</div>
Answer 0 (score: 5)
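You can use an XPath expression such as //comment() to get the comment content, then parse that content after stripping the comment markers.
Example scrapy shell session: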
paul@wheezy:~$ scrapy shell
...
In [1]: doc = """<div class="fruit">
...: <div class="infos">
...: <h2 class="Name">Banana</h2>
...: <span class="edible">Edible: Yes</span>
...: </div>
...: <!--
...: <p class="color">Yellow</p>
...: -->
...: </div>"""
In [2]: from scrapy.selector import Selector
In [4]: selector = Selector(text=doc, type="html")
In [5]: import re
In [6]: regex = re.compile(r'<!--(.*)-->', re.DOTALL)
In [7]: selector.xpath('//comment()').re(regex)
Out[7]: [u'\n <p class="color">Yellow</p>\n ']
In [8]: comment = selector.xpath('//comment()').re(regex)[0]
In [9]: commentsel = Selector(text=comment, type="html")
In [10]: commentsel.css('p.color')
Out[10]: [<Selector xpath=u"descendant-or-self::p[@class and contains(concat(' ', normalize-space(@class), ' '), ' color ')]" data=u'<p class="color">Yellow</p>'>]
In [11]: commentsel.css('p.color').extract()
Out[11]: [u'<p class="color">Yellow</p>']
In [12]: commentsel.css('p.color::text').extract()
Out[12]: [u'Yellow']
Answer 1 (score: 0)
First, use the XPath below to get all the comments on the page.
data = response.xpath('//comment()').extract()
Now use some key string to identify the comment you are after.
up_data = []
for d in data:
    if 'key' in d:
        up_data.append(d)
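Then define a template, strip the comment markers, and parse each matching comment:
from scrapy.selector import Selector
html_template = '<html><body>%s</body></html>'
for up_d in up_data:
    up_d = html_template % up_d.replace('<!--', '').replace('-->', '')
    sel = Selector(text=up_d)
    sel.xpath('//div[@class="table_outer_container"]')
    # Do what you want with the selection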
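The filter-and-unwrap steps in this answer can be pulled into small helpers that work on the plain strings returned by `.extract()`. This is a rough sketch using only the standard library; `unwrap_comment`, `comments_to_documents`, and the sample strings are hypothetical names and data, not part of Scrapy or of the original page.

```python
HTML_TEMPLATE = "<html><body>%s</body></html>"


def unwrap_comment(comment):
    """Remove only the outer <!-- and --> markers from one comment string."""
    inner = comment.strip()
    if inner.startswith("<!--"):
        inner = inner[4:]
    if inner.endswith("-->"):
        inner = inner[:-3]
    return inner


def comments_to_documents(comments, key):
    """Keep comments containing key and wrap their content in a full HTML page."""
    return [HTML_TEMPLATE % unwrap_comment(c) for c in comments if key in c]


# Made-up stand-in for response.xpath('//comment()').extract()
sample = [
    '<!-- <div class="table_outer_container">stats</div> -->',
    "<!-- unrelated note -->",
]
documents = comments_to_documents(sample, "table_outer_container")
print(documents)
```

Each resulting string can then be passed to `Selector(text=...)` and queried as in the answer. Slicing off the outer markers, instead of calling `replace` on the whole string, leaves any `<!--` text inside the comment body untouched.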