Question

My text has the following format (tags kept, text removed for readability):

<h2>...</h2>
  <p>...</p>
   .      .
   .      .
  <p>...</p>
<h2>...</h2>
  <ul>...</ul>
     <li> .. </li>
  ...
<h2>...</h2>
   <li> ..</li>

I am trying to use Scrapy to split/group the text by heading. As a first step, I need to get 3 groups of data out of the markup above.

from scrapy import Selector

sentence = "above text in the format"  # placeholder for the HTML shown above
sel = Selector(text=sentence)
# item = sel.xpath("//h2//text()")
item = sel.xpath("//h2/following-sibling::li/ul/p//text()").extract()

I get an empty array. Any help is appreciated.

Answer 1

I have this solution, which gets the job done easily:

import scrapy
from lxml import etree, html


class TagsSpider(scrapy.Spider):
    name = 'tags'
    start_urls = [
        'https://support.litmos.com/hc/en-us/articles/227739047-Sample-HTML-Header-Code'
    ]

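    def parse(self, response):
        # Append every <header> element found on the page to test.html.
        for header in response.xpath('//header'):
            with open('test.html', 'a+') as file:
                file.write(
                    # Re-parse the extracted markup with lxml so it can be pretty-printed.
                    etree.tostring(
                        html.fromstring(header.extract()),
                        encoding='unicode',
                        pretty_print=True,
                    )
                )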

With it I can get the header and everything inside the header.
