Some HTML elements are missing when scraping with Scrapy

Time: 2021-04-23 16:17:39

Tags: python html scrapy web-crawler

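I am trying to scrape some text from a website's HTML elements. It works fine most of the time, but for some reason Scrapy does not get all of the HTML elements that the browser's inspector shows. The content is static: I tried disabling JavaScript, and the missing elements still show up in my browser. The site's structure looks like this: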

<ul class="paragraph-title">...</ul>
<ul class="paragraph-title">
    <p>TEXT 1</p>
    <p class="list-item">TEXT 2</p>
    <p class="list-item">TEXT 3</p>
</ul>
<ul class="paragraph-title">
    <p>TEXT 4</p>
    <ol class="level-one"></ol>
    <ol class="level-two">
        <li class="level-two-item">TEXT 5</li>
        <li class="level-two-item">TEXT 6</li>
    </ol>
</ul>
<ul class="paragraph-title">...</ul>

Here is my Scrapy spider:

import scrapy


class MySpider(scrapy.Spider):
    name = "MySpider"
    start_urls = ['https://www.example.com']

    def parse(self, response):
        # One selector per <ul class="paragraph-title"> block
        entries = response.css('ul.paragraph-title')
        for entry in entries:
            yield {
                # All text nodes found inside this <ul>
                'text': entry.css('::text').getall()
            }

When I try entries[2].getall() in scrapy shell, I notice that Scrapy fails to find the ol and li tags inside the third ul:

['<ul class="paragraph-title"><p>TEXT 4</p></ul>']

How can I get "TEXT 5" and "TEXT 6" from the li tags?

1 Answer:

Answer 0: (score: 0)

You can use the parser directly (for example lxml.html) to check how it parses this HTML code:

lxml.html.tostring(lxml.html.parse('foo.html'))

In this case the result is:

<ul class="paragraph-title"> <p>TEXT 4</p> </ul><ol class="level-one"></ol> <ol class="level-two"> <li class="level-two-item">TEXT 5</li> <li class="level-two-item">TEXT 6</li> </ol>

So it does not appear to support an ol nested inside the ul here: the ul is closed early and the ol elements end up as its siblings. I don't know whether this is a bug or a deliberate decision.
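As a rough sketch of one possible workaround (not part of the original answer): if the parser moves the ol elements out of the ul, the li text can be read from the raw markup instead of from each ul selector. The example below uses Python's standard-library html.parser, which reports tags in the order they appear without rebuilding a tree. The names ListItemText, extract_list_items, and SAMPLE are hypothetical and exist only for illustration.

```python
from html.parser import HTMLParser


class ListItemText(HTMLParser):
    """Collect the text of every <li class="level-two-item"> element."""

    def __init__(self):
        super().__init__()
        self.items = []      # finished text of each matching <li>
        self._buffer = None  # text chunks while inside a matching <li>

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None
        classes = (dict(attrs).get("class") or "").split()
        if tag == "li" and "level-two-item" in classes:
            self._buffer = []

    def handle_data(self, data):
        # Only keep text that appears inside a matching <li>
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "li" and self._buffer is not None:
            self.items.append("".join(self._buffer).strip())
            self._buffer = None


def extract_list_items(html):
    """Return the text of all level-two list items found in the markup."""
    parser = ListItemText()
    parser.feed(html)
    parser.close()
    return parser.items


# Assumed sample, copied from the structure shown in the question
SAMPLE = """
<ul class="paragraph-title">
    <p>TEXT 4</p>
    <ol class="level-one"></ol>
    <ol class="level-two">
        <li class="level-two-item">TEXT 5</li>
        <li class="level-two-item">TEXT 6</li>
    </ol>
</ul>
"""

print(extract_list_items(SAMPLE))
```

Inside the spider, this could be called as extract_list_items(response.text). Querying the li elements from the whole response, for example response.css('li.level-two-item::text').getall(), may work as well, since the parsed output shown in the answer still contains them, just not inside the ul.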
