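I'm trying to scrape some text from a website's HTML elements. It works in most cases, but for some reason Scrapy does not pick up all of the HTML elements shown in the browser inspector. The content is static: I disabled JavaScript and the missing elements still show up in my browser. The site's structure looks like this: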
<ul class="paragraph-title">...</ul>
<ul class="paragraph-title">
<p>TEXT 1</p>
<p class="list-item">TEXT 2</p>
<p class="list-item">TEXT 3</p>
</ul>
<ul class="paragraph-title">
<p>TEXT 4</p>
<ol class="level-one"></ol>
<ol class="level-two">
<li class="level-two-item">TEXT 5</li>
<li class="level-two-item">TEXT 6</li>
</ol>
</ul>
<ul class="paragraph-title">...</ul>
Here is my Scrapy spider:
import scrapy


class MySpider(scrapy.Spider):
    name = "MySpider"
    start_urls = ['https://www.example.com']

    def parse(self, response):
        # One entry per <ul class="paragraph-title"> block
        entries = response.css('ul.paragraph-title')
        for entry in entries:
            yield {
                # All text nodes under the block, as the parser sees it
                'text': entry.css('::text').getall()
            }
When I try entries[2].getall() in scrapy shell, I see that Scrapy fails to find the ol and li tags inside the third ul:
['<ul class="paragraph-title"><p>TEXT 4</p></ul>']
How can I get "TEXT 5" and "TEXT 6" from the li tags?
Answer 0 (score: 0)
You can use an HTML parser directly (lxml.html, for example) to check how it parses this HTML. With the markup saved as foo.html:
lxml.html.tostring(lxml.html.parse('foo.html'))
In this case the result is:
<ul class="paragraph-title">
<p>TEXT 4</p>
</ul><ol class="level-one"></ol>
<ol class="level-two">
<li class="level-two-item">TEXT 5</li>
<li class="level-two-item">TEXT 6</li>
</ol>
So the parser does not support nesting ol inside this ul: it closes the ul right after the p, and the ol elements end up as siblings of the ul rather than children. I don't know whether this is a bug or a deliberate decision.
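One possible workaround, offered only as a sketch: since the tree-building parser moves the ol elements out of the ul, the text can be collected with an event-based tokenizer that never rebuilds the tree. The example below uses Python's standard-library html.parser instead of Scrapy's selectors (a swapped-in technique, not Scrapy's API). The names ParagraphTextCollector, extract_groups and SAMPLE are hypothetical, chosen for illustration, and the sample markup is the third ul from the question.

```python
from html.parser import HTMLParser


class ParagraphTextCollector(HTMLParser):
    """Group text by the <ul class="paragraph-title"> block it appears in."""

    def __init__(self):
        super().__init__()
        self.groups = []  # one list of text fragments per matching <ul>
        self.depth = 0    # <ul> nesting depth inside the current matching block

    def handle_starttag(self, tag, attrs):
        if tag != "ul":
            return
        if self.depth:
            # A nested <ul> inside a block we are already collecting
            self.depth += 1
        elif "paragraph-title" in (dict(attrs).get("class") or "").split():
            # Start of a new matching block
            self.groups.append([])
            self.depth = 1

    def handle_endtag(self, tag):
        if tag == "ul" and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        text = data.strip()
        # Keep only non-blank text that sits inside a matching block
        if self.depth and text:
            self.groups[-1].append(text)


def extract_groups(html):
    """Return a list of text-fragment lists, one per matching <ul>."""
    collector = ParagraphTextCollector()
    collector.feed(html)
    collector.close()
    return collector.groups


# Third <ul> from the question, used here as sample input
SAMPLE = """
<ul class="paragraph-title">
<p>TEXT 4</p>
<ol class="level-one"></ol>
<ol class="level-two">
<li class="level-two-item">TEXT 5</li>
<li class="level-two-item">TEXT 6</li>
</ol>
</ul>
"""

if __name__ == "__main__":
    print(extract_groups(SAMPLE))
```

Inside the spider's parse method, the same helper could be called as extract_groups(response.text) and each group yielded as an item. Because the tokenizer only reports start tags, end tags and text in document order, the li text stays attributed to the ul that encloses it in the source.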