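Using Scrapy's ItemLoader, I want to parse the first n characters of the text in an HTML element. The element contains several nested HTML elements, each of which may or may not contain text that is part of what I want to keep.
Here is an example setup:
Example HTML: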
<div class="about-copy">
<p>Developers trust Stack Overflow to help solve coding problems
and use Stack Overflow Jobs to find job opportunities. We’re
committed to making the internet a better place, and our products
aim to enrich the lives of developers as they grow and mature in
their careers.
</p>
<a href='...'></a>
<p>Founded in 2008, Stack Overflow sees 40 million visitors each month
and is the flagship site of the Stack Exchange network, home to 150+
Q&A sites dedicated to niche topics.
</p>
</div>
Parser code:
def parse_details(self, response):
    ...
    l = ItemLoader(item=Entry(), response=response)
    # this is presumably the portion of the code that is to be modified
    l.add_css('f_brief_summary', 'div.about-copy::text')
    ...
Expected result:
Developers trust Stack Overflow to help solve coding problems
and use Stack Overflow Jobs to find job opportunities. We’re
committed to making the internet a better place, and our products
aim to enrich the lives of developers as they grow and mature in
their careers. Founded in 2008, Stack Overflow
Is there a one-step way to do this with ItemLoader, or should I do the parsing manually and then add the text to the ItemLoader object with the 'add_value' method?
Answer 0 (score: 1)
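Instead of using the generic ItemLoader, create your own Loader class. You can then apply pre-processing and post-processing (input and output processors) to each field, or define default processors for all of them. See: Scrapy Item Loaders Guide
Add the following to the module where your Entry item is defined. Note that in the example below I use the "remove_tags" function instead of "::text" in your selector. With 'div.about-copy::text' you only get the text nodes that are direct children of the div, not the text inside the nested <p> elements.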
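import scrapy
from scrapy.loader import ItemLoader
# On newer Scrapy releases these processors live in itemloaders.processors.
from scrapy.loader.processors import MapCompose, Join
from w3lib.html import remove_tags


# You can do so much better here!
def format_me(x):
    # Turn newlines into spaces, collapse double spaces, trim the ends.
    return x.replace('\n', ' ').replace('  ', ' ').strip()


# Here is the Loader you need to add; mine only covers one field.
class EntryLoader(ItemLoader):
    # Input processor: strip the tags, then tidy the whitespace of each value.
    f_brief_summary_in = MapCompose(remove_tags, format_me)
    # Output processor: join the collected values into one string.
    f_brief_summary_out = Join()


# You already have this; mine only covers one field.
class Entry(scrapy.Item):
    f_brief_summary = scrapy.Field()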
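The comment on format_me says it can be improved. One possible improvement, as a minimal sketch using only the standard library, is to collapse every run of whitespace with a regular expression rather than chaining replace calls, which leaves extra spaces behind when three or more occur in a row. The name normalize_whitespace is hypothetical; it is meant as a drop-in for format_me inside MapCompose.

```python
import re


def normalize_whitespace(text):
    """Collapse every run of whitespace (spaces, tabs, newlines) into one space."""
    return re.sub(r'\s+', ' ', text).strip()


# Roughly what remove_tags leaves behind for nested block elements:
sample = "\nDevelopers trust\nStack Overflow.\n\n\n   Founded in 2008\n"
print(normalize_whitespace(sample))
```

In the loader this would be used as f_brief_summary_in = MapCompose(remove_tags, normalize_whitespace).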
The EntryLoader above should give you the result you want. To test it:
Save your sample snippet to a file, e.g. example.html
Run scrapy shell:
scrapy shell './example.html'
In the shell, import your Item and Loader:
from scrapyproj.entry_module import EntryLoader, Entry
Test the parser:
entry_loader = EntryLoader(item=Entry(), response=response)
entry_loader.add_css('f_brief_summary', 'div.about-copy')
entry_loader.load_item()
Output:
{'f_brief_summary': 'Developers trust Stack Overflow to help solve coding '
'problems and use Stack Overflow Jobs to find job '
'opportunities. We’re committed to making the internet a '
'better place, and our products aim to enrich the lives '
'of developers as they grow and mature in their '
'careers. Founded in 2008, Stack Overflow sees 40 '
'million visitors each month and is the flagship site of '
'the Stack Exchange network, home to 150+ Q&A sites '
'dedicated to niche topics.'}
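The output above is the full cleaned text, not yet the first n characters the question asks for. One way to add that step, sketched here under assumptions, is a small callable that keeps at most n characters and backs up to the previous space when the cut lands mid-word. The class name FirstChars and the limit values are hypothetical. The wiring shown in the trailing comment assumes the Compose processor that ships alongside MapCompose and Join.

```python
class FirstChars:
    """Keep at most `limit` characters, preferring to cut at a word boundary."""

    def __init__(self, limit):
        self.limit = limit

    def __call__(self, text):
        if len(text) <= self.limit:
            return text
        cut = text[:self.limit]
        # If the cut landed in the middle of a word, back up to the last space.
        if not text[self.limit].isspace() and ' ' in cut:
            cut = cut[:cut.rfind(' ')]
        return cut.rstrip()


text = "Founded in 2008, Stack Overflow sees 40 million visitors each month"
print(FirstChars(33)(text))

# Possible wiring in the loader (assumption, not tested against a live spider):
#     f_brief_summary_out = Compose(Join(), FirstChars(400))
```

Because the truncation runs in the output processor, it applies once to the joined string, so the limit counts characters across all nested elements rather than per element.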