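I'm trying my hand at web scraping. I'm excited about the progress so far, but I've hit one problem: the data model of the source site doesn't match my current Scrapy output.
The source provides category, type, and URL data: each category contains several types, and each type has one URL.
I'd like output that preserves this nesting, where each row corresponds to one category, type, and URL grouping.
Both the XML and CSV outputs give one row per unique category, but each category row holds all of its types and URLs in its columns.
Source / sample site: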
<div class="box">
<div class="coin-img coin-imgfile--9999 coin-img-3"></div>
<div class="coin-heading">
<h3>Half-Cents and Cents</h3>
</div>
<ul>
<li><a href="/auctionprices/category/liberty-cap-half-cent-1793-1797/34">Liberty Cap Half Cent (1793-1797)</a></li>
<li><a href="/auctionprices/category/draped-bust-half-cent-1800-1808/653">Draped Bust Half Cent (1800-1808)</a></li>
<li><a href="/auctionprices/category/classic-head-half-cent-1809-1836/654">Classic Head Half Cent (1809-1836)</a></li>
</ul>
</div>
<div class="box">
<div class="coin-img coin-imgfile--9999 coin-img-5"></div>
<div class="coin-heading">
<h3>Two and Three Cents</h3>
</div>
<ul>
<li><a href="/auctionprices/category/two-cent-1864-1873/670">Two Cent (1864-1873)</a></li>
<li><a href="/auctionprices/category/three-cent-silver-1851-1873/77">Three Cent Silver (1851-1873)</a></li>
<li><a href="/auctionprices/category/three-cent-nickel-1865-1889/671">Three Cent Nickel (1865-1889)</a></li>
</ul>
</div>
The working spider scrapes all the necessary data, but not in the required format:
import scrapy

class PCGSSpider(scrapy.Spider):
    name = "pcgs_spider"
    custom_settings = {
        'FEED_FORMAT': 'xml',
        'FEED_URI': 'pcgsspider.xml'
    }
    start_urls = ['abovesample.html']

    def parse(self, response):
        SET_SELECTOR = '.box'
        for pcgs in response.css(SET_SELECTOR):
            CAT_SELECTOR = 'h3 ::text'
            TYPE_SELECTOR = './/ul/li/a/text()'
            URL_SELECTOR = './/ul/li/a/@href'
            yield {
                'categories': pcgs.css(CAT_SELECTOR).extract(),
                'types': pcgs.xpath(TYPE_SELECTOR).extract(),
                'type_url': pcgs.xpath(URL_SELECTOR).extract(),
            }
The XML shows the correct data, but it doesn't nest each URL under its type and each type under its category:
<item>
  <categories>
    <value>Half-Cents and Cents</value>
  </categories>
  <types>
    <value>Liberty Cap Half Cent (1793-1797)</value>
    <value>Draped Bust Half Cent (1800-1808)</value>
    <value>Classic Head Half Cent (1809-1836)</value>
  </types>
  <type_url>
    <value>/auctionprices/category/lincoln-cent-wheat-reverse-1909-1958/46</value>
    <value>/auctionprices/category/lincoln-cent-modern-1959-date/47</value>
    <value>/auctionprices/category/lincoln-cent-modern-1959-date/47</value>
  </type_url>
</item>
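I'm very new to this, so please forgive any ignorance. It seems this could be solved with some degree of iteration, though I'm not sure whether the spider is the best place to fix the data and its structure.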
Answer 0 (score: 0)
You have to use the extract_first() method on the category field instead of extract(). Here is a sample of the output I got from scraping PCGS:
<items>
<item><categories>Half-Cents and Cents</categories><types>Liberty Cap Half Cent (1793-1797)</types><type_url>/auctionprices/category/liberty-cap-half-cent-1793-1797/34</type_url></item>
<item><categories>Two and Three Cents</categories><types>Two Cent (1864-1873)</types><type_url>/auctionprices/category/two-cent-1864-1873/670</type_url></item>
<item><categories>Nickels</categories><types>Shield Nickel (1866-1883)</types><type_url>/auctionprices/category/shield-nickel-1866-1883/81</type_url></item>
</items>
Hope this is what you were looking for.
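If the types and URLs are still extracted as lists, one possible way to get a row per type is to pair them up with zip() after extraction. The following is only a rough sketch under that assumption, not part of the answer above: flatten_item is a hypothetical helper name, and the sample dict is hand-copied from the HTML in the question rather than produced by a real crawl.

```python
def flatten_item(item):
    """Yield one row per (category, type, url) from a list-valued item.

    Hypothetical helper: `item` is assumed to have the keys the question's
    spider yields ('categories', 'types', 'type_url').
    """
    category = item["categories"]
    # Accept a scalar (as from extract_first()) or a list (as from extract()).
    if isinstance(category, list):
        category = category[0] if category else None
    # zip() pairs the n-th type with the n-th URL; it stops at the shorter list.
    for type_name, url in zip(item["types"], item["type_url"]):
        yield {"category": category, "type": type_name, "url": url}


# Sample item hand-copied from the HTML in the question.
sample = {
    "categories": ["Two and Three Cents"],
    "types": ["Two Cent (1864-1873)", "Three Cent Silver (1851-1873)"],
    "type_url": [
        "/auctionprices/category/two-cent-1864-1873/670",
        "/auctionprices/category/three-cent-silver-1851-1873/77",
    ],
}
rows = list(flatten_item(sample))
```

This relies on the two lists being in the same order and of the same length, which holds for the sample markup but is worth verifying on the real pages.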
Answer 1 (score: 0)
The only way I see is to duplicate the CATEGORY value for every link, alongside that link's type and URL:
import scrapy

class PCGSSpider(scrapy.Spider):
    name = "pcgs_spider"
    custom_settings = {
        'FEED_FORMAT': 'xml',
        'FEED_URI': 'pcgsspider.xml'
    }
    start_urls = ['abovesample.html']

    def parse(self, response):
        for div_box in response.css("div.box"):
            # One category heading per box; reuse it for every link in the box.
            category = div_box.css("h3 ::text").extract_first()
            for li in div_box.css("ul li"):
                # Yield one item per link so each row is a single
                # (category, type, url) triple.
                yield {
                    'category': category,
                    'type': li.css("a ::text").extract_first(),
                    'url': li.css("a ::attr(href)").extract_first(),
                }
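If the nesting itself matters more than having one flat row per link, the rows could also be regrouped afterwards. This is just a minimal sketch, assuming rows with the 'category', 'type' and 'url' keys used in the spider above; nest_rows is a hypothetical name, and the sample rows are hand-copied from the HTML in the question.

```python
from collections import defaultdict


def nest_rows(rows):
    """Group flat rows into one entry per category with nested type/url pairs.

    Hypothetical helper; expects dicts with 'category', 'type' and 'url' keys.
    """
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["category"]].append({"type": row["type"], "url": row["url"]})
    # Dicts keep insertion order (Python 3.7+), so categories stay in page order.
    return [{"category": cat, "types": types} for cat, types in grouped.items()]


# Sample rows hand-copied from the HTML in the question.
flat_rows = [
    {"category": "Half-Cents and Cents",
     "type": "Liberty Cap Half Cent (1793-1797)",
     "url": "/auctionprices/category/liberty-cap-half-cent-1793-1797/34"},
    {"category": "Half-Cents and Cents",
     "type": "Draped Bust Half Cent (1800-1808)",
     "url": "/auctionprices/category/draped-bust-half-cent-1800-1808/653"},
    {"category": "Two and Three Cents",
     "type": "Two Cent (1864-1873)",
     "url": "/auctionprices/category/two-cent-1864-1873/670"},
]
nested = nest_rows(flat_rows)
```

A nested structure like this maps more naturally onto JSON or XML than onto CSV, where the flat one-row-per-link form is usually easier to work with.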