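I am using Python Scrapy to scrape content from a web page, and I want to get the content in the following order:
Living room, link of Chair
Living room, link of Sofa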
...
Bed room, link of Bed
Bed room, link of Mirror
...
Right now the URLs are correct, but every sub_cat printed in parse_item_info is Living room. When I print sub_cat in parse_item instead, I do get all the sub-categories.
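I think the problem is that the <li> tags inside the <ul> are being fetched twice. How can I pair each link with its sub-category correctly, one to one?
Thanks.
HTML: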
<div class="row margin-b2">
  <div class="col">
    <ul class="list-unstyled">
      <li class="font-size-3 color-yellow main-list-li">
        Living room
      </li>
      <ul class="list-inline main-list-ul">
        <li class="list-inline-item main-list-li-w align-text-top">
          <a href="https://www.website.com/furiture/536" class="text-light">Chair</a>
        </li>
        ...
      </ul>
      <ul class="list-inline main-list-ul">
        <li class="list-inline-item main-list-li-w align-text-top">
          <a href="https://www.website.com/furiture/537" class="text-light">Sofa</a>
        </li>
        ...
      </ul>
      <li class="font-size-3 color-yellow main-list-li">
        Bed room
      </li>
      <ul class="list-inline main-list-ul">
        <li class="list-inline-item main-list-li-w align-text-top">
          <a href="https://www.website.com/furiture/538" class="text-light">Bed</a>
        </li>
        ...
      </ul>
      <ul class="list-inline main-list-ul">
        <li class="list-inline-item main-list-li-w align-text-top">
          <a href="https://www.website.com/furiture/539" class="text-light">Mirror</a>
        </li>
        ...
      </ul>
      ... </ul>
    </ul>
  </div>
</div>
Python:
def parse_item(self, response):
    cat = response.meta["cat"]
    out_box = response.xpath('//div[@class="row margin-b2"]')
    # get all sub categories first
    sub_cat_arr = []
    for box in out_box.xpath('//li[@class="font-size-3 color-yellow main-list-li"]'):
        sub_cat = box.xpath('./text()').extract()[0].strip()
        sub_cat_arr.append(sub_cat)
    i = 0
    for box in out_box.xpath('//ul[@class="list-inline main-list-ul"]'):
        sub_cat = sub_cat_arr[i]
        i += 1
        print("in......")
        print(sub_cat)
        for url_box in box.xpath('//li[@class="list-inline-item main-list-li-w align-text-top"]//a'):
            new_url = url_box.xpath('.//@href').extract()[0]
            yield scrapy.Request(new_url, meta={"url": new_url, "cat": cat, "sub_cat": sub_cat}, callback=self.parse_item_info)

def parse_item_info(self, response):
    cat = response.meta["cat"]
    sub_cat = response.meta["sub_cat"]
    url = response.meta["url"]
    print(sub_cat)
    print(url)
    ...
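Answer 0 (score: 0)
To avoid processing the same tag twice, use a single loop that walks the headings and the links together, remembering the most recent heading: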
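def parse_item(self, response):
    cat = response.meta["cat"]
    current_sub_cat = None  # most recent sub-category heading seen so far
    # One selector matches both headings and links, so they are visited in a single pass.
    for tag in response.css("ul.list-unstyled li.font-size-3.color-yellow.main-list-li, ul.list-inline.main-list-ul li a"):
        if tag.root.tag == "li":
            # Heading <li>: remember it for the links that follow.
            current_sub_cat = tag.css("*::text").extract_first("").strip("\n ")
        elif tag.root.tag == "a":
            new_url = tag.css("*::attr(href)").extract_first()
            # Store the URL under "url", the key that parse_item_info reads from response.meta.
            yield scrapy.Request(url=new_url, meta={"url": new_url, "sub_cat": current_sub_cat, "cat": cat}, callback=self.parse_item_info)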
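
The single-pass idea can also be checked outside of a spider. The following is a minimal sketch that uses the standard library's html.parser instead of Scrapy selectors, so it runs without any crawl. SAMPLE_HTML, SubCategoryPairer and pair_links are hypothetical names introduced only for this illustration, and the sample markup is a shortened copy of the HTML in the question.

```python
from html.parser import HTMLParser

# Hypothetical sample: a shortened copy of the markup from the question.
SAMPLE_HTML = """
<ul class="list-unstyled">
  <li class="font-size-3 color-yellow main-list-li">
    Living room
  </li>
  <ul class="list-inline main-list-ul">
    <li class="list-inline-item main-list-li-w align-text-top">
      <a href="https://www.website.com/furiture/536" class="text-light">Chair</a>
    </li>
  </ul>
  <ul class="list-inline main-list-ul">
    <li class="list-inline-item main-list-li-w align-text-top">
      <a href="https://www.website.com/furiture/537" class="text-light">Sofa</a>
    </li>
  </ul>
  <li class="font-size-3 color-yellow main-list-li">
    Bed room
  </li>
  <ul class="list-inline main-list-ul">
    <li class="list-inline-item main-list-li-w align-text-top">
      <a href="https://www.website.com/furiture/538" class="text-light">Bed</a>
    </li>
  </ul>
</ul>
"""


class SubCategoryPairer(HTMLParser):
    """Walk the document once and attach each link to the latest heading."""

    def __init__(self):
        super().__init__()
        self.pairs = []              # collected (sub_cat, url) tuples
        self.current_sub_cat = None  # most recent heading text
        self._in_heading = False     # True while inside a heading <li>
        self._heading_text = []      # text chunks of the current heading

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "li" and "color-yellow" in classes:
            # A heading <li> starts: begin buffering its text.
            self._in_heading = True
            self._heading_text = []
        elif tag == "a" and attrs.get("href") and self.current_sub_cat is not None:
            # A link: pair it with the heading seen most recently.
            self.pairs.append((self.current_sub_cat, attrs["href"]))

    def handle_data(self, data):
        if self._in_heading:
            self._heading_text.append(data)

    def handle_endtag(self, tag):
        if tag == "li" and self._in_heading:
            # The heading <li> ends: its stripped text becomes the current sub-category.
            self.current_sub_cat = "".join(self._heading_text).strip()
            self._in_heading = False


def pair_links(html):
    """Return (sub_cat, url) tuples in document order."""
    parser = SubCategoryPairer()
    parser.feed(html)
    parser.close()
    return parser.pairs


if __name__ == "__main__":
    for sub_cat, url in pair_links(SAMPLE_HTML):
        print(sub_cat, url)
```

Because every link is paired with whatever heading came last in document order, no index into a separate list of sub-categories is needed, which is the same reasoning the Scrapy loop above relies on.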