I'm fairly new to Python and Scrapy. What I want to do is crawl a number of sites, mostly company websites: crawl the whole domain and extract every h1, h2 and h3, then create one record per domain containing the domain name and a single string holding all of the h1/h2/h3 text from that domain. So essentially there is a Domain item plus one big string of all the headings.
The output I want is DOMAIN, STRING(h1,h2,h3) - collected from every URL on that domain.
The problem I'm running into is that every URL ends up in its own item. I know I haven't gotten very far yet, but a hint in the right direction would be much appreciated. Basically, how do I create an outer loop so that the yield statement keeps running until the next domain starts?
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.loader import XPathItemLoader
from Autotask_Prospecting.items import WebsiteItem

class MySpider(CrawlSpider):
    name = 'Webcrawler'
    allowed_domains = [l.strip() for l in open('Domains.txt').readlines()]
    start_urls = [l.strip() for l in open('start_urls.txt').readlines()]

    rules = (
        # Follow every link within the allowed domains and parse each
        # page with parse_item (follow defaults to False when a callback
        # is given, so it must be set explicitly to crawl the full domain).
        Rule(SgmlLinkExtractor(), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        # Collect all h1/h2/h3 text on this page into one item.
        loader = XPathItemLoader(item=WebsiteItem(), response=response)
        loader.add_xpath('h1', "//h1/text()")
        loader.add_xpath('h2', "//h2/text()")
        loader.add_xpath('h3', "//h3/text()")
        yield loader.load_item()
Answer 0 (score: 1)
"the yield statement keeps running until the next domain starts"
This can't be done: pages are fetched in parallel, so there is no way to crawl domains serially. What you can do is write a pipeline that accumulates the items and yields the whole structure in close_spider, e.g.:
import collections
from scrapy.item import Item, Field

# this assumes your item looks like the following:
class MyItem(Item):
    domain = Field()
    hs = Field()

class DomainPipeline(object):

    def __init__(self):
        # one set of heading strings per domain
        self.accumulator = collections.defaultdict(set)

    def process_item(self, item, spider):
        # merge this page's headings into its domain's set
        self.accumulator[item['domain']].update(item['hs'])
        return item

    def close_spider(self, spider):
        # note: Scrapy ignores values returned from close_spider, so in
        # practice you would write the merged items out here instead
        for domain, hs in self.accumulator.items():
            yield MyItem(domain=domain, hs=hs)
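
For the pipeline to take effect it also has to be enabled in the project settings. A minimal sketch, assuming the class above lives in Autotask_Prospecting/pipelines.py (the module path is an assumption based on the project name in the question; newer Scrapy versions expect a dict mapping the path to an order number instead of a list):

# settings.py
ITEM_PIPELINES = ['Autotask_Prospecting.pipelines.DomainPipeline']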
Usage:
>>> from scrapy.item import Item, Field
>>> class MyItem(Item):
...     domain = Field()
...     hs = Field()
...
>>> from collections import defaultdict
>>> accumulator = defaultdict(set)
>>> items = []
>>> for i in range(10):
...     items.append(MyItem(domain='google.com', hs=[str(i)]))
...
>>> items
[{'domain': 'google.com', 'hs': ['0']}, {'domain': 'google.com', 'hs': ['1']}, {'domain': 'google.com', 'hs': ['2']}, {'domain': 'google.com', 'hs': ['3']}, {'domain': 'google.com', 'hs': ['4']}, {'domain': 'google.com', 'hs': ['5']}, {'domain': 'google.com', 'hs': ['6']}, {'domain': 'google.com', 'hs': ['7']}, {'domain': 'google.com', 'hs': ['8']}, {'domain': 'google.com', 'hs': ['9']}]
>>> for item in items:
...     accumulator[item['domain']].update(item['hs'])
...
>>> accumulator
defaultdict(<type 'set'>, {'google.com': set(['1', '0', '3', '2', '5', '4', '7', '6', '9', '8'])})
>>> for domain, hs in accumulator.items():
...     print MyItem(domain=domain, hs=hs)
...
{'domain': 'google.com',
'hs': set(['0', '1', '2', '3', '4', '5', '6', '7', '8', '9'])}
>>>
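
One missing piece: the pipeline keys on item['domain'], but parse_item in the question never fills that field in. A minimal sketch of how it could be derived from the response URL, using the MyItem fields above (collapsing the separate h1/h2/h3 fields into the single hs field is an assumption for this example):

from urlparse import urlparse  # Python 2, matching the rest of the answer

def parse_item(self, response):
    loader = XPathItemLoader(item=MyItem(), response=response)
    # key every page's item by the domain it was scraped from
    loader.add_value('domain', urlparse(response.url).netloc)
    # collect all heading text into one field
    loader.add_xpath('hs', "//h1/text() | //h2/text() | //h3/text()")
    yield loader.load_item()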