First post here, and I'm new to this. I'm trying to crawl the pages linked from this .htm URL: https://www.bls.gov/bls/news-release/empsit.htm#2008
and extract two data points from each crawled page, saving each iteration to a CSV file with Scrapy.
Currently I have:
import scrapy
from scrapy.spiders import Rule
from scrapy.linkextractors import LinkExtractor

class EmpsitSpider(scrapy.Spider):
    name = "EmpsitSpider"
    allowed_domains = ["bls.gov"]
    start_urls = [
        'https://www.bls.gov/bls/news-release/empsit.htm#2008'
    ]
    rules = [
        Rule(
            LinkExtractor(allow_domains=("bls.gov"), restrict_xpaths=('//div[@id="bodytext"]/a[following-sibling::text()[contains(., ".htm")]]')),
            follow=True, callback="parse_items"),
    ]

    def parse_items(self, response):
        self.logger.info("bls item page %s", response.url)
        item = scrapy.Item()
        item["SA"] = response.xpath(('//*[@id="ces_table1"]/tbody/tr[138]/td[8]/span')[0].text).extract()
        item["NSA"] = response.xpath(tree.xpath('//*[@id="ces_table1"]/tbody/tr[138]/td[4]/span')[0].text).extract()
        return item
and then run it with:
scrapy crawl EmpsitSpider -o data.csv
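From the Scrapy docs, my understanding is that the rules attribute is only honoured by CrawlSpider subclasses, not plain scrapy.Spider, so I assume the link-following part would need to look roughly like the sketch below. The class name, the allow pattern (based on the archive URL format further down), and follow=False are all my guesses, not something I have working; for now parse_items only logs each page so I can confirm the crawl actually reaches the release pages.

# Sketch only: assumes the rules attribute requires CrawlSpider, and that the
# archive pages match the empsit_*.htm pattern seen in the URL below.
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class EmpsitArchiveSpider(CrawlSpider):  # hypothetical name
    name = "EmpsitArchiveSpider"
    allowed_domains = ["bls.gov"]
    start_urls = ["https://www.bls.gov/bls/news-release/empsit.htm#2008"]

    rules = (
        Rule(
            LinkExtractor(
                allow=r"news\.release/archives/empsit_\d+\.htm",
                restrict_xpaths='//div[@id="bodytext"]',
            ),
            callback="parse_items",
            follow=False,  # the archive pages are leaves, so no need to follow further
        ),
    )

    def parse_items(self, response):
        # Just confirm the crawler reaches each archive page for now.
        self.logger.info("bls item page %s", response.url)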
I've hit a wall: I can't get the spider to iterate through the HTML pages. I can pull the two data points from an individual page using their XPaths and lxml:
from lxml import html
import requests
page = requests.get('https://www.bls.gov/news.release/archives/empsit_01042019.htm')
tree = html.fromstring(page.content)
#part time for economic reasons, seasonally adjusted
SA = tree.xpath('//*[@id="ces_table1"]/tbody/tr[138]/td[8]/span')[0].text
print(SA)
#part time for economic reasons, not seasonally adjusted
NSA = tree.xpath('//*[@id="ces_table1"]/tbody/tr[138]/td[4]/span')[0].text
print(NSA)
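For reference, I assume the Scrapy-selector equivalent of these two lxml calls, sitting inside the spider's callback, would look something like the following (untested in the spider; the /text() step and extract_first() are my guess at the idiomatic form, and yielding a dict is just to sidestep declaring Item fields):

    # My guess at parse_items using Scrapy's own selectors instead of lxml
    # (same XPaths as above, with /text() so the cell text comes back directly).
    def parse_items(self, response):
        sa = response.xpath('//*[@id="ces_table1"]/tbody/tr[138]/td[8]/span/text()').extract_first()
        nsa = response.xpath('//*[@id="ces_table1"]/tbody/tr[138]/td[4]/span/text()').extract_first()
        # Yielding a plain dict should let -o data.csv export one row per page.
        yield {"SA": sa, "NSA": nsa}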
But I just can't get Scrapy to work its way through the URLs. Any ideas on how to proceed? Thanks for the help.