First post here, and I'm new to this. I'm trying to crawl the pages linked from this .htm URL: https://www.bls.gov/bls/news-release/empsit.htm#2008
and extract two data points from each crawled page, saving each iteration to a CSV file with Scrapy.
Currently I have:
import scrapy
from scrapy.spiders import Rule
from scrapy.linkextractors import LinkExtractor

class EmpsitSpider(scrapy.Spider):
    name = "EmpsitSpider"
    allowed_domains = ["bls.gov"]
    start_urls = [
        'https://www.bls.gov/bls/news-release/empsit.htm#2008'
    ]
    rules = [
        Rule(
            LinkExtractor(allow_domains=("bls.gov"), restrict_xpaths=('//div[@id="bodytext"]/a[following-sibling::text()[contains(., ".htm")]]')),
            follow=True, callback="parse_items"),
    ]

    def parse_items(self, response):
        self.logger.info("bls item page %s", response.url)
        item = scrapy.Item()
        item["SA"] = response.xpath(('//*[@id="ces_table1"]/tbody/tr[138]/td[8]/span')[0].text).extract()
        item["NSA"] = response.xpath(tree.xpath('//*[@id="ces_table1"]/tbody/tr[138]/td[4]/span')[0].text).extract()
        return item
and then run it with:
scrapy crawl EmpsitSpider -o data.csv
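From the Scrapy docs, my understanding is that the rules attribute is only honoured by CrawlSpider subclasses, not plain scrapy.Spider, so I assume the link-following part would need to look roughly like the sketch below. The class name, the allow pattern (based on the archive URL format further down), and follow=False are all my guesses, not something I have working; for now parse_items only logs each page so I can confirm the crawl actually reaches the release pages.

# Sketch only: assumes the rules attribute requires CrawlSpider, and that the
# archive pages match the empsit_*.htm pattern seen in the URL below.
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class EmpsitArchiveSpider(CrawlSpider):  # hypothetical name
    name = "EmpsitArchiveSpider"
    allowed_domains = ["bls.gov"]
    start_urls = ["https://www.bls.gov/bls/news-release/empsit.htm#2008"]

    rules = (
        Rule(
            LinkExtractor(
                allow=r"news\.release/archives/empsit_\d+\.htm",
                restrict_xpaths='//div[@id="bodytext"]',
            ),
            callback="parse_items",
            follow=False,  # the archive pages are leaves, so no need to follow further
        ),
    )

    def parse_items(self, response):
        # Just confirm the crawler reaches each archive page for now.
        self.logger.info("bls item page %s", response.url)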
I've hit a wall: I can't get the spider to iterate through the HTML pages. I can pull the two data points from an individual page using their XPaths and lxml:
from lxml import html
import requests
page = requests.get('https://www.bls.gov/news.release/archives/empsit_01042019.htm')
tree = html.fromstring(page.content)
#part time for economic reasons, seasonally adjusted
SA = tree.xpath('//*[@id="ces_table1"]/tbody/tr[138]/td[8]/span')[0].text
print(SA)
#part time for economic reasons, not seasonally adjusted
NSA = tree.xpath('//*[@id="ces_table1"]/tbody/tr[138]/td[4]/span')[0].text
print(NSA)
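For reference, I assume the Scrapy-selector equivalent of these two lxml calls, sitting inside the spider's callback, would look something like the following (untested in the spider; the /text() step and extract_first() are my guess at the idiomatic form, and yielding a dict is just to sidestep declaring Item fields):

    # My guess at parse_items using Scrapy's own selectors instead of lxml
    # (same XPaths as above, with /text() so the cell text comes back directly).
    def parse_items(self, response):
        sa = response.xpath('//*[@id="ces_table1"]/tbody/tr[138]/td[8]/span/text()').extract_first()
        nsa = response.xpath('//*[@id="ces_table1"]/tbody/tr[138]/td[4]/span/text()').extract_first()
        # Yielding a plain dict should let -o data.csv export one row per page.
        yield {"SA": sa, "NSA": nsa}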
But I just can't get Scrapy to work its way through the URLs. Any ideas on how to proceed? Thanks for the help.