Python Scrapy-将抓取的数据从第一个URL传递到第二个URL并抓取数据

时间:2019-11-27 12:55:25

标签: python web-scraping scrapy

我看过StackOverflow上的几篇文章,但仍然不太了解如何做到这一点。在Scrapy中,我从一个URL上抓取书籍。对于要抓取的书籍的每条记录,我想将其传递到另一个网站的搜索字段,并从该网站中获取特定元素。但是,Scrapy似乎停留在第一个网站上,而没有从第二个网站检索结果。 (我基本上是在Scrapy中尝试复制使用Selenium可以轻松完成的操作-使用Selenium&BS则要慢一些。)

import scrapy
import pandas as pd
from datetime
import datetime
from timeit
import default_timer as timer
from fake_useragent
import UserAgent


start = timer()
d1 = datetime.now()

book = []
country = []


stmp = []
items = []
today = datetime.now()
tt = today.strftime('%Y-%m-%d_%H_%M_%S')

class CoinsSpider(scrapy.Spider):
  name = "proxies"
custom_settings = {
  'DOWNLOAD_DELAY': 3,
  'CONCURRENT_REQUESTS_PER_DOMAIN': 3,
  'HTTPCACHE_ENABLED': True,
  'DOWNLOADER_MIDDLEWARES': {
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'scrapy_fake_useragent.middleware.RandomUserAgentMiddleware': 400,
  },
  'DEFAULT_REQUEST_HEADERS': {
    'Referer': 'http://www.google.com'
  }
}

def start_requests(self):
  url = "https://www.book.net/"
  yield scrapy.Request(url = url, callback = self.parse)

def parse(self, response):
  for row in response.css("#booklisttable > tbody:nth-child(1) tr"):
    b = row.css('.title::text')[0].extract()
    book.append(b)

try:
  request = scrapy.Request('https://www.searchbook.com/international',
  callback = self.parse_gua,cb_kwargs = dict(main_url = response.url))
  yield request

  print(response.headers)
  print(response.css)

  check = response.css('.fc-today__dayofmonth::text').extract()
  print(check)
except:
  pass


crts = row.css('.country::text')[0].extract()
country.append(crts)

tstmp = str(d1)
stmp.append(tstmp)

item = {
  "Title": book,
  "Country": country,
  "Timex": stmp,
}

test_df = pd.DataFrame.from_dict(item, orient = 'columns').replace('\n', '', regex = True)
test_df['Joined'] = test_df['Title'] + ':' + test_df['Country']

items.append(test_df)

result = pd.concat([pd.DataFrame(items[i]) for i in range(len(items))], ignore_index = True)
with open('my_books_' + tt + '.csv', 'a', newline = '') as f:
  result.to_csv(f)

print("Completed")
end = timer()
elapse = end - start
print("It Took" + str(elapse))

很抱歉,我的问题是基本问题,但是如果有人可以指向文档中的链接,或者是一个很好的示例!

0 个答案:

没有答案