How do I set Scrapy's user agent for Splash in the same way as below:
import requests
from bs4 import BeautifulSoup
ua = {"User-Agent":"Mozilla/5.0"}
url = "http://www.example.com"
page = requests.get(url, headers=ua)
soup = BeautifulSoup(page.text, "lxml")
My spider looks similar to this:
import scrapy
from scrapy_splash import SplashRequest

class ExampleSpider(scrapy.Spider):
    name = "example"
    allowed_domains = ["example.com"]
    start_urls = ["https://www.example.com/"]

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(
                url,
                self.parse,
                args={'wait': 0.5}
            )
Answer 0 (score: 2)
You need to set the user_agent attribute to override the default user agent:
class ExampleSpider(scrapy.Spider):
    name = 'example'
    user_agent = 'Mozilla/5.0'

In this case, UserAgentMiddleware (enabled by default) will override the USER_AGENT setting value with 'Mozilla/5.0'.
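For completeness, the project-wide default that this spider attribute overrides is the USER_AGENT entry in the Scrapy settings; a minimal sketch of the equivalent settings.py line (the value simply mirrors the question):

# settings.py
USER_AGENT = 'Mozilla/5.0'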
You can also override the headers on a per-request basis:
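A minimal sketch of what that per-request override could look like in start_requests; this assumes the headers keyword is passed through SplashRequest like a normal scrapy.Request, and whether Splash forwards the header to the target site depends on the endpoint used:

import scrapy
from scrapy_splash import SplashRequest

class ExampleSpider(scrapy.Spider):
    name = "example"
    start_urls = ["https://www.example.com/"]

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(
                url,
                self.parse,
                args={'wait': 0.5},
                # per-request header that overrides the spider-level default
                headers={'User-Agent': 'Mozilla/5.0'},
            )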
Answer 1 (score: 2)
The correct way is to change the Splash script to include it... however, if it also works fine there is no need to add it to the spider.
http://splash.readthedocs.io/en/stable/scripting-ref.html?highlight=agent
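A minimal sketch of that approach, assuming Splash's execute endpoint and its set_user_agent scripting call; the Lua script and the ua argument are illustrative, not part of the original question:

import scrapy
from scrapy_splash import SplashRequest

# Lua script run by Splash; it sets the User-Agent before navigating
lua_script = """
function main(splash, args)
    splash:set_user_agent(args.ua)
    assert(splash:go(args.url))
    assert(splash:wait(0.5))
    return splash:html()
end
"""

class ExampleSpider(scrapy.Spider):
    name = "example"
    start_urls = ["https://www.example.com/"]

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(
                url,
                self.parse,
                endpoint='execute',
                args={'lua_source': lua_script, 'ua': 'Mozilla/5.0'},
            )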