This is my Google search results scraping code.
import scrapy
import pandas as pd

from ..items import GoogleScraperItem  # adjust to wherever your item class is defined

class GoogleBotsSpider(scrapy.Spider):
    name = 'GoogleScrapyBot'
    allowed_domains = ['google.com']
    start_urls = [
        'https://www.google.com/search?q=apple+"iphone"+intext:iphone12&hl=en&rlz=&start=0']

    def parse(self, response):
        titles = response.xpath('//*[@id="main"]/div/div/div/a/h3/div//text()').extract()
        links = response.xpath('//*[@id="main"]/div/div/div/a/@href').extract()
        items = []
        for idx in range(len(titles)):
            item = GoogleScraperItem()
            item['title'] = titles[idx]
            item['link'] = links[idx].lstrip("/url?q=")
            items.append(item)
        df = pd.DataFrame(items, columns=['title', 'link'])
        writer = pd.ExcelWriter('test1.xlsx', engine='xlsxwriter')
        df.to_excel(writer, sheet_name='test1.xlsx')
        writer.save()
        return items
I get nine result items (title/link pairs) for this search:

https://www.google.com/search?q=apple+"iphone"+intext:iphone12&hl=en&rlz=&start=0

But when I open the Excel file (test1.xlsx), none of the links open properly.
I added the following to settings.py:

USER_AGENT = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.131 Safari/537.36"
ROBOTSTXT_OBEY = False
Answer (score: 2)
If you look closely at the URLs you extracted, they all carry sa, ved, and usg query parameters. These are clearly not part of the target site's URL; they are tracking parameters added by Google's search results. To get only the target URL, you should parse each link with the urllib library and extract just the q query parameter.
from urllib.parse import urlparse, parse_qs

# url is one extracted href, e.g. "/url?q=...&sa=...&ved=...&usg=..."
parsed_url = urlparse(url)
query_params = parse_qs(parsed_url.query)
target_url = query_params["q"][0]
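For illustration, here is how that parsing behaves on a made-up redirect href of the kind Google emits (the URL below is hypothetical, not real scrape output):

```python
from urllib.parse import urlparse, parse_qs

# Hypothetical href as extracted from a Google results page
url = "/url?q=https://www.apple.com/iphone-12/&sa=U&ved=abc123&usg=xyz789"

parsed_url = urlparse(url)                 # splits the path from the query string
query_params = parse_qs(parsed_url.query)  # {'q': [...], 'sa': [...], 'ved': [...], 'usg': [...]}
target_url = query_params["q"][0]
print(target_url)  # https://www.apple.com/iphone-12/
```

Note that the question's links[idx].lstrip("/url?q=") strips a *set of characters* from the left, not a literal prefix, and in any case leaves the trailing sa/ved/usg parameters attached to the link, which is why the saved links do not open properly.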
Full working code:
import scrapy
import pandas as pd
from urllib.parse import urlparse, parse_qs

from ..items import GoogleScraperItem  # adjust to wherever your item class is defined

class GoogleBotsSpider(scrapy.Spider):
    name = 'GoogleScrapyBot'
    allowed_domains = ['google.com']
    start_urls = [
        'https://www.google.com/search?q=apple+"iphone"+intext:iphone12&hl=en&rlz=&start=0']

    def parse(self, response):
        titles = response.xpath('//*[@id="main"]/div/div/div/a/h3/div//text()').extract()
        links = response.xpath('//*[@id="main"]/div/div/div/a/@href').extract()
        items = []
        for idx in range(len(titles)):
            item = GoogleScraperItem()
            item['title'] = titles[idx]
            # Parse the redirect href and keep only the q query parameter
            parsed_url = urlparse(links[idx])
            query_params = parse_qs(parsed_url.query)
            item['link'] = query_params["q"][0]
            items.append(item)
        df = pd.DataFrame(items, columns=['title', 'link'])
        writer = pd.ExcelWriter('test1.xlsx', engine='xlsxwriter')
        df.to_excel(writer, sheet_name='test1.xlsx')
        writer.close()  # writer.save() was deprecated in pandas 1.5 and removed in 2.0
        return items
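As an aside, the index-based loop that pairs titles with links can be written more idiomatically with zip. A standalone sketch with placeholder data (the titles and links below are made up, standing in for the XPath results):

```python
from urllib.parse import urlparse, parse_qs

# Placeholder data standing in for the XPath extraction results
titles = ["Apple iPhone 12", "iPhone 12 review"]
links = [
    "/url?q=https://www.apple.com/iphone-12/&sa=U&ved=a&usg=b",
    "/url?q=https://example.com/iphone-12-review&sa=U&ved=c&usg=d",
]

# zip pairs each title with its link; parse_qs keeps only the target URL
items = [
    {"title": title, "link": parse_qs(urlparse(link).query)["q"][0]}
    for title, link in zip(titles, links)
]
print(items[0])  # {'title': 'Apple iPhone 12', 'link': 'https://www.apple.com/iphone-12/'}
```

A list of plain dicts like this can be passed directly to pd.DataFrame(items) in place of the item objects.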