Question

When I try to scrape Google search results, Scrapy only yields the Google home page: http://pastebin.com/FUbvbhN4

Here is my spider:

import scrapy

class GoogleFinanceSpider(scrapy.Spider):
    name = "google"
    start_urls = ['http://www.google.com/#q=finance.google.com:+3m+co']
    allowed_domains = ['www.google.com']

    def parse(self, response):
        filename = response.url.split("/")[-2] + '.html'
        with open(filename, 'wb') as f:
            f.write(response.body)

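Is there something wrong with using this URL as a start URL? When I open it in a browser by typing it into the address bar (rather than filling in the search form), I get valid search results.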

Answer 1

In most cases Google redirects spiders to a CAPTCHA page; Bing search results are easier to scrape.

There is a project that scrapes Google / Bing / Baidu search results: https://github.com/titantse/seCrawler

Answer 2

Yes, it looks like that address ends up at the home page:

Example output from scrapy shell 'http://www.google.com/#q=finance.google.com:+3m+co':

...
[s]   request    <GET http://www.google.com/#q=finance.google.com:+3m+co>
[s]   response   <200 http://www.google.com/>
...

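Check whether your URL makes sense: it contains no query parameters, only a #q fragment (which is not a URL parameter). The fragment is never sent to the server; it is the browser that recognizes it and turns it into a Google search, so it is not really part of the requested URL.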
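
To see why the request ends up at the home page, here is a small sketch using only Python's standard library (not Scrapy itself). It splits the start URL from the question and shows that everything after # is a fragment rather than a query string, so the URL actually requested is the bare home page.

```python
from urllib.parse import urlsplit, urldefrag

# The start URL from the question.
url = "http://www.google.com/#q=finance.google.com:+3m+co"

parts = urlsplit(url)
print(repr(parts.query))     # the query string is empty
print(repr(parts.fragment))  # the search terms live in the fragment

# Strip the fragment to get the URL that is sent over the wire.
requested_url, fragment = urldefrag(url)
print(requested_url)         # http://www.google.com/
```

The stripped URL matches the response line in the scrapy shell output above.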

The correct Google search URL is: http://www.google.com/search?q=YOURQUERY
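
A minimal sketch of building such a URL with the standard library follows. The helper name google_search_url is hypothetical, and the query text assumes that the + signs in the original fragment stood for spaces.

```python
from urllib.parse import urlencode

def google_search_url(query):
    """Hypothetical helper: put the search terms in the q query parameter."""
    return "http://www.google.com/search?" + urlencode({"q": query})

# Assumed query text, reconstructed from the fragment in the question.
start_url = google_search_url("finance.google.com: 3m co")
print(start_url)
```

The resulting string could be used in the spider's start_urls in place of the #q form. As Answer 1 notes, Google may still respond with a CAPTCHA page, so this only fixes the URL, not any blocking.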

Scrapy: Google crawl does not work
