I need to search Google and get the count of a particular word using Scrapy

Time: 2018-01-01 07:19:23

Tags: python-3.x scrapy

I need to search Google and get the count of a particular word using Scrapy. A word is given to the application from the console; the spider should search Google for it and print the count of matches. The word must not be hardcoded in the application, it has to come from the console.

Here is the code that gets the count from Google:

import scrapy
import re

class GoogleSpider(scrapy.Spider):

    name = 'Google'
    allowed_domains = ['www.google.co.in']

    def __init__(self, word=None):
        super().__init__()
        self.word = word
        self.start_urls = ['https://www.google.co.in/search?q='+self.word]

    def parse(self, response):
        print('url:', response.url)
        text = response.xpath('//div[@class="g"]//text()').extract()
        text = ''.join(text).lower()
        count = len(re.findall(self.word, text))
        print('count:', count)

from scrapy.crawler import CrawlerProcess
import sys

c = CrawlerProcess({
    'USER_AGENT': 'Mozilla/5.0',
})
#c.crawl(GoogleSpider, word='abba')
c.crawl(GoogleSpider, word=sys.argv[1])
c.start()

I get the following error when running the code. How can I fix it?

2018-01-02 11:28:04 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2018-01-02 11:28:04 [scrapy.core.engine] INFO: Spider opened
2018-01-02 11:28:04 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-01-02 11:28:04 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2018-01-02 11:28:04 [scrapy.core.engine] DEBUG: Crawled (404) <GET https://www.google.co.in/searchq=newyear> (referer: None)
2018-01-02 11:28:04 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <404 https://www.google.co.in/searchq=newyear>: HTTP status code is not handled or not allowed
2018-01-02 11:28:04 [scrapy.core.engine] INFO: Closing spider (finished)
2018-01-02 11:28:04 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 207,
 'downloader/request_count': 1,
 'downloader/request_method_count/GET': 1,
 'downloader/response_bytes': 1860,
 'downloader/response_count': 1,
 'downloader/response_status_count/404': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2018, 1, 2, 5, 58, 4, 873996),
 'httperror/response_ignored_count': 1,
 'httperror/response_ignored_status_count/404': 1,
 'log_count/DEBUG': 2,
 'log_count/INFO': 8,
 'response_received_count': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2018, 1, 2, 5, 58, 4, 500024)}
2018-01-02 11:28:04 [scrapy.core.engine] INFO: Spider closed (finished)

1 Answer:

Answer 0 (score: 0):

Your biggest mistake: you did nothing to check whether you actually get the expected data and whether the expected number is in it. The code also produced an error message, but you didn't include it in the question.

The first mistake is the URL. The URL 'https://www.google.co.in/search?dcr=0'+self.word does not open a page with results for me; only 'https://www.google.co.in/search?q='+self.word does. Note q= instead of dcr=0.
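As a side note, a search word that contains spaces (such as "new year") should be URL-encoded before being appended to the query string. Here is a minimal sketch using only the standard library; quote_plus is my suggestion, not part of the original answer:

from urllib.parse import quote_plus

word = 'new year'
# quote_plus encodes spaces as '+' and escapes other unsafe characters:
url = 'https://www.google.co.in/search?q=' + quote_plus(word)
print(url)  # https://www.google.co.in/search?q=new+year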

The second mistake is using self.word in a class-level start_urls = [...]. start_urls is created before __init__ is executed, so self.word does not exist yet. Inside __init__ you have to do:

self.start_urls = ['https://www.google.co.in/search?q='+self.word]

You should have received an error message for this, but you didn't include it; that is a big mistake.
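To see why the class-level version fails, here is a minimal sketch, independent of Scrapy, showing that a class attribute is evaluated when the class body runs, before __init__ and before any instance exists (example.com is just a placeholder URL):

class Spider:
    # This would raise NameError: name 'self' is not defined,
    # because the class body runs before any instance exists:
    # start_urls = ['https://example.com/search?q=' + self.word]

    def __init__(self, word=None):
        self.word = word
        # Inside __init__ the instance exists, so self.word is available:
        self.start_urls = ['https://example.com/search?q=' + self.word]

s = Spider(word='abba')
print(s.start_urls)  # ['https://example.com/search?q=abba']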

The third mistake, one you may not know about: Google uses JavaScript on its pages, but when you visit with a browser or a program/script that does not run JavaScript, it sends the results with different tags and different classes. Turn off JavaScript in your browser and open Google again to see the difference.

So there is no class="sbqs_c"; the results are in class="g" instead.
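A quick way to verify which classes the non-JavaScript version of the page uses is to fetch it from a plain script and search the raw HTML. A minimal sketch, assuming the third-party requests library is installed:

import requests

# Fetch the page the way a script (no JavaScript) sees it:
resp = requests.get(
    'https://www.google.co.in/search?q=abba',
    headers={'User-Agent': 'Mozilla/5.0'},
)
html = resp.text
print('class="g" present:', 'class="g"' in html)
print('class="sbqs_c" present:', 'class="sbqs_c"' in html)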

The code below gives "count: 42" for the word "abba".

EDIT: I added sys.argv so now it can be run as:
python script.py abba

python script.py "new year"

Code:

import scrapy
import re

class GoogleSpider(scrapy.Spider):

    name = 'Google'
    allowed_domains = ['www.google.co.in']

    def __init__(self, word=None):
        super().__init__()
        self.word = word
        self.start_urls = ['https://www.google.co.in/search?q='+self.word]

    def parse(self, response):
        print('url:', response.url)

        text = response.xpath('//*[@class="g"]//text()').extract() # text only in search results
        #text = response.xpath('//text()').extract() # all text on page

        text = ''.join(text).lower()
        count = len(re.findall(self.word, text))

        print('count:', count)

# --- run the spider standalone, without a Scrapy project ---

from scrapy.crawler import CrawlerProcess
import sys

c = CrawlerProcess({
    'USER_AGENT': 'Mozilla/5.0',
})
#c.crawl(GoogleSpider, word='abba')
c.crawl(GoogleSpider, word=sys.argv[1])
c.start()
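One caveat about the counting line: re.findall treats self.word as a regular expression, so a word containing regex metacharacters such as . or + would be misinterpreted. A safer variant, my suggestion rather than part of the original answer, escapes the word and also lowercases it to match the lowercased page text:

import re

text = 'c++ and c++ again'.lower()
word = 'C++'

# Without re.escape, 'c++' is an invalid pattern (re.error: multiple repeat).
# re.escape turns the word into a literal pattern; lowercasing keeps the
# match consistent with the lowercased page text:
count = len(re.findall(re.escape(word.lower()), text))
print('count:', count)  # count: 2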