Scrapy General Scraper

Posted: 2017-10-05 09:58:29

Tags: python-2.7 web-crawler

I'm trying to build a general-purpose scraper with Scrapy, although it's proving a bit of a hassle. The idea is that it should take a URL as input and only scrape pages from that particular URL, but it seems to wander off the site to YouTube and so on. Ideally it would also have a depth option, allowing 1, 2, 3, etc. as the number of links away from the initial page. Any ideas on how to achieve this?

from bs4 import BeautifulSoup
from bs4.element import Comment
import urllib
from route import urls
import pickle
import os
import urllib2
import urlparse

def tag_visible(element):
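    # Skip text nodes that live inside non-rendered tags or HTML comments.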
    if element.parent.name in ['style', 'script', 'head', 'title', 'meta', '[document]']:
        return False
    if isinstance(element, Comment):
        return False
    return True


def text_from_html(body):
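    # Return the human-visible text of an HTML document as a single string.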
    soup = BeautifulSoup(body, 'html.parser')
    texts = soup.findAll(text=True)
    visible_texts = filter(tag_visible, texts)  
    return u" ".join(t.strip() for t in visible_texts)

def getAllUrl(url):
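    # Collect every link on the page, resolving relative hrefs against the page URL.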
    try:
        page = urllib2.urlopen( url ).read()
    except:
        return []
    urlList = []
    try:
        soup = BeautifulSoup(page, 'html.parser')
        soup.prettify()
        for anchor in soup.findAll('a', href=True):
            if not 'http://' in anchor['href']:
                if urlparse.urljoin(url, anchor['href']) not in urlList:
                    urlList.append(urlparse.urljoin(url, anchor['href']))
            else:
                if anchor['href'] not in urlList:
                    urlList.append(anchor['href'])

        length = len(urlList)

        return urlList
    except urllib2.HTTPError, e:
        print e

def listAllUrl(url):
    urls_new = list(set(url))
    return urls_new
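
# Main script: derive a folder name and a text-file name from the input URL,
# collect the links on the start page, then save each page's visible text.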
count = 0

main_url = str(raw_input('Enter the url : '))
url_split=main_url.split('.',1)
folder_name =url_split[1]
txtfile_split = folder_name.split('.',1)
txtfile_name = txtfile_split[0]
url = getAllUrl(main_url)
urls_new = listAllUrl(url)

os.makedirs('c:/Scrapy/Extracted/'+folder_name+"/")
for url in urls_new:
    if url.startswith("http") or url.startswith(" "):
        if(main_url == url):
            url = url
        else:
            pass
    else:
        url = main_url+url
    if '#' in url:
        new_url = str(url).replace('#','/')
    else:
        new_url =url
    count = count+1
    if new_url:
        print str(count) + ">>", new_url
        html = urllib.urlopen(new_url).read()
        page_text_data=text_from_html(html)
        with open("c:/Scrapy/Extracted/"+folder_name+"/"+txtfile_name+".txt", "a") as myfile:
            myfile.writelines("\n\n"+new_url.encode('utf-8')+"\n\n"+page_text_data.encode('utf-8'))
            path ='c:/Scrapy/Extracted/'+folder_name+"/"
        filename ="url"+str(count)+".txt"
        with open(os.path.join(path, filename), 'wb') as temp_file:
            temp_file.write(page_text_data.encode('utf-8'))
    else:
        pass    

2 Answers:

Answer 0 (score: 1)

Your current solution does not involve Scrapy at all. But since you asked specifically about Scrapy, here you go.

Base your spider on the CrawlSpider class. This lets you crawl a given website and, if needed, specify rules that the navigation has to obey.

To forbid off-site requests, use the allowed_domains spider attribute. Alternatively, if you use the CrawlSpider class, you can specify the allow_domains (or its counterpart deny_domains) attribute of the LinkExtractor you pass to your Rule constructor.
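For illustration, here is a minimal sketch of such a spider (not the answerer's code; the spider name, the example.com domain and the start URL are placeholders):

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class SiteSpider(CrawlSpider):
    # Placeholder name, domain and start URL -- adjust to the target site.
    name = 'site_spider'
    allowed_domains = ['example.com']      # off-site requests are filtered out
    start_urls = ['http://example.com/']

    rules = (
        # Follow only links on the allowed domain and parse every page found.
        Rule(LinkExtractor(allow_domains=['example.com']),
             callback='parse_page', follow=True),
    )

    def parse_page(self, response):
        # Emit the URL and the <title> text of each crawled page.
        yield {
            'url': response.url,
            'title': response.css('title::text').extract_first(),
        }

allowed_domains lets Scrapy's offsite middleware drop requests to other hosts, while allow_domains keeps the link extractor from yielding them in the first place. Instead of hard-coding start_urls, the start URL could also be supplied at runtime as a spider argument (scrapy crawl ... -a ...) if the spider is written to accept one.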

To restrict the crawl depth, use the DEPTH_LIMIT setting in settings.py.
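For example (assuming the default project layout), a single setting caps how many links away from the start URLs the crawl may go:

# In the project's settings.py -- depth is counted in links from the start URLs:
DEPTH_LIMIT = 2

# Or per spider, via the custom_settings class attribute:
#   custom_settings = {'DEPTH_LIMIT': 2}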

Answer 1 (score: 0)

You have the scrapy tag on the question, but you don't use Scrapy at all. I'd suggest you actually try it - it's easy, and much easier than developing all of this yourself. It already has an option to restrict requests to particular domains.