I am trying to build a generic scraper with Scrapy, though it is proving a bit troublesome. The idea is that it should take a URL as input and only scrape pages from that particular URL, but it seems to wander off the site to YouTube and the like. Ideally it would also have a depth option, allowing 1, 2, 3, etc. as the number of links away from the initial page. Any ideas on how to implement this?
# Python 2 code (urllib2, urlparse, raw_input, print statements)
from bs4 import BeautifulSoup
from bs4.element import Comment
import urllib
import urllib2
import urlparse
import os


def tag_visible(element):
    if element.parent.name in ['style', 'script', 'head', 'title', 'meta', '[document]']:
        return False
    if isinstance(element, Comment):
        return False
    return True


def text_from_html(body):
    # extract only the visible text from a page
    soup = BeautifulSoup(body, 'html.parser')
    texts = soup.findAll(text=True)
    visible_texts = filter(tag_visible, texts)
    return u" ".join(t.strip() for t in visible_texts)


def getAllUrl(url):
    # collect every link on the page, resolving relative hrefs against the page URL
    try:
        page = urllib2.urlopen(url).read()
    except:
        return []
    urlList = []
    try:
        soup = BeautifulSoup(page, 'html.parser')
        for anchor in soup.findAll('a', href=True):
            if 'http://' not in anchor['href']:
                absolute = urlparse.urljoin(url, anchor['href'])
                if absolute not in urlList:
                    urlList.append(absolute)
            else:
                if anchor['href'] not in urlList:
                    urlList.append(anchor['href'])
        return urlList
    except urllib2.HTTPError as e:
        print e


def listAllUrl(urls):
    # de-duplicate the collected URLs
    return list(set(urls))


count = 0
main_url = str(raw_input('Enter the url : '))
url_split = main_url.split('.', 1)
folder_name = url_split[1]
txtfile_split = folder_name.split('.', 1)
txtfile_name = txtfile_split[0]

urls = getAllUrl(main_url)
urls_new = listAllUrl(urls)

os.makedirs('c:/Scrapy/Extracted/' + folder_name + "/")
for url in urls_new:
    if url.startswith("http") or url.startswith(" "):
        if main_url == url:
            url = url
        else:
            pass
    else:
        url = main_url + url
    if '#' in url:
        new_url = str(url).replace('#', '/')
    else:
        new_url = url
    count = count + 1
    if new_url:
        print "" + str(count) + ">>", new_url
        html = urllib.urlopen(new_url).read()
        page_text_data = text_from_html(html)
        # append every page's text to one combined file...
        with open("c:/Scrapy/Extracted/" + folder_name + "/" + txtfile_name + ".txt", "a") as myfile:
            myfile.writelines("\n\n" + new_url.encode('utf-8') + "\n\n" + page_text_data.encode('utf-8'))
        # ...and also write each page's text to its own numbered file
        path = 'c:/Scrapy/Extracted/' + folder_name + "/"
        filename = "url" + str(count) + ".txt"
        with open(os.path.join(path, filename), 'wb') as temp_file:
            temp_file.write(page_text_data.encode('utf-8'))
    else:
        pass
Answer 0 (score: 1)
Your current solution does not involve Scrapy at all. But since you asked specifically about Scrapy, here you go.
Base your spider on the CrawlSpider class. This lets you crawl a given website and specify rules that the navigation must follow.
To disallow offsite requests, use the allowed_domains spider attribute. Alternatively, if you use the CrawlSpider class, you can be more specific with the allow_domains and deny_domains attributes of the LinkExtractor you pass to each Rule.

To restrict the crawl depth, use the DEPTH_LIMIT setting in settings.py.
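For illustration, a minimal sketch of such a spider (this is not code from the original answer: the spider name, example.com, and the parse_page callback are placeholders, and the text extraction is only a rough stand-in for the question's text_from_html):

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class SingleSiteSpider(CrawlSpider):
    name = 'single_site'                      # placeholder name
    allowed_domains = ['example.com']         # offsite links (e.g. youtube.com) are dropped
    start_urls = ['http://example.com/']
    custom_settings = {'DEPTH_LIMIT': 2}      # or set DEPTH_LIMIT in settings.py

    rules = (
        # allow_domains/deny_domains on the LinkExtractor give finer control per Rule
        Rule(LinkExtractor(allow_domains=['example.com']), callback='parse_page', follow=True),
    )

    def parse_page(self, response):
        # rough equivalent of the question's text_from_html(): dump url + body text
        yield {
            'url': response.url,
            'text': ' '.join(t.strip() for t in response.xpath('//body//text()').extract()),
        }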
Answer 1 (score: 0)
You have tagged the question with scrapy, but you don't use it at all. I recommend you try it - it is very easy, and much easier than trying to develop this yourself. It already has an option to restrict requests to specific domains.
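As a sketch of that (assuming the CrawlSpider sketch above is saved as single_site.py; the file name, depth value, and output file are just examples), both the domain restriction and the depth limit can be driven from the command line, since -s overrides any Scrapy setting and -o writes the scraped items to a file:

scrapy runspider single_site.py -s DEPTH_LIMIT=2 -o pages.json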