I have a base HTTP URL and a list of other HTTP URLs. I'm writing a simple crawler/link checker as an exercise (so please don't suggest pre-written tools) that checks the base URL for broken links and recursively crawls, with the same intent, every other "internal" page (i.e. every page within the same website that is linked from the base URL). At the end I have to output the list of links and their status (external/internal, plus a warning for every link that is actually internal but appears as an absolute URL).
So far I have succeeded in checking all the links and crawling the site using the requests and BeautifulSoup libraries, but I can't find a ready-made way to check whether two absolute URLs point to the same site (other than splitting the URLs along the slashes, which seems ugly to me). Is there a well-known library for this?
Answer 0 (score: 1)
In the end I went with urlparse (kudos to @padraic-cunningham for pointing me to it). At the beginning of the code I parse the "base URL" (i.e. the one I start crawling from):
base_parts = urlparse.urlparse(base_url)
Then, for every link I find (e.g. inside for a in soup.find_all('a')):
link_parts = urlparse.urlparse(a.get('href'))
At this point I also compare the URL schemes (I consider links to the same site that use different schemes, http vs. https, to be different; I may make this comparison optional in the future):
internal = base_parts.scheme == link_parts.scheme \
and base_parts.netloc == link_parts.netloc
并且在此处,如果链接指向与我的基本URL相同的服务器(具有相同的方案),则internal将为True
。您可以查看最终结果here。
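For completeness, here is a minimal sketch of how those pieces can be wrapped into a single helper. The function name is_internal and the example URLs are mine, not part of the original code; urlparse.urljoin is used so that relative hrefs are resolved against the base URL before comparing:

import urlparse

def is_internal(base_url, href):
    # Resolve relative hrefs against the base URL, then compare scheme and host
    base_parts = urlparse.urlparse(base_url)
    link_parts = urlparse.urlparse(urlparse.urljoin(base_url, href))
    return base_parts.scheme == link_parts.scheme \
        and base_parts.netloc == link_parts.netloc

print (is_internal('http://example.com/index.html', '/about.html'))           # True: same scheme and host
print (is_internal('http://example.com/index.html', 'https://example.com/'))  # False: different scheme
print (is_internal('http://example.com/index.html', 'http://other.org/'))     # False: different host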
Answer 1 (score: 0)
I wrote a crawler myself; I hope it will help you. Basically, what I do is append paths like /2/2/3/index.php to the website, which turns it into something like http://www.website.com/2/2/3/index.php. Then I insert every URL into an array and check whether I have visited it before; if I have, it won't be visited again. Also, if the site contains links to unrelated sites, such as a link to a YouTube video, it won't crawl YouTube or any other site that isn't "website related" either.
For your problem, I suggest you put all the visited URLs into an array and check that array with a for loop. If the URL is the same as one already in the array, print it.
I'm not sure this is exactly what you want, but at least I tried. I didn't use BeautifulSoup and it still works, so consider putting that module aside.
My script (or rather part of it; I also do exception checking in it, so don't panic):
__author__ = "Sploit"
# This part imports the default Python modules and the modules that the user has to download
# If a module does not exist, the script asks the user to install that specific module
import os # This module provides a portable way of using operating system dependent functionality
import urllib # The urllib module provides a simple interface for network resource access
import urllib2 # The urllib2 module provides a simple interface for network resource access
import time # This module provides various time-related functions
import urlparse # This module defines a standard interface to break URL strings up in components
# to combine the components back into a URL string, and to convert a relative URL to an absolute URL given a base URL.
import mechanize
print ("Which website would you like to crawl?")
website_url = raw_input("--> ")
# Adds http:// to the given URL because it is the only way to check for a server response
# If the user adds a path to the URL, it will be stripped
# Example: 'https://moz.com/learn/seo/external-link' will turn into 'https://moz.com/'
if website_url.split('//')[0] != 'http:' and website_url.split('//')[0] != 'https:':
    website_url = 'http://' + website_url
website_url = website_url.split('/')[0] + '//' + website_url.split('/')[2]
# The user will be stuck in a loop until a valid website is given, checked over HTTP at the application layer of the OSI model
while True:
    try:
        if urllib2.urlopen(website_url).getcode() != 200:
            print ("Invalid URL given. Which website would you like to crawl?")
            website_url = raw_input("--> ")
        else:
            break
    except:
        print ("Invalid URL given. Which website would you like to crawl?")
        website_url = raw_input("--> ")
# This part is the actual Web Crawler
# What it does is search for links
# All the URLs that are not the website's own URLs are written to a txt file named "Non website links"
fake_browser = mechanize.Browser() # Set the starting point for the spider and initialize a mechanize browser object
urls = [website_url] # A list of the URLs that the script still has to go through
visited = [website_url] # A list of the URLs we have already visited, to avoid duplicates
text_file = open("Non website links.txt", "w") # We create a txt file for all the URLs that are not the website's own URLs
text_file_url = open("Website links.txt", "w") # We create a txt file for all the URLs that are the website's own URLs
print ("Crawling : " + website_url)
print ("The crawler started at " + time.asctime(time.localtime()) + ". This may take a couple of minutes") # To let the user know when the crawler started to work
# Since the list of URLs grows dynamically, we just let the spider run until the last URL in the list yields no new links
while len(urls) > 0:
    try:
        fake_browser.open(urls[0])
        urls.pop(0)
        for link in fake_browser.links(): # A loop which goes over all the links on the page
            new_website_url = urlparse.urljoin(link.base_url, link.url) # Build an absolute HTTP URL from the page's link
            if new_website_url not in visited and website_url in new_website_url: # If we have already seen this URL, don't add it to the list, to avoid duplicates
                visited.append(new_website_url)
                urls.append(new_website_url)
                print ("Found: " + new_website_url) # Print all the links that the crawler found
                text_file_url.write(new_website_url + '\n') # Write the website URL to the txt file
            elif new_website_url not in visited and website_url not in new_website_url:
                visited.append(new_website_url)
                text_file.write(new_website_url + '\n') # Write the non-website URL to the txt file
    except:
        print ("Link couldn't be opened")
        urls.pop(0)
text_file.close() # Close the txt file, to prevent any more writing to it
text_file_url.close() # Close the txt file, to prevent any more writing to it
print ("A txt file with all the website links has been created in your folder")
print ("Finished!!")