I have a base HTTP URL and a list of other HTTP URLs. I'm writing a simple crawler/link checker as an exercise (so please don't suggest pre-written tools) that checks the base URL for broken links and recursively crawls, with the same intent, every other "internal" page (i.e. every page within the same website that is linked from the base URL). At the end I have to output the list of links and their status (external/internal, plus a warning for every link that is actually internal but appears as an absolute URL).
So far I have succeeded in checking all the links and crawling the site using the requests and BeautifulSoup libraries, but I can't find a ready-made way to check whether two absolute URLs point to the same site (other than splitting the URLs along the slashes, which seems ugly to me). Is there a well-known library for this?
Answer 0 (score: 1)
In the end I went with urlparse (kudos to @padraic-cunningham for pointing me to it). At the beginning of the code I parse the "base URL" (i.e. the one I start crawling from):
base_parts = urlparse.urlparse(base_url)
Then, for every link I find (e.g. inside for a in soup.find_all('a')):
link_parts = urlparse.urlparse(a.get('href'))
At this point I also compare the URL schemes (I consider links to the same site that use different schemes, http vs. https, to be different; I may make this comparison optional in the future):
internal = base_parts.scheme == link_parts.scheme \
and base_parts.netloc == link_parts.netloc
并且在此处,如果链接指向与我的基本URL相同的服务器(具有相同的方案),则internal将为True
。您可以查看最终结果here。
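For completeness, here is a minimal sketch of how those pieces can be wrapped into a single helper. The function name is_internal and the example URLs are mine, not part of the original code; urlparse.urljoin is used so that relative hrefs are resolved against the base URL before comparing:

import urlparse

def is_internal(base_url, href):
    # Resolve relative hrefs against the base URL, then compare scheme and host
    base_parts = urlparse.urlparse(base_url)
    link_parts = urlparse.urlparse(urlparse.urljoin(base_url, href))
    return base_parts.scheme == link_parts.scheme \
        and base_parts.netloc == link_parts.netloc

print (is_internal('http://example.com/index.html', '/about.html'))           # True: same scheme and host
print (is_internal('http://example.com/index.html', 'https://example.com/'))  # False: different scheme
print (is_internal('http://example.com/index.html', 'http://other.org/'))     # False: different host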
Answer 1 (score: 0)
I wrote a crawler myself; I hope it will help you. Basically, what I do is append paths like /2/2/3/index.php to the website, which turns it into something like http://www.website.com/2/2/3/index.php. Then I insert every URL into an array and check whether I have visited it before; if I have, it won't be visited again. Also, if the site contains links to unrelated sites, such as a link to a YouTube video, it won't crawl YouTube or any other site that isn't "website related" either.
For your problem, I suggest you put all the visited URLs into an array and check that array with a for loop. If the URL is the same as one already in the array, print it.
I'm not sure this is exactly what you want, but at least I tried. I didn't use BeautifulSoup and it still works, so consider putting that module aside.
My script (or rather part of it; I also do exception checking in it, so don't panic):
__author__ = "Sploit"
# This part imports the default Python modules and the modules that the user has to download
# If a module does not exist, the script asks the user to install that specific module
import os # This module provides a portable way of using operating system dependent functionality
import urllib # The urllib module provides a simple interface for network resource access
import urllib2 # The urllib2 module provides a simple interface for network resource access
import time # This module provides various time-related functions
import urlparse # This module defines a standard interface to break URL strings up in components
# to combine the components back into a URL string, and to convert a relative URL to an absolute URL given a base URL.
import mechanize
print ("Which website would you like to crawl?")
website_url = raw_input("--> ")
# Adds http:// to the given URL because it is the only way to check for a server response
# If the user adds a path to the URL, it will be stripped
# Example: 'https://moz.com/learn/seo/external-link' will turn into 'https://moz.com/'
if website_url.split('//')[0] != 'http:' and website_url.split('//')[0] != 'https:':
    website_url = 'http://' + website_url
website_url = website_url.split('/')[0] + '//' + website_url.split('/')[2]
# The user will be stuck in a loop until a valid website is given, checked over HTTP at the application layer of the OSI model
while True:
    try:
        if urllib2.urlopen(website_url).getcode() != 200:
            print ("Invalid URL given. Which website would you like to crawl?")
            website_url = raw_input("--> ")
        else:
            break
    except:
        print ("Invalid URL given. Which website would you like to crawl?")
        website_url = raw_input("--> ")
# This part is the actual Web Crawler
# What it does is search for links
# All the URLs that are not the website's own URLs are written to a txt file named "Non website links"
fake_browser = mechanize.Browser() # Set the starting point for the spider and initialize a mechanize browser object
urls = [website_url] # A list of the URLs that the script still has to go through
visited = [website_url] # A list of the URLs we have already visited, to avoid duplicates
text_file = open("Non website links.txt", "w") # We create a txt file for all the URLs that are not the website's own URLs
text_file_url = open("Website links.txt", "w") # We create a txt file for all the URLs that are the website's own URLs
print ("Crawling : " + website_url)
print ("The crawler started at " + time.asctime(time.localtime()) + ". This may take a couple of minutes") # To let the user know when the crawler started to work
# Since the list of URLs grows dynamically, we just let the spider run until the last URL in the list yields no new links
while len(urls) > 0:
    try:
        fake_browser.open(urls[0])
        urls.pop(0)
        for link in fake_browser.links(): # A loop which goes over all the links on the page
            new_website_url = urlparse.urljoin(link.base_url, link.url) # Build an absolute HTTP URL from the page's link
            if new_website_url not in visited and website_url in new_website_url: # If we have already seen this URL, don't add it to the list, to avoid duplicates
                visited.append(new_website_url)
                urls.append(new_website_url)
                print ("Found: " + new_website_url) # Print all the links that the crawler found
                text_file_url.write(new_website_url + '\n') # Write the website URL to the txt file
            elif new_website_url not in visited and website_url not in new_website_url:
                visited.append(new_website_url)
                text_file.write(new_website_url + '\n') # Write the non-website URL to the txt file
    except:
        print ("Link couldn't be opened")
        urls.pop(0)
text_file.close() # Close the txt file, to prevent any more writing to it
text_file_url.close() # Close the txt file, to prevent any more writing to it
print ("A txt file with all the website links has been created in your folder")
print ("Finished!!")