Question

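My code takes the link passed on the command prompt, fetches the HTML of the web page at that link, searches that HTML for links on the page, and then repeats these steps for every link it finds. I hope that is clear.

It should print out any links that cause errors.

Some more required information:

The maximum number of visits it can make is 100. If a website has an error, a None value is returned.

I am using Python 3.

For example: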

s = readwebpage(url)... # This line of code gets the HTML code for the link(url) passed in its argument.... if the link has an error, s = None.

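The HTML of that site has links on its page ending in p2.html, p3.html, p4.html and p5.html. My code reads all of these, but it does not visit those links individually to search for more links. If it did, it should search those links and find a link ending in p10.html, and then it should report that the link ending in p10.html has an error. Obviously it is not doing that at the moment, and it is giving me a hard time.

My code: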

    url = args.url[0]
    url_list = [url]
    checkedURLs = []
    AmountVisited = 0
    while (url_list and AmountVisited<maxhits):
        url = url_list.pop()
        s = readwebpage(url)
        print("testing url: http",url)                  #Print the url being tested, this code is here only for testing..
        AmountVisited = AmountVisited + 1
        if s == None:
            print("* bad reference to http", url)
        else:
            urls_list = re.findall(r'href="http([\s:]?[^\'" >]+)', s) #Creates a list of all links in HTML code starting with...
            while urls_list:                                          #... http or https
                insert = urls_list.pop()            
                while(insert in checkedURLs and urls_list):
                    insert = urls_list.pop()
                url_list.append(insert)
                checkedURLs = insert

Please help :)

Answer 1

Here is the code you want. However, please stop using regular expressions to parse HTML. BeautifulSoup is the way to go for that.

import re
from urllib.request import urlopen

def readwebpage(url):
    # Return the page's HTML as text, or None if the URL cannot be fetched.
    print("testing", url)
    try:
        return urlopen(url).read().decode("utf-8", errors="replace")
    except (OSError, ValueError):  # URLError/HTTPError are OSError subclasses
        return None

url = 'http://xrisk.esy.es'  # put starting url here

yet_to_visit = [url]
visited_urls = []

AmountVisited = 0
maxhits = 10

while yet_to_visit and AmountVisited < maxhits:

    print(yet_to_visit)
    current = yet_to_visit.pop()
    AmountVisited = AmountVisited + 1
    html = readwebpage(current)

    if html is None:
        print("* bad reference to", current)
    else:
        r = re.compile('(?<=href=").*?(?=")')
        links = re.findall(r, html)  # every href="..." value on the page
        for u in links:
            if u in visited_urls:
                continue  # already visited, skip it
            elif u.find('http') != -1:
                yet_to_visit.append(u)  # only queue links that contain http
        print(links)
    visited_urls.append(current)
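
For comparison, here is a rough Python 3 sketch of the same loop without the regex. It swaps in the standard library's html.parser instead of BeautifulSoup, purely so that nothing needs installing, and it keeps a set of every URL ever queued so the same link is never queued twice. The names LinkCollector, extract_links and crawl are made up for this illustration, and fetch is assumed to behave like the question's readwebpage (returns the HTML as a string, or None on error). It also visits pages breadth-first, which is a choice of this sketch rather than something taken from the code above.

```python
from collections import deque
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html):
    """Return the full http(s) links found in html, in document order."""
    parser = LinkCollector()
    parser.feed(html)
    return [u for u in parser.links if u.startswith(("http://", "https://"))]


def crawl(start_url, fetch, maxhits=100):
    """Visit up to maxhits pages; return the URLs for which fetch() gave None."""
    queue = deque([start_url])
    seen = {start_url}  # everything ever queued, so nothing is queued twice
    bad = []
    visited = 0
    while queue and visited < maxhits:
        current = queue.popleft()
        visited += 1
        html = fetch(current)
        if html is None:
            bad.append(current)  # the page could not be read
            continue
        for link in extract_links(html):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return bad
```

With the question's function this would be called as crawl(url, readwebpage), and each entry of the returned list can then be printed as a bad reference.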

Answer 2

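I suspect your regex is part of your problem. Right now you have http outside of your capture group, and [\s:] matches "some sort of whitespace (i.e. \s) or :".

I would change the regex to: urls_list = re.findall(r'href="(.*)"',s). That is, "match anything inside the quotes that follow href=". If you really need to ensure the http[s]://, use r'href="(https?://.*)"' (s? => one or zero s).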

EDIT: for a regex that actually works, use a non-greedy glom: r'href=(?P<q>[\'"])(https?://.*?)(?P=q)'

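(Also, uh, although it is not technically necessary in your case because re caches compiled patterns, I think it is a good habit to get into using re.compile.)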
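
To make the greedy versus non-greedy difference concrete, here is a small sketch. The sample html string and the variable names are made up for the illustration. Note that the non-greedy pattern has two groups (the quote character and the URL), so re.findall would return tuples; the sketch uses finditer and takes group 2 instead.

```python
import re

# Made-up sample: two full links (one double-quoted, one single-quoted)
# and one relative link, all on a single line.
html = ('<a href="http://example.com/p2.html">two</a> '
        "<a href='https://example.com/p3.html'>three</a> "
        '<a href="p4.html">four</a>')

greedy = re.compile(r'href="(.*)"')
lazy = re.compile(r'href=(?P<q>[\'"])(https?://.*?)(?P=q)')

# Greedy .* runs on to the LAST double quote on the line, so the whole
# tail of the line is swallowed into a single bogus "link".
greedy_hits = greedy.findall(html)

# Non-greedy .*? stops at the first quote that matches the opening one.
lazy_hits = [m.group(2) for m in lazy.finditer(html)]

print(len(greedy_hits))
print(lazy_hits)
```

Here the greedy pattern yields one long match, while the non-greedy one yields the two full URLs and skips the relative p4.html link.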

I think it is awfully nice that all of your URLs are full URLs. Do you have to deal with relative URLs at all?
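
If relative URLs do turn up, one possible approach is to resolve each href against the URL of the page it was found on, using urljoin from the standard library. This is only a sketch: the absolutize helper and the example.com addresses are invented for the illustration.

```python
from urllib.parse import urljoin


def absolutize(page_url, hrefs):
    """Resolve each href against the URL of the page it came from."""
    return [urljoin(page_url, href) for href in hrefs]


page = "http://example.com/docs/p1.html"
resolved = absolutize(page, [
    "p2.html",                        # relative to the current directory
    "/p3.html",                       # relative to the site root
    "../p4.html",                     # one directory up
    "https://other.example/p5.html",  # already absolute, left unchanged
])
for link in resolved:
    print(link)
```

The resolved links can then be queued exactly like the full URLs.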

Answer 3

Not Python, but since you mentioned you are not strictly tied to regex, I think you might find some use in wget.

wget --spider -o C:\wget.log -e robots=off -w 1 -r -l 10 http://www.stackoverflow.com

Breakdown:

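--spider: When invoked with this option, Wget will behave as a web spider, which means that it will not download the pages, just check that they are there.
-o C:\wget.log: Log all messages to C:\wget.log.
-e robots=off: Ignore robots.txt.
-w 1: Set a wait time of 1 second.
-r: Turn on recursive search.
-l 10: Set the recursion depth to 10, meaning wget will only go as deep as 10 levels; this may need to change depending on your maximum number of requests.
http://www.stackoverflow.com: The URL you want to start with.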

Once it completes, you can review the wget.log entries to determine which links had errors by searching for something like HTTP status code 404, etc.

Checking all links within links from the source HTML, Python
