Question

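A few days ago I got permission to run a metadata analysis project on a company's site for university, and I wanted to start testing today. I had been using a few tools I wrote in Python with BeautifulSoup, but found that none of them work. They open the given URL, but then do not crawl the way they should. I looked at the site and found that it does not use <a> tags with an href to specify links; instead it uses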

<link rel="alternate" type="redacted" title="<redacted>" » ICal Feed" href="<link>

How do I change my code to handle this? Honestly, I am not sure what this line even is. I am proficient in Python, but not so much in HTML.

The code below is what I use to find the links to spider. I then append them to a Python deque object.

    soup = BeautifulSoup(response.text, 'lxml')

    #determine spidering links
    for anchor in soup.find_all("link"):
        link = anchor.attrs["href"] if "href" in anchor.attrs and anchor.attrs["href"].find("mailto") == -1 and anchor.attrs["href"].find("tel") == -1 and anchor.attrs["href"].find("#") == -1 else ''

        if link.startswith('/'):
            link = base_url + link
        elif not link.startswith('http'):
            link = path + link
        if not link in new_urls and not link in processed_urls and not link.find(start) == -1:
            new_urls.append(link)

Answer 1

To get the links from your HTML example:

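    # find_all is the current name; findAll is a legacy alias
    tags = soup.find_all('link')

    # skip <link> tags that have no href to avoid a KeyError
    [tag["href"] for tag in tags if tag.has_attr("href")]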
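For context, a `<link>` element normally sits in the page head and points to related resources such as stylesheets or feeds; `rel="alternate"` marks an alternate representation of the page, which fits the iCal feed in your example. If you want the spider to follow both kinds of tag, something like the following might work. It is only a minimal sketch: `extract_links`, `sample_html` and the `example.com` URLs are made up for illustration, and it uses the built-in `html.parser` backend rather than `lxml` so it runs without extra dependencies. It also leaves out the "stay on the start domain" check from your loop.

```python
from collections import deque
from urllib.parse import urldefrag, urljoin, urlparse

from bs4 import BeautifulSoup


def extract_links(html, page_url):
    """Return absolute http(s) URLs found in <a> and <link> tags."""
    soup = BeautifulSoup(html, "html.parser")
    found = []
    # href=True skips tags that have no href attribute at all.
    for tag in soup.find_all(["a", "link"], href=True):
        # urljoin resolves relative and root-relative hrefs against the page URL.
        absolute = urljoin(page_url, tag["href"].strip())
        # Drop the #fragment so the same page is not queued twice.
        absolute, _fragment = urldefrag(absolute)
        # Keeping only http/https discards mailto:, tel: and similar schemes.
        if urlparse(absolute).scheme in ("http", "https") and absolute not in found:
            found.append(absolute)
    return found


# Hypothetical page, used only to show the behaviour.
sample_html = """
<html><head>
<link rel="alternate" type="text/calendar" title="ICal Feed" href="/events/feed.ics">
</head><body>
<a href="about.html#team">About</a>
<a href="mailto:someone@example.com">Mail</a>
<a href="tel:+15550100">Call</a>
<a name="no-href">Anchor without href</a>
</body></html>
"""

new_urls = deque(extract_links(sample_html, "https://example.com/dir/index.html"))
```

Checking the URL scheme instead of searching the string for "tel" or "mailto" avoids throwing away ordinary pages whose path happens to contain those letters, and `urljoin` replaces the manual `base_url + link` / `path + link` branches.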

Need help identifying links to crawl on a website (penetration testing)
