Question

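Hello everyone.

I am really stuck on a problem and have been banging my head against it all day. Any help would be much appreciated.

I have a local zip file that contains an index.html whose href URLs point to another local directory containing more URLs. The task is to use Beautiful Soup to get the links from the initial HTML file, follow those links to the HTML files in the other local directories, and retrieve the URLs from them.

So far I am able to retrieve URLs, but for some reason they are duplicated. I also still cannot see all of the HTML documents contained in the local directory.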

from tkinter import *
from zipfile import ZipFile
import os
from bs4 import BeautifulSoup
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
from pathlib import Path
import re
from urllib.parse import urljoin

documentList = {}
test = ""
test2 = {}
invertedIndexList = []


file = ZipFile("rhf.zip", "r")
fullpath = "rhf/"

def get_file_html(file_path):
    return BeautifulSoup(file.read(file_path).decode("utf-8", errors="ignore"), "html.parser")


def get_index_html():
    htmlSource = get_file_html("rhf/index.html")
    file_path = htmlSource.find("a")
    url = "rhf/" + file_path["href"]
    return get_file_html(url)

def main():
    test = get_index_html()
    fileName = "test.html"
    with open(fileName, "wt") as f:
        for link in test.find_all("a", {"href": re.compile('.htm')}):
            test2 = link.get("href")
            test3 = "<a href='" + test2 + "'>" + test2 + "</a><br>"
            print(test3)
            f.write(test3)

if __name__ == "__main__":
    main()
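
One thing that may contribute to the duplicates: the same href can appear in several anchors on one page, and in `re.compile('.htm')` the unescaped dot matches any character. Below is a minimal sketch of a stricter filter with order-preserving deduplication. The sample markup and the names `SAMPLE_HTML`, `HTML_HREF`, and `unique_html_links` are made up for illustration; the real pages in rhf.zip may look different.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical sample markup; not taken from the real archive.
SAMPLE_HTML = """
<a href="jokes/one.html">One</a>
<a href="jokes/one.html">One again</a>
<a href="jokes/two.htm">Two</a>
<a href="rightmost.gif">Image</a>
"""

# Escaped dot plus end anchor: matches only hrefs ending in .htm or .html.
HTML_HREF = re.compile(r"\.html?$", re.IGNORECASE)


def unique_html_links(html):
    """Return hrefs ending in .htm/.html, first occurrence only, in page order."""
    soup = BeautifulSoup(html, "html.parser")
    hrefs = [a["href"] for a in soup.find_all("a", href=HTML_HREF)]
    # dict.fromkeys keeps insertion order and drops repeated keys.
    return list(dict.fromkeys(hrefs))


print(unique_html_links(SAMPLE_HTML))
```

If the duplicates come from somewhere else (for example, the same page being reached by two different routes), this filter alone will not remove them.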

How can I use Beautiful Soup to scrape href URLs, follow those URLs, and scrape from them?

0 answers: