Crawling a website's directories with BeautifulSoup?

时间:2019-06-20 00:00:05

标签: python beautifulsoup web-crawler

Here is my code: https://pastebin.com/R11qiTF4

from bs4 import BeautifulSoup as soup
from urllib.request import urlopen as req
from urllib.parse import urljoin
import re

urls = ["https://www.helios-gesundheit.de"]
domain_list = ["https://www.helios-gesundheit.de/kliniken/schwerin/"]
prohibited = ["info", "news"]
text_keywords = ["Helios"]  # listing the same word twice has no effect
url_list = []

desired = "https://www.helios-gesundheit.de/kliniken/schwerin/unser-angebot/unsere-fachbereiche-klinikum/allgemein-und-viszeralchirurgie/team-allgemein-und-viszeralchirurgie/"

# rebuild each full start URL from its base domain and path
for x in range(len(domain_list)):
    url_list.append(urls[x] + domain_list[x].replace(urls[x], ""))

print(url_list)

def prohibitedChecker(prohibited_list, string):
    # True if any word from prohibited_list occurs in string;
    # the False case must wait until every word has been checked
    for x in prohibited_list:
        if x in string:
            return True
    return False

def parseHTML(url):
    # fetch the page at url and return its parsed HTML tree
    requestHTML = req(url)
    htmlPage = requestHTML.read()
    requestHTML.close()
    parsedHTML = soup(htmlPage, "html.parser")
    return parsedHTML

searched_word = "Helios"

for url in url_list:
    parsedHTML = parseHTML(url)
    # collect every <a> element on the start page that carries an href
    href_crawler = parsedHTML.find_all("a", href=True)
    for href in href_crawler:
        # resolve relative links against the current page's URL
        crawled_url = urljoin(url, href.get("href"))
        print(crawled_url)
        # crude filter that skips external or incomplete links
        if "www" not in crawled_url:
            continue
        parsedHTML = parseHTML(crawled_url)
        # grab every text node on the linked page containing the searched word
        results = parsedHTML.body.find_all(string=re.compile('.*{0}.*'.format(searched_word)), recursive=True)
        for single_result in results:
            keyword_text_check = prohibitedChecker(text_keywords, single_result.string)
            if not keyword_text_check:
                continue
            print(single_result.string)

I am trying to print the contents of the "desired" variable. The problem is that my code never even requests the "desired" URL, because that page is outside the set of pages it visits: the "desired" href sits behind another href inside the page I am currently scraping. I thought I could solve this by adding another for loop inside the for loop at line 39, one that requests every href found by the first one, but that gets messy and inefficient.
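For illustration, here is a minimal sketch of the queue-based alternative I have in mind (the names crawl, max_pages, and seen are my own placeholders, not part of the code above); a queue plus a visited set follows links to any depth without nesting loops:

from collections import deque
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen
from bs4 import BeautifulSoup

def crawl(start_url, max_pages=50):
    # breadth-first crawl: visit pages level by level instead of nesting loops
    seen = {start_url}
    queue = deque([start_url])
    while queue and len(seen) < max_pages:
        current = queue.popleft()
        try:
            with urlopen(current) as response:
                page = BeautifulSoup(response.read(), "html.parser")
        except Exception:
            continue  # skip pages that fail to load or parse
        for anchor in page.find_all("a", href=True):
            link = urljoin(current, anchor["href"])
            # stay on the same host and never enqueue a page twice
            if urlparse(link).netloc == urlparse(start_url).netloc and link not in seen:
                seen.add(link)
                queue.append(link)
    return seen

Every URL the site links to internally ends up in seen, so a page like "desired" would eventually be reached once anything on the path links to it.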

Is there a way to get a list of every directory under a website's URL?
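One idea I have, for what it's worth, is the site's sitemap, if it publishes one; this sketch simply assumes that https://www.helios-gesundheit.de/sitemap.xml exists at the conventional location:

import urllib.request
import xml.etree.ElementTree as ET

# assumption: the site exposes a sitemap at the standard path
sitemap_url = "https://www.helios-gesundheit.de/sitemap.xml"
with urllib.request.urlopen(sitemap_url) as response:
    root = ET.fromstring(response.read())

# sitemap entries live in this standard XML namespace
ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
for loc in root.findall(".//sm:loc", ns):
    print(loc.text)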

0 Answers:

There are no answers