Question

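I have a program that takes a website's source code/HTML and outputs the href tags. It works well and uses BeautifulSoup4.

I'd like a variant of this code that still looks only at <a href="..."> tags but returns just the top-level hostnames found in the site's source, for example: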

stackoverflow.com
google.com

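and so on, but not lower-level paths such as stackoverflow.com/questions/. Right now it outputs everything, including /, #t8, etc., and I need to filter those out.

Here is the current code I use to extract all the href tags.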

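import sys
import urllib

from bs4 import BeautifulSoup

url = sys.argv[1]  # URL given on the command line, e.g. http://www.google.com
html = urllib.urlopen(url).read()
soup = BeautifulSoup(html, 'html.parser')

# get hosts: print the href of every <a> tag that has one
for a in soup.find_all('a', href=True):
    print a['href']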

Thanks!

Answer 1

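It sounds like you're looking for the .netloc attribute of urlparse. It's part of the Python standard library: https://docs.python.org/2/library/urlparse.html

For example, urlparse('http://stackoverflow.com/questions/').netloc returns 'stackoverflow.com'.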
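
As a rough sketch of how .netloc could be used to filter the hrefs from the loop in the question: relative links such as / and #t8 have an empty netloc, so they can be skipped. The names extract_hosts and sample_hrefs are hypothetical, the sample list stands in for the a['href'] values, and the try/except import is an assumption added so the same code also runs on Python 3.

```python
try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse      # Python 2, the module linked above


def extract_hosts(hrefs):
    """Return the unique hostnames found in hrefs, in first-seen order."""
    hosts = []
    for href in hrefs:
        host = urlparse(href).netloc
        # Relative links ('/', '#t8') have an empty netloc, so skip them.
        if host and host not in hosts:
            hosts.append(host)
    return hosts


# Hypothetical sample of the values a['href'] might yield on a page.
sample_hrefs = [
    "/",
    "#t8",
    "http://stackoverflow.com/questions/",
    "https://google.com/search?q=python",
    "http://stackoverflow.com/tags",
]

print(extract_hosts(sample_hrefs))  # ['stackoverflow.com', 'google.com']
```

In the original program, the list passed in would be something like [a['href'] for a in soup.find_all('a', href=True)]. Note that netloc is the full host as written in the link (www.google.com stays www.google.com) and that an href without a scheme, such as www.google.com/x, is parsed as a path and yields an empty netloc.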

Python HTML parsing: getting a site's top-level hosts
