Question

I have Python code and I want to get the hostname and the path separately. For example, for www.stackoverflow.com/questions/ask I want a result like this: "hostname is: www.stackoverflow.com, path is: /questions/ask"

Here is my Python code:

import urllib
from bs4 import BeautifulSoup
import urlparse
import mechanize
import socket
import errno
import io
from nyt4 import articalText

url = "http://www.nytimes.com/section/health"
br = mechanize.Browser()
br.set_handle_equiv(False) 
htmltext = br.open(url)
#htmltext = urllib.urlopen(url).read()
soup = BeautifulSoup(htmltext)
maindiv = soup.findAll('section', attrs={'class':'health-collection collection'})
for links in  maindiv:
    atags = soup.findAll('a',href=True)
    for link in atags:
        alinks= link.get('href')
        print alinks.hostname
        print alinks.path

But this code gives me this error:

Traceback (most recent call last):
  File "<pyshell#18>", line 1, in <module>
    execfile("nytimes/test2.py")
  File "nytimes/test2.py", line 21, in <module>
    print alinks.hostname
AttributeError: 'unicode' object has no attribute 'hostname'

Answer 1

alinks = link.get('href') sets alinks to a plain string, which has no hostname or path attributes. You can use urlparse to get the path and the hostname:

import mechanize
from bs4 import BeautifulSoup
from urlparse import urlparse

url = "http://www.nytimes.com/section/health"
br = mechanize.Browser()
br.set_handle_equiv(False)
htmltext = br.open(url)
#htmltext = urllib.urlopen(url).read()
soup = BeautifulSoup(htmltext)
maindiv = soup.find_all('section', attrs={'class':'health-collection collection'})
for links in maindiv:
    atags = soup.find_all('a',href=True)
    for link in atags:
        alinks = urlparse(link.get('href'))
        print alinks.hostname
        print alinks.path
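One thing to watch for: urlparse only treats something as a host when it follows "//", so a bare value like www.stackoverflow.com/questions/ask ends up entirely in the path, and relative hrefs such as /section/health have no hostname at all. Below is a minimal sketch of one way to handle both cases. It assumes Python 3 (where urlparse lives in urllib.parse instead of the urlparse module used above), and split_href and its base_url parameter are hypothetical names made up for this illustration, not part of any library.

```python
from urllib.parse import urljoin, urlparse


def split_href(href, base_url=None):
    """Return (hostname, path) for an href string.

    split_href and base_url are illustrative names, not a library API.
    """
    # Resolve relative links (e.g. "/section/health") against the page URL,
    # if the caller supplied one.
    if base_url is not None:
        href = urljoin(base_url, href)

    parts = urlparse(href)

    # A value with no scheme, no "//" and no leading "/" is assumed here to
    # start with a host name, so prefix "//" and parse it again.
    if not parts.scheme and not parts.netloc and not href.startswith("/"):
        parts = urlparse("//" + href)

    return parts.hostname, parts.path


if __name__ == "__main__":
    host, path = split_href("www.stackoverflow.com/questions/ask")
    print("hostname is: %s, path is: %s" % (host, path))
```

With the example from the question this gives www.stackoverflow.com as the hostname and /questions/ask as the path. Inside the loop you could call split_href(link.get('href'), base_url=url) so that relative links pick up the host of the page they came from. Without base_url, a relative link simply comes back with None as its hostname.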

I want to separate the hostname and the path from an href tag in Python
