Question

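I am trying to find the internal-link ratio of a particular website by dividing the number of internal links by the total number of links on the page. If the ratio is >= 0.5 the function should return -1, otherwise it should return 1.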

import requests
from bs4 import BeautifulSoup
from urllib.parse import urlparse


def get_domain_name(url):
   splitted = urlparse(url).hostname.split('.')
   return  splitted[-2] + '.' + splitted[-1]


def internal_link(url):
   icount = 0
   count = 0

   base_domain_name = get_domain_name(url)
   page = requests.get(url)
   soup = BeautifulSoup(page.content,'html.parser')
   href_links = soup.find_all('a',href=True)
   for link in href_links:
       count = count + 1
       child_domain_name = get_domain_name(link)
       if child_domain_name == base_domain_name:
           icount = icount + 1   

   if count != 0 :
       ilink_ratio = icount/count
       #elink_ratio = 1 - ilink_ratio
   else:
       ilink_ratio = 0


   if ilink_ratio >= 0.5:
       return -1
   else:
       return 1


ans = internal_link('https://www.google.com')
print(ans)

My expected output is 1 or -1, but instead I get:

ans = internal_link('https://www.google.com')

Traceback (most recent call last):

File "<ipython-input-76-aa8cd733497f>", line 1, in <module>
ans = internal_link('https://www.google.com')

File "<ipython-input-75-36aae0789101>", line 17, in internal_link
child_domain_name = get_domain_name(link)

File "<ipython-input-75-36aae0789101>", line 2, in get_domain_name
splitted = urlparse(url).hostname.split('.')

File "C:\Users\Dell\Anaconda3\lib\urllib\parse.py", line 367, in urlparse
url, scheme, _coerce_result = _coerce_args(url, scheme)

File "C:\Users\Dell\Anaconda3\lib\urllib\parse.py", line 123, in _coerce_args
return _decode_args(args) + (_encode_result,)

File "C:\Users\Dell\Anaconda3\lib\urllib\parse.py", line 107, in _decode_args
return tuple(x.decode(encoding, errors) if x else '' for x in args)

File "C:\Users\Dell\Anaconda3\lib\urllib\parse.py", line 107, in <genexpr>
return tuple(x.decode(encoding, errors) if x else '' for x in args)

File "C:\Users\Dell\Anaconda3\lib\site-packages\bs4\element.py", line 1181, in decode
indent_space = (' ' * (indent_level - 1))

TypeError: unsupported operand type(s) for -: 'str' and 'int'

Answer 1

Your get_domain_name function does not handle all the edge cases.

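from urllib.parse import urlparse

def get_domain_name(url):
   # Only parse non-empty values
   if url:
       print(url)  # debugging aid: shows each value being parsed
       hostname = urlparse(url).hostname
       # hostname is None for relative links, mailto: links, etc.
       if hostname is not None and '.' in hostname:
           splitted = hostname.split('.')
           return  splitted[-2] + '.' + splitted[-1]
   # Anything else has no usable domain name
   return None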

This might work, but you should look out for other edge cases. I only tried it with the URL you mentioned in your post.
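One more edge case worth checking: the traceback passes through bs4/element.py, which indicates that urlparse received a BeautifulSoup Tag (the loop variable link) rather than a string. As far as I can tell, a non-empty check alone does not cover that, so the href attribute has to be read explicitly. Below is a minimal sketch under some assumptions: the helper internal_link_ratio and its hrefs parameter are names I made up for illustration, relative links are resolved with urljoin and therefore count as internal, and links without a hostname (mailto:, javascript:) count as external. It has not been run against the live site.

```python
from urllib.parse import urljoin, urlparse


def get_domain_name(url):
    """Return the last two hostname labels (e.g. 'google.com'), or None."""
    hostname = urlparse(url).hostname
    if hostname is None or '.' not in hostname:
        return None
    labels = hostname.split('.')
    return labels[-2] + '.' + labels[-1]


def internal_link_ratio(base_url, hrefs):
    """Hypothetical helper: fraction of hrefs on the same domain as base_url."""
    base_domain = get_domain_name(base_url)
    total = 0
    internal = 0
    for href in hrefs:
        total += 1
        # Resolve relative links such as '/about' against the page URL.
        absolute = urljoin(base_url, href)
        child_domain = get_domain_name(absolute)
        # Links with no usable hostname are treated as external (assumption).
        if base_domain is not None and child_domain == base_domain:
            internal += 1
    return internal / total if total else 0


def internal_link(url):
    # Third-party imports kept local so the helpers above need only the stdlib.
    import requests
    from bs4 import BeautifulSoup

    page = requests.get(url)
    soup = BeautifulSoup(page.content, 'html.parser')
    # link['href'] is a plain string; the Tag object itself is not.
    hrefs = [link['href'] for link in soup.find_all('a', href=True)]
    return -1 if internal_link_ratio(url, hrefs) >= 0.5 else 1
```

Keeping the ratio calculation separate from the HTTP request makes it possible to try the logic on a hand-written list of hrefs first. Note that taking the last two labels is only a heuristic: for a host like www.example.co.uk it yields co.uk, so sites under multi-part suffixes would need a different approach.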

unsupported operand type(s) for -: 'str' and 'int': while comparing the base domain name and child domain name when crawling a website
