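I am trying to find the internal link ratio of a given website by dividing the number of internal links by the total number of links on the page. If the ratio is >= 0.5 the function should return -1, otherwise it should return 1.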
import requests
from bs4 import BeautifulSoup
from urllib.parse import urlparse

def get_domain_name(url):
    splitted = urlparse(url).hostname.split('.')
    return splitted[-2] + '.' + splitted[-1]

def internal_link(url):
    icount = 0
    count = 0
    base_domain_name = get_domain_name(url)
    page = requests.get(url)
    soup = BeautifulSoup(page.content, 'html.parser')
    href_links = soup.find_all('a', href=True)
    for link in href_links:
        count = count + 1
        child_domain_name = get_domain_name(link)
        if child_domain_name == base_domain_name:
            icount = icount + 1
    if count != 0:
        ilink_ratio = icount / count
        #elink_ratio = 1 - ilink_ratio
    else:
        ilink_ratio = 0
    if ilink_ratio >= 0.5:
        return -1
    else:
        return 1

ans = internal_link('https://www.google.com')
print(ans)
My expected output is 1 or -1, but instead I get:
ans = internal_link('https://www.google.com')
Traceback (most recent call last):
  File "<ipython-input-76-aa8cd733497f>", line 1, in <module>
    ans = internal_link('https://www.google.com')
  File "<ipython-input-75-36aae0789101>", line 17, in internal_link
    child_domain_name = get_domain_name(link)
  File "<ipython-input-75-36aae0789101>", line 2, in get_domain_name
    splitted = urlparse(url).hostname.split('.')
  File "C:\Users\Dell\Anaconda3\lib\urllib\parse.py", line 367, in urlparse
    url, scheme, _coerce_result = _coerce_args(url, scheme)
  File "C:\Users\Dell\Anaconda3\lib\urllib\parse.py", line 123, in _coerce_args
    return _decode_args(args) + (_encode_result,)
  File "C:\Users\Dell\Anaconda3\lib\urllib\parse.py", line 107, in _decode_args
    return tuple(x.decode(encoding, errors) if x else '' for x in args)
  File "C:\Users\Dell\Anaconda3\lib\urllib\parse.py", line 107, in <genexpr>
    return tuple(x.decode(encoding, errors) if x else '' for x in args)
  File "C:\Users\Dell\Anaconda3\lib\site-packages\bs4\element.py", line 1181, in decode
    indent_space = (' ' * (indent_level - 1))
TypeError: unsupported operand type(s) for -: 'str' and 'int'
Answer 0 (score: 0)
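Your get_domain_name function does not handle every edge case. Here is a version that guards against an empty URL and a missing hostname: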
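def get_domain_name(url):
    if url:
        print(url)  # debug output; remove once it works
        hostname = urlparse(url).hostname
        # hostname is None for relative links such as '/about'
        if hostname is not None and '.' in hostname:
            splitted = hostname.split('.')
            return splitted[-2] + '.' + splitted[-1]
    # No usable hostname: signal that to the caller explicitly
    return None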
This may work, but you should check for other edge cases yourself. I only tried it with the URL mentioned in your post.
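One more thing that may be worth checking: the traceback ends inside bs4/element.py, which suggests that `link` in the loop is a BeautifulSoup Tag rather than a string, so `urlparse` ends up calling the tag's own `decode` method. Passing `link['href']` instead should avoid that. Below is a minimal sketch along those lines, not a definitive implementation. The helper name `internal_link_ratio` and the two-argument form of `internal_link` are hypothetical, and two choices are assumptions on my part: relative links are resolved with `urljoin`, and links without a hostname (for example `mailto:`) are skipped rather than counted. It takes plain href strings so it can be tried without a network request.

```python
from urllib.parse import urljoin, urlparse


def get_domain_name(url):
    """Return the last two hostname labels, or None if there is no usable hostname."""
    hostname = urlparse(url).hostname
    if hostname is None or '.' not in hostname:
        return None
    parts = hostname.split('.')
    return parts[-2] + '.' + parts[-1]


def internal_link_ratio(page_url, hrefs):
    """Fraction of hrefs (plain strings) that share page_url's domain."""
    base_domain = get_domain_name(page_url)
    total = 0
    internal = 0
    for href in hrefs:
        # Resolve relative links such as '/about' against the page URL.
        absolute = urljoin(page_url, href)
        domain = get_domain_name(absolute)
        if domain is None:
            # Assumption: links without a hostname (mailto:, javascript:) are ignored.
            continue
        total += 1
        if domain == base_domain:
            internal += 1
    return internal / total if total else 0


def internal_link(page_url, hrefs):
    """Return -1 if at least half of the links are internal, otherwise 1."""
    return -1 if internal_link_ratio(page_url, hrefs) >= 0.5 else 1


# With BeautifulSoup, the href strings could be collected like this:
#   hrefs = [a['href'] for a in soup.find_all('a', href=True)]
sample = ['/about', 'https://mail.google.com/x', 'https://example.com/', 'mailto:a@b.com']
print(internal_link('https://www.google.com/', sample))
```

Note that taking the last two labels of the hostname treats something like `example.co.uk` as `co.uk`, so this is still only an approximation of "same site".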