Question

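I am extracting URL links from a wiki page and get a "ValueError" when trying to open some of them. I am looking for a way to either ignore the error or fix the problem. It seems that as the loop works through the extracted links, it hits one that is not recognized as a URL and fails with a traceback.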

from bs4 import BeautifulSoup
import urllib.request, urllib.parse, urllib.error
import ssl
import re

ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE
url = input("Enter First Link: ")
if len(url)<1:   url = "https://www.bing.com/search?q=k+means+wiki&src=IE-SearchBox&FORM=IENAD2"

position = 18
process = 7

# repeat the fetch `process` times
for i in range(process):
    html = urllib.request.urlopen(url, context=ctx)
    soup = BeautifulSoup(html, 'html.parser')
    tags = soup('a')
    count = 0
    for tag in tags:
        count = count +1
        # stop once `position` links have been read
        if count>position:
            break
        url = tag.get('href', None)

        print(url)

This raises:

ValueError      Traceback (most recent call last)

ValueError: unknown url type: '/search?q=Cluster+analysis%20wikipedia&FORM=WIKIRE'

Answer 1

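The URL it ran into has no scheme or domain. It is a relative URL, which means it has to be joined onto the current page's scheme and domain before you can open it. URLs normally begin with scheme://domain, as in:

https://www.facebook.com

If you check your URLs to make sure they contain a scheme and a domain, and prepend them when they are missing, you avoid this error.

An example:

/search?q=stack+overflow

could be a relative URL for searching Stack Overflow on Google.

To reconstruct the full URL, add https://www.google.com at the beginning, and it becomes the actual search link https://www.google.com/search?q=stack+overflow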
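The check-and-prepend step described above could look roughly like this. It is a minimal sketch, not the only way to do it: `to_absolute` is a hypothetical helper name, and it relies on `urlparse` and `urljoin` from the standard library's `urllib.parse` to do the joining instead of plain string concatenation.

```python
from urllib.parse import urljoin, urlparse


def to_absolute(page_url, href):
    """Return href as an absolute URL, resolving it against page_url if needed.

    to_absolute is a hypothetical helper name used only for this sketch.
    """
    parts = urlparse(href)
    if parts.scheme and parts.netloc:
        # Already has both a scheme and a domain, so leave it alone.
        return href
    # urljoin takes the missing scheme and domain from the page URL.
    return urljoin(page_url, href)


print(to_absolute("https://www.google.com", "/search?q=stack+overflow"))
print(to_absolute("https://www.google.com", "https://www.facebook.com"))
```

In the question's loop, this would be applied to each `href` before it is passed to `urlopen`, with the URL of the page currently being parsed as `page_url`.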

Answer 2

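The error occurs because the link is not a valid absolute URL. You can try adding "https://bing.com" to the beginning of the URL, or you can catch the error.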

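To catch the error:

try:
    # urlopen is the call that raises ValueError for a relative URL
    html = urllib.request.urlopen(url, context=ctx)
except ValueError:
    print("Invalid URL")

To prepend the domain instead, put "https://bing.com" in front of the href value before opening it.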

Answer 3

https://docs.python.org/3/tutorial/errors.html#errors-and-exceptions

See the Python documentation on errors and exceptions.

You can put the try/except inside the loop:

for i in range(process):
    try:
        "line of code causes the problem"
    except ValueError:
        print("invalid url")
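As a more concrete version of the loop above, here is a small sketch that skips links `urllib` cannot handle. `keep_absolute` is a hypothetical name, and using `urllib.request.Request` to validate a URL (it parses the URL without opening a connection) is an assumption made for this illustration, not something taken from the answers.

```python
import urllib.request


def keep_absolute(hrefs):
    """Return only the hrefs that urllib accepts as full URLs.

    keep_absolute is a hypothetical helper name used only for this sketch.
    """
    good = []
    for href in hrefs:
        try:
            # Building a Request parses the URL without touching the network;
            # a relative href raises ValueError ("unknown url type: ...").
            urllib.request.Request(href)
        except ValueError:
            print("invalid url:", href)
            continue  # skip this link and move on to the next one
        good.append(href)
    return good


links = ["/search?q=Cluster+analysis%20wikipedia&FORM=WIKIRE",
         "https://www.bing.com/"]
print(keep_absolute(links))
```

Catching the error only skips the relative link; if that link is the one you want to follow, it still has to be turned into a full URL first.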

Hope this helps.

Getting ValueError from extracted links

3 answers: