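NameError: name 'url_data' is not defined

Time: 2019-05-03 06:22:16

Tags: python python-3.x web-scraping

I am trying to use the code below to search for a keyword in a given URL (an internal website at work), but I keep getting an error. It works fine on public sites.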

from html.parser import HTMLParser
import urllib.request

class CustomHTMLParser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.tag_flag = False
        self.tag_line_num = 0
        self.tag_string = 'temporary_tag'

    def initiate_vars(self, tag_string):
        self.tag_string = tag_string

    def handle_starttag(self, tag, attrs):
        #if tag == 'tag_to_search_for':
        if tag == self.tag_string:
            self.tag_flag = True
            self.tag_line_num = self.getpos()


if __name__== '__main__':
    #simple_str = 'string_to_search_for'
    simple_str = 'Host Status'

    my_url = 'TEST_URL'

    parser_obj = CustomHTMLParser()

    #parser_obj.initiate_vars('tag_to_search_for')
    parser_obj.initiate_vars('script')

    #html_file = open('location_of_html_file//file.html')
    my_request = urllib.request.Request(my_url)

    try:
        url_data = urllib.request.urlopen(my_request)
    except:
        print("There was some error opening the URL")

    html_str = url_data.read().decode('utf8')
    #html_str = html_file.read()

    #print (html_str)

    html_search_result = html_str.lower().find(simple_str.lower())
    if html_search_result != -1:
        print ('The word {} was found'.format(simple_str))
    else:
        print ('The word {} was not found'.format(simple_str))

    parser_obj.feed(html_str)

    if parser_obj.tag_flag:
        print ('Tag {0} was found at position {1}'.format(parser_obj.tag_string, parser_obj.tag_line_num))
    else:
        print ('Tag {} was not found'.format(parser_obj.tag_string))

But I keep getting this error message:

There was some error opening the URL
Traceback (most recent call last):
  File "C:\TEMP\parse.py", line 40, in <module>
    html_str = url_data.read().decode('utf8')
NameError: name 'url_data' is not defined

I believe I have tried this with urllib2 and with Python v3.7.

I'm not sure what to do. Is it worth trying to set a user_agent?

EDIT1: I have now tried the following:

>>> import urllib
>>> url = urllib.request.urlopen('https://concernedURL.com')

and I get this error: "urllib.error.HTTPError: HTTP Error 401: Unauthorized". Should I be sending the headers from my browser, as well as an SSL certificate?

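1 answer:

Answer 0 (score: 1)

The problem is that an error is raised inside the try block, so the assignment never happens and the url_data variable is left undefined: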

try:
    # if this errors, no url_data will exist
    url_data = urllib.request.urlopen(my_request)
except:
    # really bad to catch all exceptions!
    print("There was some error opening the URL")

html_str = url_data.read().decode('utf8')

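You should either remove the try-except entirely or handle the error properly. A bare except with no specific exception type is almost never advisable: it also catches things like KeyboardInterrupt and SystemExit, and it hides the real cause of the failure.

In this case, if the requested URL cannot be opened, the program should probably stop, because there is no point in trying to operate on the URL's data when opening it failed in the first place.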
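As a minimal sketch of what "handling the error properly" could look like, the function below catches only the exceptions that urllib raises when an open fails and stops the program with a readable message. The function name `fetch_html` and its `opener` parameter are hypothetical and not part of the original code; `opener` is there only so the function can be tried without network access.

```python
import sys
import urllib.error
import urllib.request


def fetch_html(url, opener=urllib.request.urlopen):
    """Return the decoded body of `url`, or exit with a clear message.

    `fetch_html` and `opener` are hypothetical names for this sketch;
    `opener` defaults to the real urllib.request.urlopen.
    """
    request = urllib.request.Request(url)
    try:
        response = opener(request)
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status (e.g. 401 Unauthorized).
        # HTTPError is a subclass of URLError, so it has to be caught first.
        sys.exit('HTTP error {} for {}: {}'.format(err.code, url, err.reason))
    except urllib.error.URLError as err:
        # The server could not be reached at all (DNS failure, refused connection, ...).
        sys.exit('Could not reach {}: {}'.format(url, err.reason))
    # Only reached when the open succeeded, so `response` is always defined here.
    return response.read().decode('utf8')
```

In the question's script, `html_str = fetch_html(my_url)` could stand in for the try/except and the following `read()` call. With the internal site from EDIT1, this would presumably stop with the 401 status in the message instead of a NameError, which points at authentication rather than at the parsing code.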