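I am trying to search for a keyword in a given URL (an internal site at work) using the code below, but I keep getting an error. The same code works fine on public sites.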
from html.parser import HTMLParser
import urllib.request


class CustomHTMLParser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.tag_flag = False
        self.tag_line_num = 0
        self.tag_string = 'temporary_tag'

    def initiate_vars(self, tag_string):
        self.tag_string = tag_string

    def handle_starttag(self, tag, attrs):
        #if tag == 'tag_to_search_for':
        if tag == self.tag_string:
            self.tag_flag = True
            self.tag_line_num = self.getpos()


if __name__ == '__main__':
    #simple_str = 'string_to_search_for'
    simple_str = 'Host Status'
    my_url = 'TEST_URL'
    parser_obj = CustomHTMLParser()
    #parser_obj.initiate_vars('tag_to_search_for')
    parser_obj.initiate_vars('script')
    #html_file = open('location_of_html_file//file.html')
    my_request = urllib.request.Request(my_url)
    try:
        url_data = urllib.request.urlopen(my_request)
    except:
        print("There was some error opening the URL")
    html_str = url_data.read().decode('utf8')
    #html_str = html_file.read()
    #print (html_str)
    html_search_result = html_str.lower().find(simple_str.lower())
    if html_search_result != -1:
        print('The word {} was found'.format(simple_str))
    else:
        print('The word {} was not found'.format(simple_str))
    parser_obj.feed(html_str)
    if parser_obj.tag_flag:
        print('Tag {0} was found at position {1}'.format(parser_obj.tag_string, parser_obj.tag_line_num))
    else:
        print('Tag {} was not found'.format(parser_obj.tag_string))
But I keep getting this error:
There was some error opening the URL
Traceback (most recent call last):
  File "C:\TEMP\parse.py", line 40, in <module>
    html_str = url_data.read().decode('utf8')
NameError: name 'url_data' is not defined
I believe I have tried this with urllib2 as well, and I am on Python v3.7.
I am not sure what to do next. Is it worth trying to set a user agent?
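EDIT1: I have now tried the following:
>>> import urllib
>>> url = urllib.request.urlopen('https://concernedURL.com')
and I get this error: "urllib.error.HTTPError: HTTP Error 401: Unauthorized". Should I be sending the headers from my browser, along with the SSL certificate?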
Answer 0 (score: 1)
The problem is that an error is raised inside the try block, which leaves the url_data variable undefined:
try:
    # if this errors, no url_data will exist
    url_data = urllib.request.urlopen(my_request)
except:
    # really bad to catch all exceptions!
    print("There was some error opening the URL")

html_str = url_data.read().decode('utf8')
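You should either remove the try-except entirely or handle the error properly. Using a bare except that names no specific exception is almost never a good idea, because it swallows every kind of error and hides the real cause.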
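In this case, if the requested URL cannot be opened, the program should probably stop, because there is no point in trying to work with the URL's data when opening it failed in the first place.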
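As a rough sketch of what "handle the error properly" could look like, the helper below catches only the exceptions urllib raises for a failed request and stops the program with the real reason instead of a generic message. The function name fetch_html and the 10-second timeout are my own choices for illustration, not something from your code, and this only reports the failure; it does not solve the 401 itself.

```python
import sys
import urllib.error
import urllib.request


def fetch_html(url, timeout=10):
    """Return the decoded body of `url`, or exit if it cannot be opened.

    Hypothetical helper: the name and the default timeout are assumptions.
    """
    request = urllib.request.Request(url)
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.read().decode('utf8')
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status (for example 401).
        sys.exit('HTTP error {} for {}: {}'.format(err.code, url, err.reason))
    except urllib.error.URLError as err:
        # No usable response at all (DNS failure, refused connection, ...).
        sys.exit('Could not reach {}: {}'.format(url, err.reason))
```

With something like this, the line that reads the page becomes html_str = fetch_html(my_url), and the rest of the script only runs when there is data to search. HTTPError is caught first because it is a subclass of URLError.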