I am trying to extract the value of the onclick attribute, but the code below gives me an empty list. What am I doing wrong?
from html.parser import HTMLParser
from urllib.request import urlopen
from re import findall,search

class myParser(HTMLParser):
    sites = []

    def handle_starttag(self, tag, attrs):
        if tag == 'div':
            for attr in attrs:
                if attr[0] == 'onclick':
                    self.sites.append(attr[1])

    def get_sites(self):
        return self.sites

website = urlopen('https://www.classicshorts.com/stories/')
html = website.read()
content = html.decode()
# Raw string so the regex backslashes are not treated as string escapes.
url = findall(r'"openStory\(\'/stories/[a-z]+\.[a-z]+\'\)"',content)
print(url)
Parser = myParser()
link = Parser.feed(content)
sites = Parser.get_sites()
But this code gives me an empty list.
The URLs I want to extract appear in markup like this:
<div onclick="openStory('/stories/aos.html')" class="storylisting">
<span style="font-weight:bold;">An Affair of State</span>--Guy de Maupassant (1850-1893)
</div>
Important: in case it helps, all the links share the same class='storylisting'.
Answer 0 (score: 0)
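Your regex is not the problem; the page is. Print `content` and see for yourself: it contains no onclick="openStory('...')" anywhere. In any case, testing your regex with the code below (I only added parentheses so that it captures just the actual link) works fine: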
from re import findall

content = r"""<div onclick="openStory('/stories/aos.html')" class="storylisting">
<span style="font-weight:bold;">An Affair of State</span>--Guy de Maupassant (1850-1893)
</div>"""
# The parentheses form a capture group, so findall returns only the path.
url = findall(r'"openStory\(\'(/stories/[a-z]+\.[a-z]+)\'\)"',content)
print(url)
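If you would rather stay with HTMLParser, as in the question, here is a minimal sketch of the same extraction. It assumes the markup looks like the sample div from the question; the class name StoryLinkParser and the constant OPEN_STORY are hypothetical names chosen for illustration. Like the regex, it can only find links that are actually present in the HTML you feed it.

```python
from html.parser import HTMLParser
import re

# Captures the path inside openStory('...'), e.g. /stories/aos.html.
OPEN_STORY = re.compile(r"openStory\('(/stories/[a-z]+\.[a-z]+)'\)")


class StoryLinkParser(HTMLParser):  # hypothetical name, not part of any library
    def __init__(self):
        super().__init__()
        self.sites = []  # one list per parser instance

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        attributes = dict(attrs)
        # Only look at the divs that carry class="storylisting".
        if attributes.get("class") != "storylisting":
            return
        match = OPEN_STORY.search(attributes.get("onclick") or "")
        if match:
            self.sites.append(match.group(1))


sample = """<div onclick="openStory('/stories/aos.html')" class="storylisting">
<span style="font-weight:bold;">An Affair of State</span>--Guy de Maupassant (1850-1893)
</div>"""

parser = StoryLinkParser()
parser.feed(sample)
print(parser.sites)
```

Keeping `sites` as an instance attribute set in `__init__` avoids sharing one list between several parser objects, which is what a class-level `sites = []` would do.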
Edit: added a piece of code that collects all the story URLs
from urllib.request import urlopen
from re import findall

base_url = "https://www.classicshorts.com"
link_pages = ["/abc/a-d.html", "/abc/e-h.html", "/abc/i-m.html", "/abc/n-s.html", "/abc/t-z.html"]
for page in link_pages:
    # Download each index page and print every story link found on it.
    content = urlopen(base_url + page).read().decode()
    for story_url in findall(r'"openStory\(\'(/stories/[a-z]+\.[a-z]+)\'\)"',content):
        print(base_url + story_url)
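To check the extraction step without hitting the network, the regex part can be pulled into a small function. This is only a sketch: the names extract_story_urls, BASE_URL and STORY_PATTERN are made up for illustration, and dropping duplicate links is my own assumption about what you want, not something the site requires.

```python
import re
from urllib.parse import urljoin

BASE_URL = "https://www.classicshorts.com"
# Same pattern as above, written with triple quotes to avoid escaping the single quotes.
STORY_PATTERN = re.compile(r'''"openStory\('(/stories/[a-z]+\.[a-z]+)'\)"''')


def extract_story_urls(html, base_url=BASE_URL):
    """Return absolute story URLs found in html, in order, without duplicates."""
    urls = []
    for path in STORY_PATTERN.findall(html):
        url = urljoin(base_url, path)
        if url not in urls:  # assumption: repeated links are not wanted
            urls.append(url)
    return urls


sample = """<div onclick="openStory('/stories/aos.html')" class="storylisting"></div>
<div onclick="openStory('/stories/aos.html')" class="storylisting"></div>"""

print(extract_story_urls(sample))
```

In the loop above you would then call `extract_story_urls(content)` for each downloaded page instead of calling `findall` directly.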