Question

I'm trying to get the value of the onclick attribute, but the code below gives me an empty list. What am I doing wrong?

from html.parser import HTMLParser
from urllib.request import urlopen
from re import findall,search

class myParser(HTMLParser):
    sites = []
    def handle_starttag(self, tag, attrs):
        if tag == 'div':
            for attr in attrs:
                if attr[0] == 'onclick':

                    self.sites.append(attr[1])

    def get_sites(self):
        return self.sites

website = urlopen('https://www.classicshorts.com/stories/')
html = website.read()
content = html.decode()
url = findall('"openStory\(\'/stories/[a-z]+\.[a-z]+\'\)"',content)
print(url)

Parser = myParser()
link = Parser.feed(content)
sites = Parser.get_sites()

But this code gives me an empty list.

The URLs I want to extract appear in markup like this:

<div onclick="openStory('/stories/aos.html')" class="storylisting">
                <span style="font-weight:bold;">An Affair of State</span>--Guy de Maupassant (1850-1893)
            </div>

Important note, in case it helps: all of the links share the same class='storylisting'.

Answer 1

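The problem isn't your regular expression, it's your page. Print content and see for yourself: there is no onclick="openStory('...')" anywhere in it. In any case, your regex works fine when tested with the code below (the only change to the pattern is the added parentheses, so that it captures just the actual link):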

from re import findall,search

content = r"""<div onclick="openStory('/stories/aos.html')" class="storylisting">
<span style="font-weight:bold;">An Affair of State</span>--Guy de Maupassant (1850-1893)
</div>"""

url = findall(r'"openStory\(\'(/stories/[a-z]+\.[a-z]+)\'\)"', content)
print(url)
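The HTMLParser approach from the question also works once it is fed markup that actually contains the attribute. Below is a minimal sketch under that assumption, run against the sample snippet rather than the live page. The class name OnclickParser and the storylisting filter are illustrative choices, not part of the original code; the list is created per instance in __init__ so that separate parsers do not share results.

```python
from html.parser import HTMLParser
import re


class OnclickParser(HTMLParser):
    """Collect the onclick value of every <div class="storylisting">."""

    def __init__(self):
        super().__init__()
        self.sites = []  # per-instance list, not shared between parsers

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)  # attrs arrives as a list of (name, value) pairs
        if tag == "div" and attrs.get("class") == "storylisting" and "onclick" in attrs:
            self.sites.append(attrs["onclick"])


sample = """<div onclick="openStory('/stories/aos.html')" class="storylisting">
<span style="font-weight:bold;">An Affair of State</span>--Guy de Maupassant (1850-1893)
</div>"""

parser = OnclickParser()
parser.feed(sample)  # feed() returns None; results accumulate on the parser
onclick_values = parser.sites

# Pull the path out of each openStory('...') call.
paths = []
for value in onclick_values:
    match = re.search(r"openStory\('([^']+)'\)", value)
    if match:
        paths.append(match.group(1))

print(onclick_values)
print(paths)
```

Note that feed() has no return value, so the collected values have to be read from the parser object afterwards.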

Edit: added a snippet that collects all of the story URLs

from urllib.request import urlopen
from re import findall

base_url = "https://www.classicshorts.com"
link_pages = ["/abc/a-d.html", "/abc/e-h.html", "/abc/i-m.html", "/abc/n-s.html", "/abc/t-z.html"]

for page in link_pages:
    content = urlopen(base_url + page).read().decode()
    for story_url in findall(r'"openStory\(\'(/stories/[a-z]+\.[a-z]+)\'\)"', content):
        print(base_url + story_url)
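If it helps to check the extraction without hitting the network, the same pattern can be wrapped in a small function and run on a string. This is only a sketch: the function name extract_story_urls, the use of urljoin, and the duplicate filtering are assumptions added for illustration, not part of the code above.

```python
from re import findall
from urllib.parse import urljoin

# Same pattern as above, written as a raw string.
STORY_PATTERN = r'"openStory\(\'(/stories/[a-z]+\.[a-z]+)\'\)"'


def extract_story_urls(content, base_url="https://www.classicshorts.com"):
    """Return absolute story URLs found in one page of HTML, in page order."""
    urls = []
    for path in findall(STORY_PATTERN, content):
        url = urljoin(base_url, path)
        if url not in urls:  # skip duplicates (an added assumption)
            urls.append(url)
    return urls


sample = '<div onclick="openStory(\'/stories/aos.html\')" class="storylisting"></div>'
print(extract_story_urls(sample))
```

In the loop above, the call would be extract_story_urls(content) for each downloaded page.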
