Question

I'm trying to get a link to a sound file from a website. The site is http://dictionary.reference.com/browse/would?s=t

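I'm using the following code to get the link, but the result comes up blank. This is strange, because I can use a similar setup to pull stock data. My idea is to build a program that plays the sound of a word and then asks for its spelling. This is very much for my kids. I need to go through a list of words and get the link for each one from the dictionary, but I can't get the links to print. I'm using urllib and re to write the code.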

import urllib
import re
words = [ "would","your", "apple", "orange"]

for word in words:
    urll = "http://dictionary.reference.com/browse/" + word + "?s=t" #produces link
    htmlfile = urllib.urlopen(urll)
    htmltext = htmlfile.read()
    regex = '<a class="speaker" href =>(.+?)</a>' #puts tag together
    pattern = re.compile(regex)
    link = re.findall(pattern, htmltext)
    print "the link for the word", word, link #should print link

The expected output for the word is http://static.sfdict.com/staticrep/dictaudio/W02/W0245800.mp3

Answer 1

You should fix your regular expression so that it captures everything inside the href attribute value:

<a class="speaker" href="(.*?)"
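To see why this helps: the pattern in the question looks for the literal text `href =>`, which does not match markup of the form `href="..."`, so `re.findall` returns an empty list. Below is a minimal sketch of the corrected pattern run against a sample string. The `sample_html` markup is an assumption modeled on the tag described in the question, not the real page source, and the sketch is written for Python 3.

```python
import re

# Hypothetical sample markup (an assumption); the real page may differ.
sample_html = (
    '<div><a class="speaker" '
    'href="http://static.sfdict.com/staticrep/dictaudio/W02/W0245800.mp3">'
    'listen</a></div>'
)

# Capture everything between the quotes of the href attribute.
# The lazy quantifier (.*?) stops at the first closing quote.
pattern = re.compile(r'<a class="speaker" href="(.*?)"')

links = pattern.findall(sample_html)
print(links)
```

Note that this pattern still depends on the exact attribute order and spacing in the tag, which is one reason a parser tends to be more robust.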

Note that you should really consider switching from regex to HTML parsers, such as BeautifulSoup.

Here is how BeautifulSoup could be applied in this case:

import urllib

from bs4 import BeautifulSoup

words = ["would","your", "apple", "orange"]

for word in words:
    urll = "http://dictionary.reference.com/browse/" + word + "?s=t" #produces link
    htmlfile = urllib.urlopen(urll)

    soup = BeautifulSoup(htmlfile, "html.parser")
    links = [link["href"] for link in soup.select("a.speaker")]

    print(word, links)
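If installing BeautifulSoup is not an option, the same parser-based idea can be sketched with the standard library alone. This is an alternative to the answer's approach, not part of it: it assumes Python 3, and the names `SpeakerLinkParser`, `extract_speaker_links`, and the `sample_html` string are hypothetical, introduced only for illustration.

```python
from html.parser import HTMLParser


class SpeakerLinkParser(HTMLParser):
    """Collects href values from <a> tags whose class list contains 'speaker'."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr_map = dict(attrs)
        # A class attribute may hold several space-separated names.
        classes = (attr_map.get("class") or "").split()
        if "speaker" in classes and attr_map.get("href"):
            self.links.append(attr_map["href"])


def extract_speaker_links(html_text):
    """Return all speaker link targets found in the given HTML text."""
    parser = SpeakerLinkParser()
    parser.feed(html_text)
    return parser.links


# Hypothetical sample markup (an assumption), not the real page source.
sample_html = (
    '<p><a class="speaker audio" href="http://example.com/a.mp3">x</a>'
    '<a href="/other">y</a></p>'
)
print(extract_speaker_links(sample_html))
```

In a real run, `html_text` would be the decoded response body for each word's page rather than a hard-coded string.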

How to get a link from an HTML class using Python
