Question

I'm using BeautifulSoup in a simple function to extract links that have all-uppercase text:

import BeautifulSoup  # BeautifulSoup 3 module, matching the calls below

def findAllCapsUrls(page_contents):
    """ given HTML, returns a list of URLs that have ALL CAPS text
    """
    soup = BeautifulSoup.BeautifulSoup(page_contents)
    all_urls = soup.findAll(name='a')

    # if the text for the link is ALL CAPS then add the link to good_urls
    good_urls = []
    for url in all_urls:
        text = url.find(text=True)
        # skip anchors that have no text node or no href attribute
        if text and text.upper() == text and url.get('href'):
            good_urls.append(url['href'])

    return good_urls

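This works well most of the time, but a handful of pages do not parse correctly in BeautifulSoup (or lxml, which I also tried) because of malformed HTML on the page, resulting in an object with no (or only some) links in it. A "handful" may not sound like a big deal, but this function is used in a crawler, so there could be hundreds of pages that the crawler never finds...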

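How can the function above be refactored so that it does not use a parser like BeautifulSoup? I've searched for how to do this with regular expressions, but all the answers say "use BeautifulSoup." Alternatively, I started looking at how to "fix" the malformed HTML so that it parses, but I don't think that is the best route...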

What is an alternative solution, using re or something else, that does the same thing as the function above?

Answer 1

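If the HTML pages are malformed, there are not many solutions that can really help you. BeautifulSoup or another parsing library is the way to go for parsing HTML files.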

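If you want to avoid the library route, you could use a regular expression to match all your links (see regular-expression-to-extract-url-from-an-html-link), using a range of [A-Z] to check for uppercase text.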
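A minimal sketch of that regex-only route, using only the standard library. The pattern and the name find_all_caps_urls_re are my own assumptions, not taken from the linked question; the pattern only handles quoted href values and is not a full HTML grammar, so treat it as a starting point.

```python
import re
from html import unescape

# Assumed pattern: <a ... href="..."> or <a ... href='...'>, then the anchor
# body up to the next </a>. Unquoted href values are not matched.
ANCHOR_RE = re.compile(
    r'<a\b[^>]*?\bhref\s*=\s*(["\'])(.*?)\1[^>]*>(.*?)</a\s*>',
    re.IGNORECASE | re.DOTALL,
)
TAG_RE = re.compile(r'<[^>]+>')  # crude removal of tags nested inside the anchor


def find_all_caps_urls_re(page_contents):
    """Return hrefs of anchors whose text is ALL CAPS, without an HTML parser."""
    good_urls = []
    for _quote, href, inner in ANCHOR_RE.findall(page_contents):
        # Drop nested tags, decode entities such as &amp;, trim whitespace.
        text = unescape(TAG_RE.sub('', inner)).strip()
        # Require at least one letter in [A-Z] and no lowercase letters.
        if re.search(r'[A-Z]', text) and text == text.upper():
            good_urls.append(href)
    return good_urls
```

Requiring at least one [A-Z] match keeps links whose text is only digits or punctuation out of the result, which the plain text.upper() == text comparison would let through.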

Answer 2

When I need to parse really broken HTML and speed is not the most important factor, I automate a browser with selenium & webdriver.

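This is the most robust way of parsing HTML that I know of. Check this tutorial; it shows how to extract Google suggestions using webdriver (the code is in Java, but it can be changed to Python).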
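A rough Python sketch of the browser approach, assuming a Selenium WebDriver instance is passed in. The function name find_all_caps_urls_rendered is hypothetical. The driver is a parameter so the function itself does not depend on any particular browser setup.

```python
def find_all_caps_urls_rendered(driver, url):
    """Load `url` in a real browser and return hrefs of links whose
    rendered text is ALL CAPS.

    `driver` is assumed to be a Selenium WebDriver (or any object with the
    same get / find_elements methods).
    """
    driver.get(url)
    good_urls = []
    # "tag name" is the locator string that Selenium exposes as By.TAG_NAME.
    for link in driver.find_elements("tag name", "a"):
        text = (link.text or "").strip()
        href = link.get_attribute("href")
        has_letter = any(c.isalpha() for c in text)
        if href and has_letter and text == text.upper():
            good_urls.append(href)
    return good_urls
```

With Selenium installed, a driver can be created with something like webdriver.Firefox() and closed with driver.quit() afterwards. Note that the browser reports the text as rendered and the href as a resolved absolute URL, so the results can differ slightly from what a parser sees in the raw source.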

Answer 3

I ended up with a combination of regex and BeautifulSoup:

import re
import BeautifulSoup  # BeautifulSoup 3 module, matching the calls below

def findAllCapsUrls2(page_contents):
    """ returns a list of URLs that have ALL CAPS text, given
    the HTML from a page. Uses a combo of RE and BeautifulSoup
    to handle malformed pages.
    """
    # get all anchors on page using regex
    p = r'<a\s+href\s*=\s*"([^"]*)"[^>]*>(.*?(?=</a>))</a>'
    re_urls = re.compile(p, re.DOTALL)
    all_a = re_urls.findall(page_contents)

    # if the text for the anchor is ALL CAPS then add the link to good_urls
    good_urls = []
    for a in all_a:
        href = a[0]
        a_content = a[1]
        a_soup = BeautifulSoup.BeautifulSoup(a_content)
        text = ''.join([s.strip() for s in a_soup.findAll(text=True) if s])
        if text and text.upper() == text:
            good_urls.append(href)

    return good_urls

So far this works for my use case, but I don't guarantee that it works on all pages. Also, I only use this function if the original one fails.
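The fallback arrangement could look roughly like the sketch below. The helper name find_with_fallback is hypothetical, and treating "fails" as "raises an exception or returns no links" is my assumption about what the fallback condition should be.

```python
def find_with_fallback(page_contents, primary, fallback):
    """Try `primary` first; use `fallback` if it raises or finds no links.

    `primary` and `fallback` are any callables that take the page HTML and
    return a list of URLs.
    """
    try:
        urls = primary(page_contents)
    except Exception:
        # Assumption: any parser error counts as a failure.
        urls = []
    return urls if urls else fallback(page_contents)
```

With the two functions from this thread, the call would be find_with_fallback(page_contents, findAllCapsUrls, findAllCapsUrls2). A page that genuinely has no ALL CAPS links will also trigger the fallback, which costs a second pass but does not change the result.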

How can I find links with all-uppercase text using Python (without a third-party parser)?
