How do I get the href from a list of tags using a Python script?

Time: 2019-03-14 13:41:20

Tags: html python-3.x

I have an HTML file, "links.html", and I want to extract the href //www.medicineindia.org/medicine-brand-details/8414/capicare for the string CAPICARE from it. How can I do this with a Python script?

The code of "links.html" is:

getCurrentUrl

1 answer:

Answer 0: (score: 0)

You can do this with a "simple" regular expression that makes use of capturing (and non-capturing) groups:

import re

html = ('<a itemprop="url" href="//www.medicineindia.org/medicine-brand'
        '-details/12220/cholstig"><span itemprop="name">CHOLSTIG</span></a><a '
        'itemprop="url" href="//www.medicineindia.org/medicine-brand-details'
        '/8414/capicare"><span itemprop="name">CAPICARE</span></a><a '
        'itemprop="url" href="//www.medicineindia.org/medicine-brand-details'
        '/230/cyclozobid"><span itemprop="name">CYCLOZOBID</span></a><a '
        'itemprop="url" href="//www.medicineindia.org/medicine-brand-details'
        '/6855/cinkona"><span itemprop="name">CINKONA</span></a>')

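# Group 1 captures the href value and group 2 the text inside the <span>;
# the (?:...) groups only anchor the match and are not returned by findall.
regex = '(?:href=")([^"]+)(?:.*?<span.*?>)(.*?)(?:</span>)'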

matches = re.findall(regex, html)

for m in matches:
    print(f'Brand: {m[1]}, URL: {m[0]}')

This outputs the following:

Brand: CHOLSTIG, URL: //www.medicineindia.org/medicine-brand-details/12220/cholstig
Brand: CAPICARE, URL: //www.medicineindia.org/medicine-brand-details/8414/capicare
Brand: CYCLOZOBID, URL: //www.medicineindia.org/medicine-brand-details/230/cyclozobid
Brand: CINKONA, URL: //www.medicineindia.org/medicine-brand-details/6855/cinkona

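The output is formatted by iterating over matches, the list of tuples in which each link is paired with the content of its corresponding span.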
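Since the question asks for the href of one specific string (CAPICARE), the tuples can also be turned into a lookup table. The following is only a minimal sketch built on the same regular expression; the helper name find_href is hypothetical, and the markup is a shortened sample of the answer's string rather than the full file.

```python
import re

# Shortened sample of the markup used in the answer (two of the four links).
html = ('<a itemprop="url" href="//www.medicineindia.org/medicine-brand-details'
        '/12220/cholstig"><span itemprop="name">CHOLSTIG</span></a><a '
        'itemprop="url" href="//www.medicineindia.org/medicine-brand-details'
        '/8414/capicare"><span itemprop="name">CAPICARE</span></a>')

# Same pattern as in the answer: group 1 is the href, group 2 the span text.
regex = '(?:href=")([^"]+)(?:.*?<span.*?>)(.*?)(?:</span>)'


def find_href(markup, brand):
    """Return the href paired with `brand`, or None if it is absent.

    Hypothetical helper, not part of the original answer.
    """
    # findall yields (url, name) tuples; flip them into a name -> url mapping.
    links = {name: url for url, name in re.findall(regex, markup)}
    return links.get(brand)


capicare_url = find_href(html, 'CAPICARE')
print(capicare_url)
```

To run this against the actual file, the markup could presumably be read with open('links.html', encoding='utf-8') and passed to find_href; the encoding is an assumption. The lookup is case-sensitive and keeps only the last URL if a name appears more than once.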