I have an HTML file, "links.html", and I want to extract the href //www.medicineindia.org/medicine-brand-details/8414/capicare for the string CAPICARE. How can I do this with a Python script?
The code of "links.html" is:
Answer 0: (score: 0)
You can do this with a "simple" regular expression that makes use of capturing (and non-capturing) groups:
import re

# Sample markup: each <a> carries the URL in href and the brand name in a <span>.
html = ('<a itemprop="url" href="//www.medicineindia.org/medicine-brand'
        '-details/12220/cholstig"><span itemprop="name">CHOLSTIG</span></a><a '
        'itemprop="url" href="//www.medicineindia.org/medicine-brand-details'
        '/8414/capicare"><span itemprop="name">CAPICARE</span></a><a '
        'itemprop="url" href="//www.medicineindia.org/medicine-brand-details'
        '/230/cyclozobid"><span itemprop="name">CYCLOZOBID</span></a><a '
        'itemprop="url" href="//www.medicineindia.org/medicine-brand-details'
        '/6855/cinkona"><span itemprop="name">CINKONA</span></a>')

# Group 1 captures the href value; group 2 captures the span text.
# The (?:...) groups are non-capturing, so findall returns 2-tuples.
regex = r'(?:href=")([^"]+)(?:.*?<span.*?>)(.*?)(?:</span>)'
matches = re.findall(regex, html)
for m in matches:
    print(f'Brand: {m[1]}, URL: {m[0]}')
This outputs the following:
Brand: CHOLSTIG, URL: //www.medicineindia.org/medicine-brand-details/12220/cholstig
Brand: CAPICARE, URL: //www.medicineindia.org/medicine-brand-details/8414/capicare
Brand: CYCLOZOBID, URL: //www.medicineindia.org/medicine-brand-details/230/cyclozobid
Brand: CINKONA, URL: //www.medicineindia.org/medicine-brand-details/6855/cinkona
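The output is formatted by iterating over the list of tuples matches, in which each link is paired with the content of its corresponding span.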
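Since the question asks only for the href that belongs to CAPICARE, the same pattern can be wrapped in a small lookup. The following is a minimal sketch, not part of the original answer: the helper name find_href, the case-insensitive comparison, and the shortened sample string are assumptions made for illustration.

```python
import re

# Same pattern as in the answer: group 1 is the href, group 2 the span text.
LINK_RE = re.compile(r'(?:href=")([^"]+)(?:.*?<span.*?>)(.*?)(?:</span>)')

def find_href(html, brand):
    """Return the href whose span text equals `brand`, or None if absent.

    Hypothetical helper; the comparison ignores case and surrounding spaces.
    """
    for url, name in LINK_RE.findall(html):
        if name.strip().upper() == brand.upper():
            return url
    return None

# Shortened sample in the same shape as the markup used in the answer.
sample = ('<a itemprop="url" href="//www.medicineindia.org/medicine-brand-details'
          '/12220/cholstig"><span itemprop="name">CHOLSTIG</span></a>'
          '<a itemprop="url" href="//www.medicineindia.org/medicine-brand-details'
          '/8414/capicare"><span itemprop="name">CAPICARE</span></a>')

print(find_href(sample, 'CAPICARE'))

# To run it against the real file (the UTF-8 encoding is an assumption):
# with open('links.html', encoding='utf-8') as f:
#     print(find_href(f.read(), 'CAPICARE'))
```

Note that "." does not match newlines by default, so if the href and its span sit on different lines in the file, the pattern would likely need re.DOTALL. For markup less regular than this sample, an HTML parser is usually a safer choice than a regular expression.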