我正在尝试使用Python正则表达式从下面的HTML标记获取id='revSAR'
的所有网址:
<a id='revSAR' href='http://www.amazon.com/Altec-Lansing-inMotion-Mobile-Speaker/product-reviews/B000EDKP8U/ref=cm_cr_dp_see_all_summary?ie=UTF8&showViewpoints=1&sortBy=byRankDescending' class='txtsmall noTextDecoration'>
See all 136 customer reviews
</a>
我尝试了下面的代码,但它不起作用(它什么都不打印):
regex = b'<a id="revSAR" href="(.+?)" class="txtsmall noTextDecoration">(.+?)</a>'
pattern=re.compile(regex)
rev_url=re.findall(pattern,txt)
print ('reviews url: ' + str(rev_url))
答案 0 :(得分:1)
您可以尝试类似
的内容(_, url), = re.findall(r'href=([\'"]*)(\S+)\1', input)
print url
但是,我个人宁愿使用像BeautifulSoup之类的HTML解析库来完成这样的任务。
答案 1 :(得分:0)
您无需匹配id=...
,href=...
等不必要的部分,请尝试以下方法:
regex = 'http://.*\'\s+'
答案 2 :(得分:0)
首先,为什么你的正则表达式不起作用?在您的html中,使用单引号引用属性,其中在regex中引用双引号。而你只需要关心href属性。尝试使用href=['"](.+?)['"]
作为正则表达式,并且如果使用忽略大小写切换
但是再次使用正则表达式解析html是一个非常糟糕的决定。请浏览this
答案 3 :(得分:0)
此exprssion将:
revSAR
<a(?=\s|>)(?=(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?\sid=(['"]?)revSAR\1(?:\s|>))
(?=(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?\shref=(['"]?)(.*?)\2(?:\s|>))(?:[^>=]|='(?:[^']|\\')*'|="(?:[^"]|\\")*"|=[^'"][^\s>]*)*>\s*(.*?)\s*<\/a>
现场演示
示例文字
注意这里的第一对锚标签有一些非常困难的边缘情况。
<a onmouseover=' id="revSAR" ; href="http://www.NotYourURL.com" ; if (3 <href&& href="http://www.NotYourURL.com" && 6>3) { funRotate(href) ; } ; ' href='http://www.amazon.com/Altec-Lansing-inMotion-Mobile-Speaker/product-reviews/B000EDKP8U/ref=cm_cr_dp_see_all_summary?ie=UTF8&showViewpoints=1&sortBy=byRankDescending' class='txtsmall noTextDecoration'>
You shouldn't find me
</a>
<a onmouseover=' img = 10; href="http://www.NotYourURL.com" ; if (3 <href&& href="http://www.NotYourURL.com" && 6>3) { funRotate(href) ; } ; ' id='revSAR' href='http://www.amazon.com/Altec-Lansing-inMotion-Mobile-Speaker/product-reviews/B000EDKP8U/ref=cm_cr_dp_see_all_summary?ie=UTF8&showViewpoints=1&sortBy=byRankDescending' class='txtsmall noTextDecoration'>
See all 111 customer reviews
</a>
<a id='revSAR' href='http://www.amazon.com/Altec-Lansing-inMotion-Mobile-Speaker/product-reviews/B000EDKP8U/ref=cm_cr_dp_see_all_summary?ie=UTF8&showViewpoints=1&sortBy=byRankDescending' class='txtsmall noTextDecoration'>
See all 136 customer reviews
</a>
<强>匹配强>
组0获取整个锚标记
第1组获取围绕id属性的引用,稍后使用该引用来查找正确的结束引用
第2组获取围绕href属性的引用,稍后使用该引用来查找正确的结束引用
第3组获取href属性值,不包括任何引号
第4组获取内部文本,不包括任何周围的空格
[0][0] = <a onmouseover=' img = 10; href="http://www.NotYourURL.com" ; if (3 <href&& href="http://www.NotYourURL.com" && 6>3) { funRotate(href) ; } ; ' id='revSAR' href='http://www.amazon.com/Altec-Lansing-inMotion-Mobile-Speaker/product-reviews/B000EDKP8U/ref=cm_cr_dp_see_all_summary?ie=UTF8&showViewpoints=1&sortBy=byRankDescending' class='txtsmall noTextDecoration'>
See all 111 customer reviews
</a>
[0][1] = '
[0][2] = '
[0][3] = http://www.amazon.com/Altec-Lansing-inMotion-Mobile-Speaker/product-reviews/B000EDKP8U/ref=cm_cr_dp_see_all_summary?ie=UTF8&showViewpoints=1&sortBy=byRankDescending
[0][4] = See all 111 customer reviews
[1][0] = <a id='revSAR' href='http://www.amazon.com/Altec-Lansing-inMotion-Mobile-Speaker/product-reviews/B000EDKP8U/ref=cm_cr_dp_see_all_summary?ie=UTF8&showViewpoints=1&sortBy=byRankDescending' class='txtsmall noTextDecoration'>
See all 136 customer reviews
</a>
[1][1] = '
[1][2] = '
[1][3] = http://www.amazon.com/Altec-Lansing-inMotion-Mobile-Speaker/product-reviews/B000EDKP8U/ref=cm_cr_dp_see_all_summary?ie=UTF8&showViewpoints=1&sortBy=byRankDescending
[1][4] = See all 136 customer reviews