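I am scraping a website with BeautifulSoup in Python.
I want to find every <a> tag (with its href) whose id starts with "des " (note the trailing space) followed by 3-4 letters.
I just tried: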
bsObj.findAll("a",{"id":"des "})
But it does not find the tags I am looking for.
Do I need to use a regular expression?
I would appreciate your help. Thanks.
<div>
<a id="des 6TN" href="/en-kr/shop/men/11-by-boris-bidjan-saberi?lvrid=_gm_d6tn">
11 BY BORIS BIDJAN SABERI
</a>
<br/>
<a id="des R6L" href="/en-kr/shop/men/11-eleven?lvrid=_gm_dr6l">
11 ELEVEN
</a>
<br/>
</div>
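Answer 0 (score: 4)
A plain string filter such as {"id": "des "} only matches when the attribute value equals that string exactly, which is why the original call finds nothing. If you go the regular expression route, you can pass a compiled regex pattern as the id argument, like this (an unrelated, non-matching <a> tag has been added for demonstration purposes):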
from bs4 import BeautifulSoup
import re

# Sample markup from the question, plus one non-matching <a id="ds R6L"> tag
soup = BeautifulSoup("""<div>
<a id="des 6TN" href="/en-kr/shop/men/11-by-boris-bidjan-saberi?lvrid=_gm_d6tn">11 BY BORIS BIDJAN SABERI</a><br />
<a id="des R6L" href="/en-kr/shop/men/11-eleven?lvrid=_gm_dr6l">11 ELEVEN</a>
<a id="ds R6L" href="/en-kr/shop/men/11-eleven?lvrid=_gm_dr6l">11 ELEVEN</a><br />""", "html.parser")

# Raw string so that \w is passed to the regex engine unchanged
soup.find_all('a', id=re.compile(r'^des \w{3,4}$'))
# Result (attribute order in the printed tags may vary by Python/bs4 version):
# [<a href="/en-kr/shop/men/11-by-boris-bidjan-saberi?lvrid=_gm_d6tn" id="des 6TN">11 BY BORIS BIDJAN SABERI</a>,
#  <a href="/en-kr/shop/men/11-eleven?lvrid=_gm_dr6l" id="des R6L">11 ELEVEN</a>]
Answer 1 (score: 1)
Here is another way, without regular expressions. I am not fond of regexes, and they are not needed here.
all_des = soup.findAll('a')
# list of every <a> tag
for i in all_des:  # loop through all of them
    # check that the <a> has an id and that the id starts with "des"
    if i.has_attr('id') and i['id'].startswith('des'):
        print(i)
Output:
<a href="/en-kr/shop/men/11-by-boris-bidjan-saberi?lvrid=_gm_d6tn" id="des 6TN">
11 BY BORIS BIDJAN SABERI
</a>
<a href="/en-kr/shop/men/11-eleven?lvrid=_gm_dr6l" id="des R6L">
11 ELEVEN
</a>
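The loop above can also be written as a filter function passed straight to find_all, which BeautifulSoup calls with each tag's id value (or None when the attribute is missing). This is only a minimal sketch: the name is_des_id, the html variable, and the two extra non-matching <a> tags are made up for illustration, and the prefix "des " (with the trailing space) is used as described in the question.

```python
from bs4 import BeautifulSoup

# Markup from the question plus two illustrative non-matching tags
html = """<div>
<a id="des 6TN" href="/en-kr/shop/men/11-by-boris-bidjan-saberi?lvrid=_gm_d6tn">11 BY BORIS BIDJAN SABERI</a><br/>
<a id="des R6L" href="/en-kr/shop/men/11-eleven?lvrid=_gm_dr6l">11 ELEVEN</a><br/>
<a id="ds R6L" href="/en-kr/shop/men/11-eleven?lvrid=_gm_dr6l">11 ELEVEN</a>
<a href="/no-id">NO ID</a>
</div>"""

soup = BeautifulSoup(html, "html.parser")


def is_des_id(value):
    """Hypothetical helper: True when the id exists and starts with 'des '."""
    return value is not None and value.startswith("des ")


# find_all passes each tag's id value (None if absent) to the function
links = soup.find_all("a", id=is_des_id)
ids = [a["id"] for a in links]
print(ids)
```

This keeps the no-regex approach but lets find_all do the filtering, so no explicit has_attr check is needed.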
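Hope this answers your question. @Psidom's approach above may be more convenient for you, but I believe Python's built-in string methods are faster than using a regex.
The regular expression '^des \w{3,4}$' breaks down as follows:
**^** asserts position at the start of the string
**des** followed by a space matches the characters "des " literally (case sensitive)
**\w** matches any word character (for ASCII text, equal to [a-zA-Z0-9_]; in Python 3 it also matches Unicode letters and digits unless re.ASCII is set)
**{3,4}** Quantifier: matches the preceding \w between 3 and 4 times, as many times as possible, giving back as needed (greedy)
**$** asserts position at the end of the string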
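The breakdown above can be checked with the re module alone. The sketch below runs the same pattern over a handful of id values. The candidates list is made up for illustration, and search() is used because that is how BeautifulSoup applies a compiled pattern to an attribute value.

```python
import re

# Same pattern as in the answer, written as a raw string
pattern = re.compile(r'^des \w{3,4}$')

# Hypothetical id values chosen to exercise each part of the pattern
candidates = [
    "des 6TN",    # matches: "des " + 3 word characters
    "des R6L",    # matches: "des " + 3 word characters
    "ds R6L",     # fails: does not start with "des "
    "des AB",     # fails: only 2 word characters
    "des ABCDE",  # fails: 5 word characters, so $ cannot match
    "des  6TN",   # fails: second space is not a word character
]

matched = [c for c in candidates if pattern.search(c)]
print(matched)
```

Only the first two values pass, which is consistent with the two tags returned in the answer.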