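Finding ids that match a specific condition in BeautifulSoup

Time: 2017-08-24 12:41:31

Tags: python regex beautifulsoup

I am scraping a website with BeautifulSoup in Python.

I want to find every a tag whose id starts with "des " (with a trailing space) followed by 3-4 letters.

I just tried: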

bsObj.findAll("a",{"id":"des "})

But it doesn't find what I'm looking for.

Do I need to use a regular expression?

I'd appreciate your help. Thanks.

<div>
    <a id="des 6TN" href="/en-kr/shop/men/11-by-boris-bidjan-saberi?lvrid=_gm_d6tn">
        11 BY BORIS BIDJAN SABERI
    </a>
    <br/>
    <a id="des R6L" href="/en-kr/shop/men/11-eleven?lvrid=_gm_dr6l">
        11 ELEVEN
    </a>
    <br/>
</div>

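2 answers:

Answer 0 (score: 4)

If you go the regex route, you can pass a compiled regular expression pattern to the id parameter, like so (an unrelated, non-matching a tag has been added for demonstration purposes):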

from bs4 import BeautifulSoup
import re

# Sample markup; the third <a> (id="ds R6L") is an intentional non-match.
soup = BeautifulSoup("""<div>
<a id="des 6TN" href="/en-kr/shop/men/11-by-boris-bidjan-saberi?lvrid=_gm_d6tn">11 BY BORIS BIDJAN SABERI</a><br />
<a id="des R6L" href="/en-kr/shop/men/11-eleven?lvrid=_gm_dr6l">11 ELEVEN</a>
<a id="ds R6L" href="/en-kr/shop/men/11-eleven?lvrid=_gm_dr6l">11 ELEVEN</a><br />
</div>""", "html.parser")

# Raw string so that \w reaches the regex engine unchanged.
soup.find_all('a', id=re.compile(r'^des \w{3,4}$'))

# [<a href="/en-kr/shop/men/11-by-boris-bidjan-saberi?lvrid=_gm_d6tn" id="des 6TN">11 BY BORIS BIDJAN SABERI</a>,
#  <a href="/en-kr/shop/men/11-eleven?lvrid=_gm_dr6l" id="des R6L">11 ELEVEN</a>]
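
For comparison, here is a minimal sketch of why the attempt in the question returns nothing, together with a CSS attribute-selector alternative. It assumes bs4 with the built-in "html.parser"; the names html, exact and by_prefix are only illustrative. A plain string passed for id is compared with the whole attribute value, so "des " never equals "des 6TN". Note that the selector only checks the prefix, not the 3-4 character suffix that the regex enforces.

```python
from bs4 import BeautifulSoup

# Sample markup based on the question; the third <a> is a deliberate non-match.
html = """
<div>
    <a id="des 6TN" href="/en-kr/shop/men/11-by-boris-bidjan-saberi?lvrid=_gm_d6tn">11 BY BORIS BIDJAN SABERI</a>
    <a id="des R6L" href="/en-kr/shop/men/11-eleven?lvrid=_gm_dr6l">11 ELEVEN</a>
    <a id="ds R6L" href="/en-kr/shop/men/11-eleven?lvrid=_gm_dr6l">11 ELEVEN</a>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# A plain string must equal the full attribute value, so this finds nothing.
exact = soup.find_all("a", {"id": "des "})

# CSS "starts with" selector: every <a> whose id begins with "des ".
by_prefix = soup.select('a[id^="des "]')

print(len(exact))
print([a["id"] for a in by_prefix])
```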

Answer 1 (score: 1)

Here is another way, without regular expressions. I don't like regexes, and I don't need them here.

all_des = soup.find_all('a')
# list of every <a> tag

for i in all_des:  # loop through all of them
    # check that the <a> has an id and that the id starts with "des " (note the trailing space)
    if i.has_attr('id') and i['id'].startswith('des '):
        print(i)

Output:

<a href="/en-kr/shop/men/11-by-boris-bidjan-saberi?lvrid=_gm_d6tn" id="des 6TN">
        11 BY BORIS BIDJAN SABERI
    </a>
<a href="/en-kr/shop/men/11-eleven?lvrid=_gm_dr6l" id="des R6L">
        11 ELEVEN
    </a>
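
The same non-regex check can probably be handed to find_all directly, since bs4 accepts a function as an attribute filter. This is only a sketch under that assumption; the helper name id_starts_with_des and the sample markup (including the extra tag without an id) are made up for illustration.

```python
from bs4 import BeautifulSoup

# Sample markup based on the question, plus one <a> without an id.
html = """
<div>
    <a id="des 6TN" href="/en-kr/shop/men/11-by-boris-bidjan-saberi?lvrid=_gm_d6tn">11 BY BORIS BIDJAN SABERI</a>
    <a id="des R6L" href="/en-kr/shop/men/11-eleven?lvrid=_gm_dr6l">11 ELEVEN</a>
    <a id="ds R6L" href="/en-kr/shop/men/11-eleven?lvrid=_gm_dr6l">11 ELEVEN</a>
    <a href="/no-id">NO ID</a>
</div>
"""
soup = BeautifulSoup(html, "html.parser")


def id_starts_with_des(value):
    # Hypothetical helper: receives the id value; guard against None,
    # which can be passed for tags that have no id attribute.
    return value is not None and value.startswith("des ")


matches = soup.find_all("a", id=id_starts_with_des)
print([a["id"] for a in matches])
```

This keeps the filtering inside a single find_all call while still avoiding the re module.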

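Hope this answers your question. The method in @Psidom's great answer above may be more convenient for you, but I believe Python's built-in string methods are faster than using a regex.

Explanation of the regex '^des \w{3,4}$':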

  

**^** asserts position at the start of the string

**des** (followed by a space) matches the characters "des " literally (case sensitive)

**\w** matches any word character (equal to [a-zA-Z0-9_] for ASCII text; on Python 3 str patterns it also matches Unicode letters and digits unless re.ASCII is set)

**{3,4}** Quantifier: matches between 3 and 4 times, as many times as possible, giving back as needed (greedy)

**$** asserts position at the end of the string
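
To see how the pieces described above combine, the pattern can be tried on a few ids using only the standard re module. The sample strings below are invented for illustration; only the first two come from the question.

```python
import re

pattern = re.compile(r"^des \w{3,4}$")

# Candidate id values: two from the question plus made-up edge cases.
samples = ["des 6TN", "des R6L", "des ABCD", "ds R6L", "des AB", "des ABCDE", "Des 6TN"]

# True where the whole id is "des " followed by 3 or 4 word characters.
results = [bool(pattern.search(s)) for s in samples]

for s, ok in zip(samples, results):
    print(repr(s), ok)
```

Because of the ^ and $ anchors, ids with a wrong prefix, a different case, or a suffix that is too short or too long are rejected.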