Scraping div tags whose id is dynamic

Date: 2017-01-02 07:01:43

Tags: python web-crawler

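I want to scrape the content of a web page with Beautiful Soup.

However, the id attribute of the div tags is dynamic. For example, in this case the number 1 is generated dynamically. How can I match these divs?

This is what I tried: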

from bs4 import BeautifulSoup
from urllib.request import urlopen

# Download the forum page.
r = urlopen(
    'http://forums.hardwarezone.com.sg/eat-drink-man-woman-16/%5Bofficial%5D-chit-chat-students-part-2-a-5526993-55.html').read()

soup = BeautifulSoup(r, "lxml")
# Look for divs whose id is exactly "post_message".
letters = soup.find_all("div", attrs={"id": "post_message"})
print(letters)

letters comes back as an empty list.
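The empty result can be reproduced offline. The sketch below assumes the page's ids look like post_message_1, post_message_2, and so on; the SAMPLE_HTML string is hypothetical stand-in markup, not the real forum page, and html.parser is used only to avoid depending on lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the forum page; the real page may differ.
SAMPLE_HTML = """
<div id="post_message_1">first post</div>
<div id="post_message_2">second post</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# A plain string filter must equal the whole id value, so the prefix
# "post_message" matches neither "post_message_1" nor "post_message_2".
exact = soup.find_all("div", attrs={"id": "post_message"})
print(exact)  # prints []
```

A string filter is an exact comparison, which is why a pattern-based filter is needed for ids that carry a changing suffix.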

2 answers:

Answer 0: (score: 3)

You can pass a compiled regular expression inside attrs:

from bs4 import BeautifulSoup
from urllib.request import urlopen
import re

# Download the forum page.
r = urlopen(
    'http://forums.hardwarezone.com.sg/eat-drink-man-woman-16/%5Bofficial%5D-chit-chat-students-part-2-a-5526993-55.html').read()

soup = BeautifulSoup(r, "lxml")
# Match every div whose id contains "post_message_" followed by digits.
letters = soup.find_all("div", attrs={"id": re.compile(r'post_message_\d+')})
print(letters)
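Once the divs are matched, their text can be collected by id. This is a minimal offline sketch, assuming ids of the form post_message_N; SAMPLE_HTML and the messages dictionary are illustrative names that do not come from the original thread.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the forum page; the real page may differ.
SAMPLE_HTML = """
<div id="sidebar">navigation</div>
<div id="post_message_1">first post</div>
<div id="post_message_2">second post</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# A compiled pattern is searched against each id value.
posts = soup.find_all("div", id=re.compile(r"post_message_\d+"))

# Map each matched id to the stripped text of its div.
messages = {div["id"]: div.get_text(strip=True) for div in posts}
print(messages)
```

The sidebar div is skipped because its id does not match the pattern.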

Answer 1: (score: 2)

You can try this, which anchors the pattern to the start of the id:

from bs4 import BeautifulSoup
from urllib.request import urlopen
import re

# Download the forum page.
r = urlopen(
    'http://forums.hardwarezone.com.sg/eat-drink-man-woman-16/%5Bofficial%5D-chit-chat-students-part-2-a-5526993-55.html').read()

soup = BeautifulSoup(r, "lxml")

# Match only divs whose id starts with "post_message_" followed by digits.
letters = soup.find_all("div", attrs={"id": re.compile(r"^post_message_\d+")})
print(letters)
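The two answers differ only in the leading ^. The sketch below shows when that matters, and also tries a CSS attribute selector as a possible alternative that neither answer mentions. The markup, including the quick_post_message_2 id, is hypothetical and chosen only to expose the difference.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup: the second id merely contains "post_message_".
SAMPLE_HTML = """
<div id="post_message_1">first post</div>
<div id="quick_post_message_2">reply box</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

def ids(tags):
    """Return the id attribute of each tag, in document order."""
    return [tag["id"] for tag in tags]

# Unanchored: the pattern may match anywhere inside the id.
unanchored = ids(soup.find_all("div", id=re.compile(r"post_message_\d+")))

# Anchored: the id must begin with "post_message_".
anchored = ids(soup.find_all("div", id=re.compile(r"^post_message_\d+")))

# CSS alternative: [id^="..."] selects ids that start with the given prefix.
by_css = ids(soup.select('div[id^="post_message_"]'))

print(unanchored)  # prints ['post_message_1', 'quick_post_message_2']
print(anchored)    # prints ['post_message_1']
print(by_css)      # prints ['post_message_1']
```

The anchored pattern is the safer choice if the page might contain other ids that merely embed the same text. Note that the CSS selector checks only the prefix, not that digits follow it.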