我想用漂亮的汤来抓取网页内容。
但是,div id标签有动态ID。例如,在这种情况下,动态生成数字1。我该如何使用它?
我试过这个。
from bs4 import BeautifulSoup
import urllib
r = urllib.urlopen(
'http://forums.hardwarezone.com.sg/eat-drink-man-woman-16/%5Bofficial%5D-chit-chat-students-part-2-a-5526993-55.html').read()
soup = BeautifulSoup(r, "lxml")
letters = soup.find_all("div", attrs={"id":"post_message"})
print letters
字母返回一个空列表。
答案 0 :(得分:3)
您可以在attrs
内使用正则表达式:
from bs4 import BeautifulSoup
import urllib
import re
r = urllib.urlopen(
'http://forums.hardwarezone.com.sg/eat-drink-man-woman-16/%5Bofficial%5D-chit-chat-students-part-2-a-5526993-55.html').read()
soup = BeautifulSoup(r, "lxml")
letters = soup.find_all("div", attrs={"id": re.compile('post_message_\d+')})
print letters
答案 1 :(得分:2)
你可以试试这个。
from bs4 import BeautifulSoup
import urllib
import re
r = urllib.urlopen(
'http://forums.hardwarezone.com.sg/eat-drink-man-woman-16/%5Bofficial%5D-chit-chat-students-part-2-a-5526993-55.html').read()
soup = BeautifulSoup(r, "lxml")
letters = soup.find_all("div", attrs={"id": re.compile("^post_message_\d+")})
print letters