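First off, I am very new to Python and web scraping.
I have a page that I need to scrape. I have looked at many resources but cannot figure out how to scrape nested hidden tags. The page requires a login, and my code successfully retrieves the visible data. However, when I try to scrape the elements nested inside the div tag, it finds nothing.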
HTML (before the onClick() event)
<div id="topMenu" style="width: 1920px; position: relative; top: 46px;" onclick="menu(event);" oncontextmenu="javascript:if(!event.ctrlKey){return RightClickPopUp(event);}">
<span id="3" class="cSub" lcid="63" lccl="Item" style="visibility: hidden; display: none; top: 20px;">
<span id="1" menuname="Cancel" parentid="63" class="Menu01" showmenu="010">Cancel</span>
</span>
<span id="3" class="cSub" lcid="63" lccl="Item" style="visibility: hidden; display: none; top: 20px;">
<span id="1" menuname="Cancel" parentid="63" class="Menu01" showmenu="010">Cancel</span>
</span>
</div>
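After the div (which is made up of several buttons) is clicked, the first span tag becomes visible and then leads into its corresponding nested span tag. My problem is accessing the text in the innermost span.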
HTML (after the onClick() event)
<div id="topMenu" style="width: 1920px; position: relative; top: 46px;" onclick="menu(event);" oncontextmenu="javascript:if(!event.ctrlKey){return RightClickPopUp(event);}">
<span id="3" class="cSub" lcid="63" lccl="Item" style="visibility: visible; display: inline; top: 20px;">
<span id="1" menuname="Cancel" parentid="63" class="Menu01" showmenu="010">Cancel</span>
</span>
<span id="3" class="cSub" lcid="63" lccl="Item" style="visibility: visible; display: inline; top: 20px;">
<span id="1" menuname="Cancel" parentid="63" class="Menu01" showmenu="010">Cancel</span>
</span>
</div>
Python code
import mechanize
import http.cookiejar as cookielib
from bs4 import BeautifulSoup as soup
cj = cookielib.CookieJar()
br = mechanize.Browser()
br.set_cookiejar(cj)
br.open("LOGIN_URL")
br.select_form(nr=0)
br.form['USER'] = 'un'
br.form['PASSWORD'] = 'pwd'
br.submit()
check = br.response().read()
print(check)  # login success
my_url = br.open("URL_I_NEED_TO_SCRAPE").read()
page_soup = soup(my_url, "html.parser")
containers = page_soup.find("div",{"id":"topMenu"})
This code gets me the div, but there is nothing inside it. Is there a way to get the spans that are currently hidden inside that div?
Answer 0 (score: 0)
There are several ways to extract inner hidden elements such as span tags, along with attributes like src and alt.
containers = page_soup.find("div", {"id": "topMenu"})
# Outer spans; the CSS visibility/display styles do not affect parsing
top_span = containers.find_all('span', class_='cSub')
print(len(top_span))  # 2 for the HTML shown in the question
inner_span = top_span[0].find('span')
inner_span_text = inner_span.text  # 'Cancel'
class_inside_inner_span = inner_span['class']  # ['Menu01']
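To check this idea without logging in, here is a minimal sketch that runs the same lookup against the pre-click HTML from the question. SAMPLE_HTML is a trimmed copy of that markup and hidden_menu_items is a hypothetical helper name, not part of any library. It only shows that BeautifulSoup returns spans styled as hidden when they are present in the source.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the question's markup in its pre-click (hidden) state.
SAMPLE_HTML = """
<div id="topMenu" onclick="menu(event);">
<span id="3" class="cSub" lcid="63" lccl="Item" style="visibility: hidden; display: none; top: 20px;">
<span id="1" menuname="Cancel" parentid="63" class="Menu01" showmenu="010">Cancel</span>
</span>
<span id="3" class="cSub" lcid="63" lccl="Item" style="visibility: hidden; display: none; top: 20px;">
<span id="1" menuname="Cancel" parentid="63" class="Menu01" showmenu="010">Cancel</span>
</span>
</div>
"""


def hidden_menu_items(html):
    """Return (menuname, text) pairs for the inner spans under div#topMenu."""
    page_soup = BeautifulSoup(html, "html.parser")
    container = page_soup.find("div", {"id": "topMenu"})
    if container is None:
        # The div itself is missing from the HTML that was downloaded.
        return []
    items = []
    for outer in container.find_all("span", class_="cSub"):
        for inner in outer.find_all("span", class_="Menu01"):
            items.append((inner.get("menuname"), inner.get_text(strip=True)))
    return items


items = hidden_menu_items(SAMPLE_HTML)
print(items)
```

If the same call returns an empty list for the real page, the spans are probably not in the HTML that mechanize downloads. A likely cause is that the page builds the menu with JavaScript, which mechanize does not execute, so it is worth printing the raw response and searching it for "cSub" first.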
For more details on web scraping, see this write-up of mine: https://github.com/rajat4665/web-scraping-with-python