Parsing text from certain html elements using selenium

Time: 2017-10-13 12:47:10

Tags: selenium selenium-webdriver web-scraping

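From what I have seen so far, a page's source obtained through selenium can be parsed with bs4 or lxml to pull out the required content, whether or not the page relies on JavaScript. My question, however, is how to use selenium to get just a certain html element and then parse that element with the bs4 or lxml library. Taking the element pasted below, this is how I would apply bs4 or lxml to it: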

html='''
<tr onmouseover="this.originalstyle=this.style.backgroundColor;this.style.backgroundColor='DodgerBlue';
this.originalcolor=this.style.color;this.style.color='White';Tip('<span Style=Color:Red>License: <BR />20-214767 (Validity: 21/05/2022)<BR />20C-214769 (Validity: 21/05/2022)<BR />21-214768 (Validity: 21/05/2022)</span>');" onmouseout="this.style.backgroundColor=this.originalstyle;this.style.color=this.originalcolor;UnTip();" style="background-color:White;font-family:Times New Roman;font-size:12px;">
        <td style="font-size:10px;font-weight:normal;font-style:normal;text-decoration:none;" align="left">AAYUSH PHARMA</td><td style="font-size:10px;font-weight:normal;font-style:normal;text-decoration:none;" align="left">PUNE-1ST FLOOR, SR.NO.742/A, DINSHOW APARTMENT,,SWAYAM HOSPITAL AND NURSING HOME, BHAWANI PETH</td><td style="font-weight:normal;font-style:normal;text-decoration:none;" align="center">RH - 3</td><td>swapnil ramakant pawar, BPH, [140514-21/04/2017]</td>
</tr>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,"lxml")
#rest of the code here

from lxml.html import fromstring
tree = fromstring(html)           
#rest of the code here

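Now, how can I use selenium to get the html portion pasted above and then apply the bs4 library to it? I can't see how driver.page_source would help, as it only applies to the web page as a whole.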

To be more specific, if I want to use something like the code below, how can that be done?

from selenium import webdriver
driver = webdriver.Chrome()

element_html = driver-------(html)  #this "html" is the above pasted one
print(element_html)

1 answer:

Answer 0 (score: 1)

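driver.page_source gives you the complete HTML source of the page at a particular moment. If you have an element instance, however, you can get its outerHTML with the .get_attribute() method: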

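from bs4 import BeautifulSoup
from selenium.webdriver.common.by import By

# find_element_by_id("some_id") was removed in newer Selenium 4 releases; find_element(By.ID, ...) replaces it
element = driver.find_element(By.ID, "some_id")
element_html = element.get_attribute("outerHTML")  # markup of this element only, not the whole page

soup = BeautifulSoup(element_html, "lxml")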
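Once the outerHTML has been fetched it is an ordinary string, so it can be parsed without any further browser calls. The sketch below illustrates this with the standard library's html.parser instead of bs4, purely so that it runs without third-party packages. Here `row_html` is a shortened stand-in for what `element.get_attribute("outerHTML")` would return for the row in the question, and `CellCollector` is a hypothetical helper name.

```python
from html.parser import HTMLParser

# Stand-in for element.get_attribute("outerHTML"); shortened from the row in the question.
row_html = (
    '<tr style="background-color:White;">'
    '<td align="left">AAYUSH PHARMA</td>'
    '<td align="center">RH - 3</td>'
    '<td>swapnil ramakant pawar, BPH, [140514-21/04/2017]</td>'
    '</tr>'
)


class CellCollector(HTMLParser):
    """Hypothetical helper that gathers the text of every <td> it sees."""

    def __init__(self):
        super().__init__()
        self.cells = []
        self._in_td = False

    def handle_starttag(self, tag, attrs):
        # Start a new, empty cell whenever a <td> opens.
        if tag == "td":
            self._in_td = True
            self.cells.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_td = False

    def handle_data(self, data):
        # Only text found inside a <td> is kept.
        if self._in_td:
            self.cells[-1] += data


collector = CellCollector()
collector.feed(row_html)
collector.close()
cells = [cell.strip() for cell in collector.cells]
print(cells)
```

With bs4 available, the same cells could presumably be read from the `soup` built above by iterating over `soup.find_all("td")`.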

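As for extracting the span element source out of the onmouseover attribute: I would first parse the tr element with BeautifulSoup, then get its onmouseover attribute, and then use a regular expression to extract the html value from the Tip() function call. After that, re-parse the span html with BeautifulSoup: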

import re

from bs4 import BeautifulSoup

html='''
<tr onmouseover="this.originalstyle=this.style.backgroundColor;this.style.backgroundColor='DodgerBlue';
this.originalcolor=this.style.color;this.style.color='White';Tip('<span Style=Color:Red>License: <BR />20-214767 (Validity: 21/05/2022)<BR />20C-214769 (Validity: 21/05/2022)<BR />21-214768 (Validity: 21/05/2022)</span>');" onmouseout="this.style.backgroundColor=this.originalstyle;this.style.color=this.originalcolor;UnTip();" style="background-color:White;font-family:Times New Roman;font-size:12px;">
        <td style="font-size:10px;font-weight:normal;font-style:normal;text-decoration:none;" align="left">AAYUSH PHARMA</td><td style="font-size:10px;font-weight:normal;font-style:normal;text-decoration:none;" align="left">PUNE-1ST FLOOR, SR.NO.742/A, DINSHOW APARTMENT,,SWAYAM HOSPITAL AND NURSING HOME, BHAWANI PETH</td><td style="font-weight:normal;font-style:normal;text-decoration:none;" align="center">RH - 3</td><td>swapnil ramakant pawar, BPH, [140514-21/04/2017]</td>
</tr>
'''

soup = BeautifulSoup(html, "lxml")
mouse_over = soup.tr['onmouseover']

span = re.search(r"Tip\('(.*?)'\)", mouse_over).group(1)
span_soup = BeautifulSoup(span, "lxml")
print(span_soup.get_text())
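Because get_text() uses an empty separator by default, the text on either side of each <BR /> is joined directly, so the three licence entries run together in the printed string. A minimal sketch of one way to keep them apart, assuming the tooltip always has the shape shown above: split the captured span markup on its <BR /> tags with the standard re module, then strip the remaining tags. `split_tip_lines` is a hypothetical helper, and `span` repeats the value captured by the regex above.

```python
import re

# The span markup as captured by the Tip('...') regular expression above.
span = (
    "<span Style=Color:Red>License: <BR />20-214767 (Validity: 21/05/2022)"
    "<BR />20C-214769 (Validity: 21/05/2022)"
    "<BR />21-214768 (Validity: 21/05/2022)</span>"
)


def split_tip_lines(span_html):
    """Hypothetical helper: split tooltip markup on <br> tags and drop other tags."""
    # Split on <br>, <br/>, <BR /> and similar variants.
    parts = re.split(r"<br\s*/?>", span_html, flags=re.IGNORECASE)
    # Remove whatever tags remain (here the opening and closing span).
    cleaned = (re.sub(r"<[^>]+>", "", part).strip() for part in parts)
    return [text for text in cleaned if text]


lines = split_tip_lines(span)
for line in lines:
    print(line)
```

If bs4 is preferred, passing a separator such as span_soup.get_text(separator="\n") should give a similar line-per-entry result.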