我在做硒机器人,我需要在搜索后从页面中提取信息,但是遇到了麻烦。
下图中有HTML
我想从“ class ='escapamentoLinhas'”的这些斜体标签中提取文本
from selenium import webdriver
from bs4 import BeautifulSoup
from time import sleep
from selenium.webdriver.support.ui import Select
URL = '\x0\x0\x0\x0'
search = '\x0\x0'
print("Running...")
class ScrapingTJ:
def __init__(self):
self.browser = webdriver.Firefox()
self.browser.get(URL)
sleep(1)
select = Select(self.browser.find_element_by_id('cbPesquisa'))
select.select_by_value('NMPARTE')
sleep(1)
self.browser.find_element_by_xpath('//*[@id="campo_NMPARTE"]').send_keys(CNPJ_CLARO)
self.browser.find_element_by_id('pbEnviar').click()
sleep(2)
dados = self.browser.find_element_by_id('listagemDeProcessos')
HTML = dados.get_attribute("innerHTML")
scraping = BeautifulSoup(HTML, "html.parser")
# links
links = scraping.find_all('a')
for scrap in links:
print(scrap.get_text())
textos = scraping.find(class_ = 'espacamentoLinhas')
subtextos = scraping.find_all('span')
for ext in subtextos:
print(ext.get_text())
if __name__ == '__main__':
ScrapingTJ()
退出:
Exectdo:
Recebido em:
Exectda:
Recebido em:
在:
中,我应该得到'2007年4月30日-图片中的下划线是Vara dasExecuçõesFiscais Estaduais'
答案 0 :(得分:0)
根据您提供的HTML图像,看起来元素文本位于div
元素中,而不是span
中。您将需要从div
中提取文本,而不是span
。我将替换此块:
textos = scraping.find(class_ = 'espacamentoLinhas')
subtextos = scraping.find_all('span')
for ext in subtextos:
print(ext.get_text())
与此:
elements = self.browser.find_elements_by_xpath("//div[@class='espacamentoLinhas']")
for element in elements:
print(element.text)
span
仅包含文本'Recebido em:',而不包含您要查找的文本30/04/2007 - Vara das Execuções Fiscais Estaduais
。该文本实际上包含在我所包含的XPath中引用的div
中。
如果您不想使用self.browser.find_elements_by_xpath
,则可以只删除代码中的scraping.find_all('span')
部分:
textos = scraping.find(class_ = 'espacamentoLinhas')
for ext in textos:
print(ext.get_text())