Trying to scrape a dynamic table with Selenium / Beautiful Soup (URL does not change)

Date: 2018-06-27 23:37:11

Tags: python selenium selenium-webdriver web-scraping beautifulsoup

I've been trying to scrape the table below, which I get to after automating the input with chromedriver and solving the captcha through an anti-captcha service. I saw an example where someone used Beautiful Soup after the table had been generated.

It's a multi-page table, but I just want to get the first page for now and then figure out how to click through to the other pages. I'm not sure whether Beautiful Soup can be used here, because when I try the code below I get "No properties to display" as the first row, which is the text that appears when there are no search results.

Since my reputation isn't high enough I can't embed images here (sorry, I'm new to this, it's annoying, and I spent hours trying to figure it out before posting), but if you go to the site and search for "Al" or any other input you can see the table HTML: https://claimittexas.org/app/claim-search

Here is my code:

from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup
from python_anticaptcha import AnticaptchaClient, NoCaptchaTaskProxylessTask
import re
import pandas as pd
import os
import time
import requests

parsed_table_date = []
url = "https://claimittexas.org/app/claim-search"
driver = webdriver.Chrome()
driver.implicitly_wait(15)
driver.get(url)
lastNameField = driver.find_element_by_xpath('//input[@id="lastName"]')
lastNameField.send_keys('Al')
api_key = 'MY_API_KEY'  # my Anticaptcha API key (redacted)
site_key = '6LeQLyEUAAAAAKTwLC-xVC0wGDFIqPg1q3Ofam5M'  # grab from site
client = AnticaptchaClient(api_key)
task = NoCaptchaTaskProxylessTask(url, site_key)
job = client.createTask(task)
print("Waiting to solution by Anticaptcha workers")
job.join()
# Receive response
response = job.get_solution_response()
print("Received solution", response)
# Inject response in webpage
driver.execute_script('document.getElementById("g-recaptcha-response").innerHTML = "%s"' % response)
# Wait a moment to execute the script (just in case).
time.sleep(1)
# Press submit button
driver.find_element_by_xpath('//button[@type="submit" and @class="btn-std"]').click()
time.sleep(1)
html = driver.page_source
soup = BeautifulSoup(html, "lxml")
table = soup.find("table", { "class" : "claim-property-list" })
table_body = table.find('tbody')
#rows = table_body.find_all('tr')
for row in table_body.findAll('tr'):
    print(row)
    for col in row.findAll('td'):
        print(col.text.strip())

1 Answer:

Answer 0 (score: 1):

You get No properties to display. because of the following:

[screenshot of the table markup: the first tbody row is the placeholder row containing the text "No properties to display"]
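
You can confirm this yourself by reading the first row's text content directly (textContent is used here in case the row is hidden, since Selenium's .text only returns visible text):

first_row = driver.find_element_by_xpath("//tbody/tr[1]")
print(first_row.get_attribute("textContent"))  # prints the "No properties to display" placeholder text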

Instead, you have to start iterating from the second index of the elements:

//tbody/tr[2]/td[2]
//tbody/tr[2]/td[3]
//tbody/tr[2]/td[4]
...
//tbody/tr[3]/td[2]
//tbody/tr[3]/td[3]
//tbody/tr[3]/td[4]
...

So you have to specify the starting index for the iteration, like this:

rows = driver.find_elements_by_xpath("//tbody/tr")
for row in rows[1:]:  # skip the first row, the "No properties to display" placeholder
    print(row.text)  # prints the whole row
    for col in row.find_elements_by_xpath('td')[1:]:  # skip the first cell (the CLAIM button)
        print(col.text.strip())

The code above produces the following output:

CLAIM # this is button value
37769557 1ST TEXAS LANDSCAPIN 6522 JASMINE ARBOR LANE HOUSTON TX 77088 MOTEL 6 OPERATING LP ACCOUNTS PAYABLE $351.00 2010
37769557
1ST TEXAS LANDSCAPIN
6522 JASMINE ARBOR LANE
HOUSTON
TX
77088
MOTEL 6 OPERATING LP
ACCOUNTS PAYABLE
$351.00
2010
CLAIM # this is button value
38255919 24X7 APARTMENT FIND OF TEXAS 1818 MOSTON DR SPRING TX 77386 NOT DISCLOSED NOT DISCLOSED $88.76 2017
38255919
24X7 APARTMENT FIND OF TEXAS
1818 MOSTON DR
SPRING
...
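
If you would rather keep parsing with Beautiful Soup, as in your original code, the same fix works there too: skip the first tr of the tbody and the first td of each remaining row. A minimal sketch, continuing from your existing driver and reusing the claim-property-list class from your question:

soup = BeautifulSoup(driver.page_source, "lxml")
table_body = soup.find("table", {"class": "claim-property-list"}).find("tbody")
for row in table_body.find_all("tr")[1:]:  # skip the "No properties to display" placeholder row
    cells = [td.get_text(strip=True) for td in row.find_all("td")[1:]]  # skip the CLAIM button cell
    print(cells)

For the other pages, the usual pattern is to repeat this extraction in a loop and click the pager's next control between iterations. I have not checked this site's pager markup, so the locator below is only an assumed placeholder:

next_links = driver.find_elements_by_xpath('//a[contains(., "Next")]')  # assumed locator, adjust to the real pager
if next_links:
    next_links[0].click()
    time.sleep(1)  # give the next page of results time to render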