The code below prints an HTML web table as Python script output. I then tried to convert it to a pandas DataFrame and export it to CSV, but that failed.
import time
from bs4 import BeautifulSoup
from selenium import webdriver
import pandas as pd
url = 'http://www.altrankarlstad.com/wisp'
driver = webdriver.Chrome('C:\\Users\\rugupta\\AppData\\Roaming\\Microsoft\\Windows\\Start Menu\\Programs\\Python 3.7\\chromedriver.exe')
driver.get(url)
time.sleep(100)
text_field = driver.find_elements_by_xpath('//*[@id="root"]/div/div/div/div[2]/table')
#print (text_field[0].text)
data= text_field[0].text
#Works fine until above section
df= pd.DataFrame(data)
df.to_csv("output.csv")
(The last two lines are where it fails.)
[![enter image description here][1]][1]
[1]: https://i.stack.imgur.com/NpGk2.jpg
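Answer 0 (score: 0)
The problem is that Selenium considers the page loaded as soon as the initial document arrives, whereas you need it to wait until the table containing the data you want to scrape has loaded. So you need to tell Selenium to wait until an element inside the table can be found. In this particular case, each "job" in the table is marked with a specific class name, 'css-58'. The solution looks like this: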
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.common.exceptions import TimeoutException
import time
import pandas as pd
url = 'http://www.altrankarlstad.com/wisp'
driver = webdriver.Chrome("C:\\driver path")
driver.get(url)
# delay is how long to wait on loading the page before it gives up
delay = 600
try:
    # Block until at least one job row (class 'css-58') is present in the DOM
    wait_for_element = WebDriverWait(driver, delay).until(EC.presence_of_element_located((By.CLASS_NAME, 'css-58')))
    text_field = driver.find_elements(By.XPATH, '//*[@id="root"]/div/div/div/div[2]/table')
    data = text_field[0].text
    # Create your dataframe here
    # This will currently fail with
    # ValueError: DataFrame constructor not properly called!
    # because data is a single string, not row/column data
    # You should be able to define the structure of your data frame to suit your needs
    df = pd.DataFrame(data)
    df.to_csv("output.csv")
except TimeoutException:
    print('It took too long')
At this point, all that remains is to work out how to define the structure of your data frame.
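One possible way to give the data a structure is to read the table's HTML instead of its flattened `.text`, so that cell boundaries are preserved. The sketch below is only an illustration under assumptions: the helper names `TableRowParser`, `table_html_to_rows` and `rows_to_dataframe` are made up for this example, the table is assumed to be a plain, non-nested `<table>` whose first row holds the column headers, and every row is assumed to have the same number of cells. The real page may need adjustments.

```python
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Collect the text of each <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Join the fragments and collapse runs of whitespace
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:
                self.rows.append(self._row)
            self._row = None
            self._cell = None


def table_html_to_rows(table_html):
    """Return the table as a list of rows (hypothetical helper)."""
    parser = TableRowParser()
    parser.feed(table_html)
    parser.close()
    return parser.rows


def rows_to_dataframe(rows):
    """Build a DataFrame, assuming the first row is the header."""
    import pandas as pd  # imported here so the parser is usable without pandas
    header, *body = rows
    return pd.DataFrame(body, columns=header)
```

With the Selenium code above, the input could come from `text_field[0].get_attribute("outerHTML")`, and the result could then be written out with `rows_to_dataframe(rows).to_csv("output.csv", index=False)`. If the rows turn out to have differing lengths, the `pd.DataFrame` call will raise and the header handling would need to change.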