Convert a Python text table to a Pandas DataFrame, then to CSV

Time: 2019-12-09 13:00:48

Tags: python dataframe export-to-csv

The code below prints an HTML web table as Python script output. I then tried to convert that output to a Pandas DataFrame and export it to CSV, but that step fails.

import time
from bs4 import BeautifulSoup
from selenium import webdriver
import pandas as pd

url = 'http://www.altrankarlstad.com/wisp'

driver = webdriver.Chrome('C:\\Users\\rugupta\\AppData\\Roaming\\Microsoft\\Windows\\Start Menu\\Programs\\Python 3.7\\chromedriver.exe')

driver.get(url)
time.sleep(100) 

text_field = driver.find_elements_by_xpath('//*[@id="root"]/div/div/div/div[2]/table')
# print(text_field[0].text)
data = text_field[0].text
# Works fine up to this point

df = pd.DataFrame(data)
df.to_csv("output.csv")
# ...but no success here!

[![enter image description here][1]][1]


  [1]: https://i.stack.imgur.com/NpGk2.jpg

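1 answer:

Answer 0: (score: 0)

The problem is that selenium detects that the page has loaded, but you need it to wait until the table containing the data you want to scrape has loaded. So you need to tell selenium to wait until an element inside the table is found. In this particular case, each "job" in the table is marked with a specific class name, 'css-58'. The solution is as follows: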

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.common.exceptions import TimeoutException
import pandas as pd


url = 'http://www.altrankarlstad.com/wisp'

driver = webdriver.Chrome("C:\\driver path")
driver.get(url)

# delay is how long (in seconds) to wait for the table to load before giving up
delay = 600

try:
    # Block until at least one element with class 'css-58' is present in the DOM
    WebDriverWait(driver, delay).until(EC.presence_of_element_located((By.CLASS_NAME, 'css-58')))
    text_field = driver.find_elements_by_xpath('//*[@id="root"]/div/div/div/div[2]/table')
    data = text_field[0].text

    # Create your dataframe here.
    # As written, this fails because data is a single string:
    # ValueError: DataFrame constructor not properly called!
    # Define the structure of the data frame to suit your needs.
    df = pd.DataFrame(data)
    df.to_csv("output.csv")
except TimeoutException:
    print('It took too long')

At this point, all that is left is to work out how to define the structure of the data frame.
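
One possible way to define that structure is sketched below. It assumes the table is an ordinary HTML table whose first row holds the column headers. The helper name `rows_to_dataframe` and the sample rows are made up for illustration, since the actual columns on the page are not shown in the question. The commented lines suggest how the rows might be collected cell by cell with Selenium instead of relying on the table's `.text`, which joins all cells into one string.

```python
import pandas as pd


def rows_to_dataframe(rows):
    """Build a DataFrame from a list of rows, treating the first row as the header."""
    if not rows:
        return pd.DataFrame()
    header, *body = rows
    return pd.DataFrame(body, columns=header)


# With Selenium, the rows could be read cell by cell rather than from
# table.text, so that cells containing spaces stay intact (assumption: the
# page uses regular <tr>/<th>/<td> markup inside the table):
#   table = driver.find_element(By.XPATH, '//*[@id="root"]/div/div/div/div[2]/table')
#   rows = [[cell.text for cell in tr.find_elements(By.XPATH, './th|./td')]
#           for tr in table.find_elements(By.TAG_NAME, 'tr')]

# Hypothetical sample data standing in for the scraped rows.
sample_rows = [
    ["Title", "Location", "Start"],
    ["Test engineer", "Karlstad", "2020-01-01"],
    ["Python developer", "Gothenburg", "2020-02-15"],
]

df = rows_to_dataframe(sample_rows)

# With no path argument, to_csv returns the CSV content as a string.
csv_text = df.to_csv(index=False)
print(csv_text)
```

To write the file directly, `df.to_csv("output.csv", index=False)` could replace the last two lines. Passing `index=False` leaves out the numeric row index that pandas would otherwise write as the first column.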