Trying to parse table data into a CSV file. Is there a way to parse dynamically generated table data into consecutive CSV rows using BeautifulSoup in Python?

Asked: 2017-06-10 07:23:46

Tags: python pandas selenium web-scraping beautifulsoup

I have a list of names, and I am trying to parse the entire table contents for each one into a single row using XPath. For some names there is less content, my webdriver crashes and the program stops, so I decided to parse the table with pandas instead. I did my research on parsing a table into a CSV file with pandas, but I don't know how to implement it (a minimal sketch of that approach follows the name list below).

Here is the link to the table I am trying to parse into consecutive CSV rows:
DLLC , ACT , OREGON , 11-25-2015 , 11-25-2017 , PPB , PRINCIPAL PLACE OF BUSINESS , 22325 SW MURPHY ST,BEAVERTON , OR   and so on.

Each data field from that table should land in its own cell in Excel, exactly like that. I don't want any headers; I just want the data laid out row by row. Right now I have a list of names like this in a CSV:

HALF MOON BEND FARM, LLC
NICELY GROWN LLC
COPR INCORPORATED
and so on...
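
For reference, here is a minimal sketch of the pandas approach mentioned above. It assumes the detail page's tables render as plain HTML in driver.page_source; the table index and output filename are illustrative guesses, not the asker's code:

import pandas as pd

# Parse every <table> in the rendered page into a list of DataFrames.
tables = pd.read_html(driver.page_source)

# Pick the table of interest (index 1 is a guess; inspect the page).
df = tables[1]

# Flatten all cells into one row and append it to the CSV, no header.
row = df.values.flatten()
pd.DataFrame([row]).to_csv('sos.csv', mode='a', header=False, index=False)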

Here is the code:

from selenium import webdriver
from bs4 import BeautifulSoup
import lxml
import time
import csv

driver = webdriver.Chrome()
driver.get("url")
#time.sleep(5)

username = driver.find_element_by_name("p_name")
#time.sleep(1)
username.send_keys("xxxxxxx")
#username.clear()

driver.find_element_by_xpath("html/body/form/table[6]/tbody/tr/td[2]/input").click()

entity = driver.find_element_by_partial_link_text("xxxxxxx")


entity.click()
html = driver.page_source



Registry_nbr = driver.find_element_by_xpath("html/body/form/table[2]/tbody/tr[2]/td[1]").text
Entity_type = driver.find_element_by_xpath("html/body/form/table[2]/tbody/tr[2]/td[2]").text
Entity_status = driver.find_element_by_xpath("html/body/form/table[2]/tbody/tr[2]/td[3]").text
Registry_date = driver.find_element_by_xpath("html/body/form/table[2]/tbody/tr[2]/td[6]").text
#Next_renewal_date = driver.find_element_by_xpath("html/body/form/table[2]/tbody/tr[2]/td[6]").text
entity_name = driver.find_element_by_xpath("html/body/form/table[3]/tbody/tr/td[2]").text


Ttest = driver.find_element_by_xpath("html/body/form/table[32]/tbody/tr/td[2]").text


with open("sos.csv", "w") as scoreFile:
    scoreFileWriter = csv.writer(scoreFile)
  scoreFileWriter.writerow([Registry_nbr,Entity_type,Entity_status,Registry_date,entity_name],)

scoreFile.close()

soup = BeautifulSoup(html, 'lxml')  # use the lxml parser imported above



for tag in soup.find_all('table'):
    print tag.text   
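
A side note on the crash mentioned in the question: when a detail page has fewer fields, find_element_by_xpath raises NoSuchElementException and the script dies. Here is a hedged sketch of guarding each lookup (the safe_text helper is illustrative, not part of the original code):

from selenium.common.exceptions import NoSuchElementException

def safe_text(driver, xpath):
    # Return the element's text, or an empty string if the cell is absent.
    try:
        return driver.find_element_by_xpath(xpath).text
    except NoSuchElementException:
        return ""

Registry_nbr = safe_text(driver, "html/body/form/table[2]/tbody/tr[2]/td[1]")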

2 Answers:

Answer 0 (score: 0)

Once you have the HTML, you can parse it with BeautifulSoup and find the table you want. Looking at the HTML page you reference, I do not see any class, id, or other identifying attributes to search on, so simply indexing into tables[2] will have to do.

import csv
from bs4 import BeautifulSoup

html = driver.page_source
soup = BeautifulSoup(html, 'html.parser')

NBSP = u'\xa0'
# Build a list of tables; each table is a list of rows, and each row is a
# list of cell texts with non-breaking spaces stripped out (Python 2 map).
tables = [[map(lambda d: d.text.replace(NBSP, u''), r.findAll('td'))
           for r in t.findAll('tr')]
          for t in soup.findAll('table')]

business_entity_data = tables[2]
keys = business_entity_data[0]

# 'wb' mode is for Python 2's csv module.
with open('page.csv', 'wb') as csvfile:
    csvwriter = csv.DictWriter(csvfile, keys)
    csvwriter.writeheader()
    csvwriter.writerow(dict(zip(keys, business_entity_data[1])))

You should end up with a file containing:

Registry Nbr,Entity Type,Entity Status,Jurisdiction,Registry Date,Next Renewal Date,Renewal Due?
1164570-94,DLLC,ACT,OREGON,11-25-2015,11-25-2017,
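
Since the question asks for rows appended continuously with no header, here is a minimal variation on the above (same tables structure assumed):

# Append only the data row, skipping the header, so successive
# entities accumulate as consecutive rows in one file.
with open('page.csv', 'ab') as csvfile:
    csvwriter = csv.writer(csvfile)
    csvwriter.writerow(business_entity_data[1])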

Answer 1 (score: 0)

Use this after entity.click():

html = driver.page_source
soup = BeautifulSoup(html, 'html.parser')
words = soup.find_all("td")

# Collect the text of every table cell on the page into one flat list.
word = list()
for cell in words:
    word.append(cell.text.encode('utf-8'))

# Write all the cell values out as a single CSV row.
with open('name.csv', 'w') as csvfile:
    spamwriter = csv.writer(csvfile, delimiter=',')
    spamwriter.writerow(word)
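
One caveat, as a sketch: find_all("td") collects every cell on the page, not just the business-entity table. If only one table is wanted, index into it first (table index 2 is a guess, matching the other answer):

# Restrict the cells to a single table instead of the whole page.
target_table = soup.find_all('table')[2]
word = [cell.text.encode('utf-8') for cell in target_table.find_all('td')]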

Hope this helps.