I have a list of names and I am trying to use XPath to parse the entire table contents for each entry. For some names the table has fewer fields, so my webdriver lookups fail and the program stops, so I decided to parse the table with pandas instead. I did my research on parsing a table into a CSV file with pandas, but I don't know how to implement it. This is what I am trying to get in the CSV:
DLLC , ACT , OREGON , 11-25-2015 , 11-25-2017 , PPB , PRINCIPAL PLACE OF BUSINESS , 22325 SW MURPHY ST,BEAVERTON , OR and so on.
Each data field from the table should go into its own cell, the way it would in Excel. I don't want any headers, just the data in rows. Right now I have a list of names like this in a CSV:
HALF MOON BEND FARM, LLC
NICELY GROWN LLC
COPR INCORPORATED
so on......
Here is the code:
from selenium import webdriver
from bs4 import BeautifulSoup
import lxml
import time
import csv
driver = webdriver.Chrome()
driver.get("url")
#time.sleep(5)
username = driver.find_element_by_name("p_name")
#time.sleep(1)
username.send_keys("xxxxxxx")
#username.clear()
driver.find_element_by_xpath("html/body/form/table[6]/tbody/tr/td[2]/input").click()
entity= driver.find_element_by_partial_link_text("xxxxxxx")
entity.click()
html = driver.page_source
Registry_nbr = driver.find_element_by_xpath("html/body/form/table[2]/tbody/tr[2]/td[1]").text
Entity_type = driver.find_element_by_xpath("html/body/form/table[2]/tbody/tr[2]/td[2]").text
Entity_status = driver.find_element_by_xpath("html/body/form/table[2]/tbody/tr[2]/td[3]").text
Registry_date = driver.find_element_by_xpath("html/body/form/table[2]/tbody/tr[2]/td[6]").text
#Next_renewal_date = driver.find_element_by_xpath("html/body/form/table[2]/tbody/tr[2]/td[6]").text
entity_name = driver.find_element_by_xpath("html/body/form/table[3]/tbody/tr/td[2]").text
Ttest=driver.find_element_by_xpath("html/body/form/table[32]/tbody/tr/td[2]").text
with open("sos.csv", "w", newline="") as scoreFile:
    scoreFileWriter = csv.writer(scoreFile)
    scoreFileWriter.writerow([Registry_nbr, Entity_type, Entity_status, Registry_date, entity_name])

soup = BeautifulSoup(html, 'lxml')
for tag in soup.find_all('table'):
    print(tag.text)
Answer 0 (score: 0)
Once you have the html you can parse it with BeautifulSoup and find the table you want. Looking at the HTML page you referenced, I don't see any class or id attributes to search for, so simply indexing into tables[2] will have to do.
html = driver.page_source
soup = BeautifulSoup(html, 'html.parser')

NBSP = u'\xa0'
tables = [[[d.text.replace(NBSP, '') for d in r.findAll('td')]
           for r in t.findAll('tr')]
          for t in soup.findAll('table')]

business_entity_data = tables[2]
keys = business_entity_data[0]

with open('page.csv', 'w', newline='') as csvfile:
    csvwriter = csv.DictWriter(csvfile, keys)
    csvwriter.writeheader()
    csvwriter.writerow(dict(zip(keys, business_entity_data[1])))
You should end up with a file containing:
Registry Nbr,Entity Type,Entity Status,Jurisdiction,Registry Date,Next Renewal Date,Renewal Due?
1164570-94,DLLC,ACT,OREGON,11-25-2015,11-25-2017,
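Since the question specifically asks about pandas: `pandas.read_html` returns one DataFrame per `<table>` in the document, and `DataFrame.to_csv` can then write the data without headers, which is what the question asks for. A minimal standalone sketch with a hypothetical one-row table; in the real script the HTML would come from `driver.page_source`, and you would index into the list of tables to pick the one you want:

```python
from io import StringIO

import pandas as pd  # third-party

# Hypothetical snippet standing in for driver.page_source.
html = """<table>
<tr><td>DLLC</td><td>ACT</td><td>OREGON</td><td>11-25-2015</td></tr>
</table>"""

# read_html returns a list with one DataFrame per <table>.
tables = pd.read_html(StringIO(html))
df = tables[0]

# No header row and no index column, just the data.
df.to_csv("page.csv", index=False, header=False)
```

Note that `read_html` needs an HTML parser backend such as lxml installed, which this script already imports.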
Answer 1 (score: 0)
Use this after entity.click()
html = driver.page_source
soup = BeautifulSoup(html, 'html.parser')
words = soup.find_all("td")

word = []
for cell in words:
    word.append(cell.get_text(strip=True))

with open('name.csv', 'w', newline='') as csvfile:
    spamwriter = csv.writer(csvfile, delimiter=',')
    spamwriter.writerow(word)
Hope this will help.
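One caveat: the snippet above puts every cell of the page into a single CSV row. If, as the sample output in the question suggests, each table row should become its own CSV row, iterate over the `<tr>` elements instead. A standalone sketch with a hypothetical two-row table; in the real script the HTML would again be `driver.page_source`:

```python
import csv

from bs4 import BeautifulSoup  # third-party

# Hypothetical snippet standing in for driver.page_source.
html = "<table><tr><td>DLLC</td><td>ACT</td></tr><tr><td>OREGON</td></tr></table>"

soup = BeautifulSoup(html, "html.parser")

# One list of cell texts per <tr>.
rows = [[td.get_text(strip=True) for td in tr.find_all("td")]
        for tr in soup.find_all("tr")]

with open("name.csv", "w", newline="") as csvfile:
    csv.writer(csvfile).writerows(rows)
```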