我想抓取一个HTML表,其中包含<div class="...">
格式的元素。要刮掉它,我想我需要使用:
if found driver.find_element_by_xpath contains(footable-row-detail-name)
get value from /following-sibling which is (class="footable-row-detail-value")
这只是一张桌子。我要抓取的网站上有很多表,有些表没有所有数据(这就是“如果找到”的原因)
我想为此使用python 3。 我希望我能解释清楚。一个表的HTML代码:
<div class="footable-row-detail-inner">
<div class="footable-row-detail-row">
<div class="footable-row-detail-name">
Discipline(s) thérapeutique(s):
</div>
<div class="footable-row-detail-value">
197. Omeopatia, 202. Linfodrenaggio manuale, 205. Massaggio classico, 664. Riflessoterapia generale
</div>
</div>
<div class="footable-row-detail-row">
<div class="footable-row-detail-name">
Cognome:
</div>
<div class="footable-row-detail-value">
ABBONDANZIERI Katia
</div>
</div>
<div class="footable-row-detail-row">
<div class="footable-row-detail-name">
Via:
</div>
<div class="footable-row-detail-value">
Place du Cirque, 2
</div>
</div>
<div class="footable-row-detail-row">
<div class="footable-row-detail-name">
NPA:
</div>
<div class="footable-row-detail-value">
1204
</div>
</div>
<div class="footable-row-detail-row">
<div class="footable-row-detail-name">
Luogo:
</div>
<div class="footable-row-detail-value">
Genève
</div>
</div>
<div class="footable-row-detail-row">
<div class="footable-row-detail-name">
Tel / Cellulare:
</div>
<div class="footable-row-detail-value">
022 328 23 44
</div>
</div>
<div class="footable-row-detail-row">
<div class="footable-row-detail-name">
Cellulare:
</div>
<div class="footable-row-detail-value">
079 601 92 75
</div>
</div>
<div class="footable-row-detail-row">
<div class="footable-row-detail-name">
Discipline(s) thérapeutique(s):
</div>
<div class="footable-row-detail-value">
<div class="thZone">
<div class="zCat">
METHODES DE MASSAGE
</div>
<div class="zThr">
Linfodrenaggio manuale
</div>
<div class="zThr">
Massaggio classico
</div>
<div class="zCat">
METHODES PRESCRIPTIVES
</div>
<div class="zThr">
Omeopatia
</div>
<div class="zCat">
METHODES REFLEXES
</div>
<div class="zThr">
Riflessoterapia generale
</div>
</div>
</div>
</div>
感谢您的帮助。
答案 0 :(得分:0)
使用python3的一种解决方案是html.parser模块!
有一个简单的示例可以帮助您入门:)
答案 1 :(得分:0)
这为我运行。我正在使用jupyter并逐行运行此代码。尚未加载元素时,您可能会遇到错误,因此,如果发生错误,请进行调整。
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import time
import pandas as pd
driver = webdriver.Chrome()
driver.get("http://asca.ch/Partners.aspx?lang=it")
cantone = driver.find_element_by_xpath("""//*[@id="ctl00_MainContent_ddl_cantons_Input"]""")
cantone.click()
cantone.send_keys('GE')
cantone.send_keys(Keys.ENTER)
confermo = driver.find_element_by_xpath("""//*[@id="MainContent__chkDisclaimer"]""")
confermo.click()
ricera = driver.find_element_by_xpath("""//*[@id="MainContent_btn_submit"]""")
ricera.click()
toggle = driver.find_elements_by_class_name("""footable-toggle""")
print(toggle)
while not toggle:
time.sleep(.2)
toggle = driver.find_elements_by_class_name("""footable-toggle""")
for r in toggle:
time.sleep(.2)
r.click()
data = driver.find_elements_by_class_name("""footable-row-detail-cell""")
while not data:
time.sleep(.2)
data = driver.find_elements_by_class_name("""footable-row-detail-cell""")
list_df = []
for r in data:
ratum = r.get_attribute('innerHTML')
datum = r.get_attribute('innerHTML')\
.replace("""<div class="footable-row-detail-inner">""","<table>")\
.replace("""<div class="footable-row-detail-row">""","<tr>")\
.replace("""<div class="footable-row-detail-name">""","<td>")\
.replace("""<div class="footable-row-detail-value">""","</td><td>")
list_df.append(dict(pd.read_html(datum)[0].values.tolist()))
df = pd.DataFrame(list_df)
df.to_csv('data.csv')
print(df)