Scrape a table by div class using following-sibling, if the data is found

Time: 2019-01-24 11:57:51

Tags: python web-scraping

I want to scrape an HTML table whose elements are laid out as <div class="..."> blocks. To scrape it, I think I need something like:

if found driver.find_element_by_xpath contains(footable-row-detail-name)
get value from /following-sibling which is (class="footable-row-detail-value")

This is just one table. The site I want to scrape has many such tables, and some of them do not contain every field (hence the "if found").
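The pseudocode above could be sketched roughly as follows. This is only an illustration, not a tested scraper for the site: the helper names `detail_value_xpath` and `find_detail_value` are hypothetical, `root` is assumed to be a Selenium 4 WebDriver or WebElement, and labels are assumed to contain no double quotes. The label is matched exactly rather than with `contains()`, because "Tel / Cellulare:" would otherwise also match "Cellulare:".

```python
def detail_value_xpath(label):
    """Build an XPath for the value div that follows the name div equal to `label`."""
    # Assumption: `label` contains no double quotes.
    return (
        './/div[contains(@class, "footable-row-detail-name")]'
        f'[normalize-space(.) = "{label}"]'
        '/following-sibling::div[contains(@class, "footable-row-detail-value")]'
    )


def find_detail_value(root, label):
    """Return the value text for `label`, or None when the row is missing.

    `root` is assumed to be a Selenium WebDriver or WebElement.
    """
    # find_elements returns an empty list instead of raising when nothing
    # matches, which gives the "if found" behaviour. "xpath" is the string
    # value of selenium's By.XPATH constant.
    matches = root.find_elements("xpath", detail_value_xpath(label))
    return matches[0].text.strip() if matches else None
```

With a loaded page, `find_detail_value(driver, "Cognome:")` would return the surname text, and a label that the table lacks would give `None` rather than an exception.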

I want to use Python 3 for this. I hope I have explained it clearly. The HTML for one table:

<div class="footable-row-detail-inner">
<div class="footable-row-detail-row">
    <div class="footable-row-detail-name">
        Discipline(s) thérapeutique(s):
    </div>
    <div class="footable-row-detail-value">
        197. Omeopatia, 202. Linfodrenaggio manuale, 205. Massaggio classico, 664. Riflessoterapia generale
    </div>
</div>
<div class="footable-row-detail-row">
    <div class="footable-row-detail-name">
        Cognome:
    </div>
    <div class="footable-row-detail-value">
        ABBONDANZIERI Katia
    </div>
</div>
<div class="footable-row-detail-row">
    <div class="footable-row-detail-name">
        Via:
    </div>
    <div class="footable-row-detail-value">
        Place du Cirque, 2
    </div>
</div>
<div class="footable-row-detail-row">
    <div class="footable-row-detail-name">
        NPA:
    </div>
    <div class="footable-row-detail-value">
        1204
    </div>
</div>
<div class="footable-row-detail-row">
    <div class="footable-row-detail-name">
        Luogo:
    </div>
    <div class="footable-row-detail-value">
        Genève
    </div>
</div>
<div class="footable-row-detail-row">
    <div class="footable-row-detail-name">
        Tel / Cellulare:
    </div>
    <div class="footable-row-detail-value">
        022 328 23 44
    </div>
</div>
<div class="footable-row-detail-row">
    <div class="footable-row-detail-name">
        Cellulare:
    </div>
    <div class="footable-row-detail-value">
        079 601 92 75
    </div>
</div>
<div class="footable-row-detail-row">
    <div class="footable-row-detail-name">
        Discipline(s) thérapeutique(s):
    </div>
    <div class="footable-row-detail-value">
        <div class="thZone">
            <div class="zCat">
                METHODES DE MASSAGE
            </div>
            <div class="zThr">
                Linfodrenaggio manuale
            </div>
            <div class="zThr">
                Massaggio classico
            </div>
            <div class="zCat">
                METHODES PRESCRIPTIVES
            </div>
            <div class="zThr">
                Omeopatia
            </div>
            <div class="zCat">
                METHODES REFLEXES
            </div>
            <div class="zThr">
                Riflessoterapia generale
            </div>
        </div>
    </div>
</div>

Thanks for your help.

2 answers:

Answer 0 (score: 0)

One option in Python 3 is the html.parser module!

There is a simple example there to help you get started :)
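The example the answer points to is not reproduced here. As a rough sketch of what an html.parser approach might look like for the markup in the question, the hypothetical `DetailRowParser` class below collects each name/value pair and joins the text of nested divs with commas. It assumes the HTML has already been downloaded as a string and that every value div directly follows its name div.

```python
from html.parser import HTMLParser


class DetailRowParser(HTMLParser):
    """Collect (name, value) pairs from footable detail rows."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished (name, value) pairs
        self._field = None  # "name" or "value" while inside such a div
        self._depth = 0     # div nesting depth inside the current field
        self._name = None   # name waiting for its value
        self._chunks = []   # text pieces of the current field

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self._field is not None:
            self._depth += 1  # nested div inside a name/value div
            return
        classes = (dict(attrs).get("class") or "").split()
        if "footable-row-detail-name" in classes:
            self._field = "name"
        elif "footable-row-detail-value" in classes:
            self._field = "value"
        if self._field is not None:
            self._depth, self._chunks = 1, []

    def handle_endtag(self, tag):
        if tag != "div" or self._field is None:
            return
        self._depth -= 1
        if self._depth:
            return  # still inside the name/value div
        text = ", ".join(self._chunks)
        if self._field == "name":
            self._name = text
        else:
            self.rows.append((self._name, text))
            self._name = None
        self._field = None

    def handle_data(self, data):
        if self._field is not None and data.strip():
            self._chunks.append(data.strip())


sample = """
<div class="footable-row-detail-row">
    <div class="footable-row-detail-name">Cognome:</div>
    <div class="footable-row-detail-value">ABBONDANZIERI Katia</div>
</div>
<div class="footable-row-detail-row">
    <div class="footable-row-detail-name">Discipline(s) thérapeutique(s):</div>
    <div class="footable-row-detail-value">
        <div class="thZone">
            <div class="zCat">METHODES REFLEXES</div>
            <div class="zThr">Riflessoterapia generale</div>
        </div>
    </div>
</div>
"""

parser = DetailRowParser()
parser.feed(sample)
details = dict(parser.rows)

print(details.get("Cognome:"))  # ABBONDANZIERI Katia
print(details.get("Fax:"))      # None, because the sample has no such row
```

Using `dict.get` covers the "if found" case: a field that a table does not have simply comes back as `None`. Note that turning the rows into a dict keeps only the last value when a name occurs twice, as "Discipline(s) thérapeutique(s):" does in the question's HTML.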

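Answer 1 (score: 0)

This runs for me. I am using Jupyter and ran this code line by line. You may get errors when an element has not loaded yet, so adjust the code if that happens.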

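from io import StringIO
import time

import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys

# Uses the find_element(By..., ...) locator style; the older
# find_element_by_* helpers no longer exist in current Selenium 4 releases.
driver = webdriver.Chrome()
driver.get("http://asca.ch/Partners.aspx?lang=it")

# Select the canton "GE" in the drop-down.
cantone = driver.find_element(By.XPATH, '//*[@id="ctl00_MainContent_ddl_cantons_Input"]')
cantone.click()
cantone.send_keys('GE')
cantone.send_keys(Keys.ENTER)

# Accept the disclaimer and submit the search.
confermo = driver.find_element(By.XPATH, '//*[@id="MainContent__chkDisclaimer"]')
confermo.click()
ricerca = driver.find_element(By.XPATH, '//*[@id="MainContent_btn_submit"]')
ricerca.click()

# Poll until the result rows and their toggles have loaded.
toggle = driver.find_elements(By.CLASS_NAME, "footable-toggle")
print(toggle)
while not toggle:
    time.sleep(.2)
    toggle = driver.find_elements(By.CLASS_NAME, "footable-toggle")

# Expand every row so that its detail cell gets rendered.
for r in toggle:
    time.sleep(.2)
    r.click()

# Poll until the detail cells are present.
data = driver.find_elements(By.CLASS_NAME, "footable-row-detail-cell")
while not data:
    time.sleep(.2)
    data = driver.find_elements(By.CLASS_NAME, "footable-row-detail-cell")

list_df = []
for r in data:
    # Rewrite the div-based detail block as table markup that pandas can parse.
    datum = r.get_attribute('innerHTML')\
        .replace("""<div class="footable-row-detail-inner">""","<table>")\
        .replace("""<div class="footable-row-detail-row">""","<tr>")\
        .replace("""<div class="footable-row-detail-name">""","<td>")\
        .replace("""<div class="footable-row-detail-value">""","</td><td>")
    # Each parsed row is a [name, value] pair, so the rows convert directly to a dict.
    # StringIO is used because recent pandas versions expect a file-like object here.
    list_df.append(dict(pd.read_html(StringIO(datum))[0].values.tolist()))

# One row per practitioner; fields missing from a table end up as NaN.
df = pd.DataFrame(list_df)
df.to_csv('data.csv')
print(df)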
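The polling loops in this answer keep spinning forever if the elements never appear. One possible adjustment is a small helper with a timeout, sketched here under the assumption that the lookup is wrapped in a callable. `poll_until` is a hypothetical name and is not part of Selenium.

```python
import time


def poll_until(fetch, timeout=10.0, interval=0.2):
    """Call `fetch` until it returns a non-empty result or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        result = fetch()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"nothing found within {timeout} seconds")
        time.sleep(interval)
```

Each `while not ...` loop could then become a single call, passing a lambda that wraps whichever find_elements lookup the script uses. Selenium also ships its own `WebDriverWait` class for this purpose, which may be the better fit in a larger script.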