Scraping a multi-page website with BeautifulSoup and Selenium returns a list of empty strings

Time: 2021-07-29 09:36:44

Tags: html python-3.x selenium beautifulsoup slice

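I want to scrape text repeatedly from a website. Every page of the site has the same HTML structure. I use Selenium to navigate to the next page, each time appending the following strings: text_i_want1, text_i_wantA, text_i_wantB, text_i_wantC.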

[<div class="col-12">
                            <a href="/url" target="_blank" title="ad i">
                                text_i_want1
                            </a>
                        </div>, 
                  <div class="col-12">
                            <div class="row">
                                <div>
                                    date: text_i_wantA
                                </div>
                            </div> 

                            
                                <div class="row">
                                    <div>
                                        source: text_i_wantB
                                    </div>
                                </div>
                             
                
                            <div class="row">
                                <div>
                                    number: text_i_wantC
                                    
                                    <span class="processlink">
                                        <a href="url" title="text_i_dont_want">
                                            text_i_dont_want
                                        </a>
                                    </span>
                                    
                                </div>
                                  
                            </div>

                            
                                
                        </div>,
                 <div class="col-12">
                            <a href="/url" target="_blank" title="ad i">
                                text_i_want2
                            </a>
                        </div>, 
                 <div class="col-12">
                            <div class="row">
                                <div>
                                    date: text_i_wantAA
                                </div>
                            </div> 

                            
                                <div class="row">
                                    <div>
                                        source: text_i_wantBB
                                    </div>
                                </div>
                             
                
                            <div class="row">
                                <div>
                                    number: text_i_wantCC
                                    
                                    <span class="processlink">
                                        <a href="/url" title="text_i_dont_want">
                                            text_i_dont_want
                                        </a>
                                    </span>
                                    
                                </div>
                                  
                            </div>

                            
                                
                        </div>,
                  <div class="col-12">
                            <a href="/url" target="_blank" title="ad i">
                                text_i_want3
                            </a>
                        </div>, 
                  <div class="col-12">
                            <div class="row">
                                <div>
                                    date: text_i_wantAAA
                                </div>
                            </div> 

                            
                                <div class="row">
                                    <div>
                                        source: text_i_wantBBB
                                    </div>
                                </div>
                             
                
                            <div class="row">
                                <div>
                                    number: text_i_wantCCC
                                    
                                    <span class="processlink">
                                        <a href="/url" title="text_i_dont_want">
                                            text_i_dont_want
                                        </a>
                                    </span>
                                    
                                </div>
                                  
                            </div>

                            
                                
                        </div>, 
                 <div class="col-12">
                            .  
                            . 
                            . 
                            . 
                        </div>]

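Because text_i_want1 is not in the same div as text_i_wantA, text_i_wantB, and text_i_wantC, I use BeautifulSoup to get all <div class="col-12"> elements but slice the output with [1::2], so that I iterate only over every second <div class="col-12"> to get text_i_wantA, text_i_wantB, and text_i_wantC. For readability, I have included only three of the 20 otherwise identically structured <div class="col-12"> elements per page.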
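The slicing idea described here can be sketched on a trimmed copy of the sample markup. This is only an illustration: the `HTML` string is a shortened stand-in for the real page, and `records` is a name made up for the example. It pairs every even-indexed `col-12` div (the title) with the odd-indexed one that follows it (the details).

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the page: title div, details div, title div, details div
HTML = """
<div class="col-12"><a href="/url" title="ad i"> text_i_want1 </a></div>
<div class="col-12">
  <div class="row"><div> date: text_i_wantA </div></div>
  <div class="row"><div> source: text_i_wantB </div></div>
</div>
<div class="col-12"><a href="/url" title="ad i"> text_i_want2 </a></div>
<div class="col-12">
  <div class="row"><div> date: text_i_wantAA </div></div>
  <div class="row"><div> source: text_i_wantBB </div></div>
</div>
"""

soup = BeautifulSoup(HTML, "html.parser")
cols = soup.find_all("div", class_="col-12")

records = []
# cols[0::2] are the title divs, cols[1::2] the detail divs that follow them
for title_div, detail_div in zip(cols[0::2], cols[1::2]):
    title = title_div.a.get_text(strip=True)
    first_row = detail_div.select_one("div.row > div").get_text(strip=True)
    records.append((title, first_row))

print(records)
```

Note that this only works while the two kinds of div strictly alternate; any additional `col-12` element on the page shifts the slices by one.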

title,date,name,number = [],[],[],[]
while True:
    soup = bs(driver.page_source, 'html5lib')
    for div in soup.find_all('a', attrs={'title':'ad i'}):
        titl = div.get_text(strip=True)
        title.append(titl)
    else:
        break
    for col in soup.find_all('div', attrs={'class':'col-12'})[1::2]:
        row = []
        for entry in col.select('div.row div'):
            target = entry.find_all(text=True, recursive=False)
            row.append(target[0].strip())
        name.append(row[0])
        date.append(row[1])
        number.append(row[2])  

    next_btn = driver.find_elements_by_css_selector(".page-next button")
    if next_btn:
        actions = ActionChains(driver)
        actions.move_to_element(next_btn[0]).click().perform()
        time.sleep(4)
    else:
        break
driver.close()

Expected output:

title = ["text_i_want1", "text_i_want2", ...]

date = ["text_i_wantA", "text_i_wantAA", ...]

name = ["text_i_wantB", "text_i_wantBB", ...]

number = ["text_i_wantC", "text_i_wantCC", ...]

Problem: actual output

title = ["text_i_want1", "text_i_want2", ...]

date = ['text_i_wantA', 'text_i_wantAA', ...]

name = ['', '', '', '', '', '', '', '', '', '']

number = ['', '', '', '', '', '', '', '', '', '']

Why are name and number empty even though the HTML contains string values? Is it a problem with the CSS selector or with the loop itself?
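One possible explanation, offered as a guess since the live markup is not shown here: `find_all(text=True, recursive=False)` returns only the text nodes that are direct children of the `div`, and `target[0]` is the first of them. If the real page puts a child element before the visible text, that first node is just whitespace and `.strip()` yields `''`. The sketch below uses a hypothetical variant of the row markup (the `<span>` before the text is an assumption, not taken from the page) to show the effect and one way to skip blank nodes. Separately, as posted, the `else: break` attached to the first `for` loop runs whenever that loop finishes without hitting `break`, which would exit the `while` loop before the rows are read.

```python
from bs4 import BeautifulSoup

# Hypothetical row: a child tag sits BEFORE the text we want (assumption)
html = (
    '<div class="row"><div>\n'
    '  <span>icon</span>\n'
    '  source: text_i_wantB\n'
    '</div></div>'
)

entry = BeautifulSoup(html, "html.parser").select_one("div.row div")

# Direct text children only; the first one is the whitespace before <span>
direct_text = entry.find_all(string=True, recursive=False)
first = direct_text[0].strip()

# Keeping only the non-blank direct text nodes avoids the empty string
non_empty = [s.strip() for s in direct_text if s.strip()]

print(repr(first))
print(non_empty)
```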

Updated question: integration

DRIVER_PATH = 'chromedriver.exe'
options = webdriver.ChromeOptions()
options.add_argument("--no-sandbox")
prefs = {"profile.default_content_settings.popups": 0,
         "download.default_directory": r"C:\Users\aaa",
         "directory_upgrade": True,
         "plugins.always_open_pdf_externally": True}
options.add_experimental_option("prefs",prefs)
driver = webdriver.Chrome(executable_path=DRIVER_PATH, options=options)
driver.get('https://parldok.thueringen.de/ParlDok/formalkriterien')
driver.maximize_window()
try:
    selenium.webdriver.support.ui.WebDriverWait(driver, 5).until(EC.visibility_of_element_located((By.ID, 'LegislaturperiodenList-button')))
    driver.execute_script("document.getElementById('LegislaturperiodenList').style.display='inline-block';")
    element = selenium.webdriver.support.ui.WebDriverWait(driver, 5).until(EC.visibility_of_element_located((By.ID, 'LegislaturperiodenList')))
    selenium.webdriver.support.ui.Select(element).select_by_value('7')
except Exception as ex:
    print(ex)

try:
    selenium.webdriver.support.ui.WebDriverWait(driver, 5).until(EC.visibility_of_element_located((By.ID, 'LegislaturperiodenList-button')))
    driver.execute_script("document.getElementById('DokumententypId').style.display='inline-block';")
    element = selenium.webdriver.support.ui.WebDriverWait(driver, 5).until(EC.visibility_of_element_located((By.ID, 'DokumententypId')))
    selenium.webdriver.support.ui.Select(element).select_by_value('10')
except Exception as ex:
    print(ex)
driver.find_element_by_css_selector('button[class="btn btn-primary"][type="submit"]').click()

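This is how I set up Selenium so that I can navigate to the next page. Can you help me put things together? I don't know how to combine your approach with Selenium.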
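One way to think about combining the two, sketched under assumptions: keep the parsing in a function that only sees an HTML string, and pass the browser in as two callables. `parse_titles`, `scrape_all`, `get_html`, and `go_next` are hypothetical names for this illustration, and the fake pages stand in for a real browser so the sketch runs without Selenium.

```python
from typing import Callable, List

from bs4 import BeautifulSoup


def parse_titles(html: str) -> List[str]:
    """Extract the text of every <a title="ad i"> link from one page."""
    soup = BeautifulSoup(html, "html.parser")
    links = soup.find_all("a", attrs={"title": "ad i"})
    return [a.get_text(strip=True) for a in links]


def scrape_all(get_html: Callable[[], str], go_next: Callable[[], bool]) -> List[str]:
    """Parse the current page, then advance until go_next() reports no next page."""
    titles: List[str] = []
    while True:
        titles.extend(parse_titles(get_html()))
        if not go_next():
            break
    return titles


# Stand-in for a browser: two fake pages and a page pointer
pages = ['<a title="ad i"> text_i_want1 </a>', '<a title="ad i"> text_i_want2 </a>']
state = {"index": 0}


def fake_html() -> str:
    return pages[state["index"]]


def fake_next() -> bool:
    if state["index"] + 1 < len(pages):
        state["index"] += 1
        return True
    return False


result = scrape_all(fake_html, fake_next)
print(result)
```

With a real browser, `get_html` could be `lambda: driver.page_source`, and `go_next` could wrap the `.page-next button` click from the loop in the question, returning `False` when no button is found.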

1 answer:

Answer 0 (score: 0)

Updated answer

import requests
from bs4 import BeautifulSoup
import pandas as pd
from math import ceil


allin = []


def parser(soup):
    goal = (
        (
            x.select('a')[1].get_text(strip=True),
            *list(x.select('.col-12')[1].stripped_strings)[:-1]
        )
        for x in soup.select('.row.tlt_search_result'))
    allin.append(pd.DataFrame(goal))


def main(url):
    with requests.Session() as req:
        data = {
            "LegislaturPeriodenNummer": "7",
            "UrheberPersonenId": "",
            "UrheberSonstigeId": "",
            "DokumententypId": "10",
            "BeratungsstandId": "",
            "Datum": "",
            "DatumVon": "",
            "DatumBis": ""
        }
        r = req.post(url, data=data)
        soup = BeautifulSoup(r.text, 'lxml')
        print("Extracting Page# 1")
        parser(soup)

        try:
            nextpage = int(soup.select_one(
                '.pd_resultcount').contents[0].split()[-1]) / 10

            for page in range(2, ceil(nextpage) + 1):
                print(f"Extracting Page# {page}")
                r = req.get(f"{url}/{page}")
                soup = BeautifulSoup(r.text, 'lxml')
                parser(soup)
        except AttributeError:
            print('No More Result Found!')


if __name__ == "__main__":
    main('https://parldok.thueringer-landtag.de/ParlDok/formalkriterien')
    final = pd.concat(allin, ignore_index=True)
    print(final)
    final.to_csv('data.csv', index=False)

Output:

                                                      0  ...                       3
0     GRW-Fördermittelanträge eines Fertigteil-Herst...  ...  Dokumentnummer: 7/2303
1     Vertretung der Menschen mit Behinderungen in T...  ...  Dokumentnummer: 7/2307
2     Rassistische und rechtsextremistische Aktivitä...  ...  Dokumentnummer: 7/2306
3     Antisemitische Überfälle, Leugnung des Holocau...  ...  Dokumentnummer: 7/2302
4     Finanzierung von Kindertagesstätten in Thüring...  ...  Dokumentnummer: 7/2301
...                                                 ...  ...                     ...
2299               NaturFreunde Thüringen e.V. - Teil I  ...     Dokumentnummer: 7/6
2300  Aktuelle Sicherheitslage für Thüringer Kunst- ...  ...     Dokumentnummer: 7/5
2301  Stand der Planungen zur Ortsumgehung der Stadt...  ...     Dokumentnummer: 7/3
2302  Übergangsbestimmungen zur Neuordnung der Organ...  ...     Dokumentnummer: 7/2
2303  Baustellen entlang der Autobahn 71 zwischen de...  ...     Dokumentnummer: 7/1

[2304 rows x 4 columns]
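The page arithmetic in the code above can be looked at in isolation. The sketch below assumes 10 results per page (which matches the 10-row output of a single page) and uses the 2304-row total from the output; `page_urls` is a hypothetical helper name. Page 1 comes from the POST request, so the list starts at page 2.

```python
from math import ceil


def page_urls(base_url, total_results, per_page=10):
    """Build the URLs for page 2 up to the last page; page 1 is the POST response."""
    last_page = ceil(total_results / per_page)
    return [f"{base_url}/{page}" for page in range(2, last_page + 1)]


base = "https://parldok.thueringer-landtag.de/ParlDok/formalkriterien"
urls = page_urls(base, 2304)

print(len(urls))   # number of follow-up pages
print(urls[0])     # first follow-up page
print(urls[-1])    # last page
```

If the total fits on one page, the list is empty and no further requests are needed.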

# Single-page variant: fetches and parses only the first page of results
import requests
from bs4 import BeautifulSoup
import pandas as pd


def main(url):
    data = {
        "LegislaturPeriodenNummer": "7",
        "UrheberPersonenId": "",
        "UrheberSonstigeId": "",
        "DokumententypId": "10",
        "BeratungsstandId": "",
        "Datum": "",
        "DatumVon": "",
        "DatumBis": ""
    }
    r = requests.post(url, data=data)
    soup = BeautifulSoup(r.text, 'lxml')
    goal = (
        (
            x.select('a')[1].get_text(strip=True),
            *list(x.select('.col-12')[1].stripped_strings)[:-1]
        )
        for x in soup.select('.row.tlt_search_result'))
    df = pd.DataFrame(goal)
    print(df)


main('https://parldok.thueringer-landtag.de/ParlDok/formalkriterien')

Output:

                                                   0  ...                       3
0  GRW-Fördermittelanträge eines Fertigteil-Herst...  ...  Dokumentnummer: 7/2303
1  Vertretung der Menschen mit Behinderungen in T...  ...  Dokumentnummer: 7/2307
2  Rassistische und rechtsextremistische Aktivitä...  ...  Dokumentnummer: 7/2306
3  Antisemitische Überfälle, Leugnung des Holocau...  ...  Dokumentnummer: 7/2302
4  Finanzierung von Kindertagesstätten in Thüring...  ...  Dokumentnummer: 7/2301
5        Ausstattung der unteren Naturschutzbehörden  ...  Dokumentnummer: 7/2300
6  Antifa-Szene, insbesondere das Arnstädter "Akt...  ...  Dokumentnummer: 7/2291
7  Finanzierung der Beschaffung von Ausrüstung, A...  ...  Dokumentnummer: 7/2309
8                       Statistik der Kfz-Diebstähle  ...  Dokumentnummer: 7/2308
9  Unterstützung des Freistaats Thüringen für Sta...  ...  Dokumentnummer: 7/2299

[10 rows x 4 columns]
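The `stripped_strings` expression with `[:-1]` used in both versions can be tried on the sample markup from the question. This is a minimal sketch: the `HTML` string is copied from the question's example rather than the live site, and splitting off the `date:` / `source:` / `number:` labels is an extra step that is not part of the answer's code.

```python
from bs4 import BeautifulSoup

# Details div from the question's sample, including the unwanted link at the end
HTML = """
<div class="col-12">
  <div class="row"><div> date: text_i_wantA </div></div>
  <div class="row"><div> source: text_i_wantB </div></div>
  <div class="row"><div>
      number: text_i_wantC
      <span class="processlink">
        <a href="url" title="text_i_dont_want"> text_i_dont_want </a>
      </span>
  </div></div>
</div>
"""

detail = BeautifulSoup(HTML, "html.parser").select_one(".col-12")

# Every non-blank text fragment in document order, whitespace trimmed
all_strings = list(detail.stripped_strings)

# The last fragment is the link text we do not want, so drop it
wanted = all_strings[:-1]

# Optional extra step: keep only the part after the "label: " prefix
values = [s.split(": ", 1)[1] for s in wanted]

print(all_strings)
print(values)
```

Because `stripped_strings` skips whitespace-only nodes, it does not depend on where the text sits relative to child tags, which is what makes it more forgiving than indexing the first direct text node.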