How do I find the checkboxes under a specific heading using bs4?

Date: 2017-06-24 12:57:18

Tags: python python-3.x web-scraping beautifulsoup

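I am trying to scrape the checkboxes (or all of the information) for certain questions at the URL below.

For example, I want to find the information under the heading "01.1. Select the category which best represents your primary activity." If it does not exist, I want a blank space instead.

Here is my current code: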

from splinter import *
import bs4 as bs
import os
import time
import csv
from selenium.common.exceptions import ElementNotVisibleException

path = os.getcwd()+'/chromedriver.exe'
executable_path = {'executable_path': path}
browser = Browser('chrome', **executable_path)

urls = ['https://www.unpri.org/organisation/folksam-143819']

for i in urls:
    browser.visit(i)
    window = browser.windows[0]
    window.is_current = True
    temp_list = []
    sourcenew = browser.html
    soupnew = bs.BeautifulSoup(sourcenew, 'lxml')
    temp_list.append(browser.url)


    for info in soupnew.find_all('span', class_ = 'org-type' ):
        string_com = str(info.text)
        if len(string_com) == 16:
            string_com = string_com.replace(' ', ' ')[1:-1]
        elif len(string_com) == 11:
            string_com = string_com.replace(' ', ' ')[1:-1]
        elif len(string_com) == 10:
            string_com = string_com.replace(' ', ' ')[1:-1]
        elif len(string_com) == 12:
            string_com = string_com.replace(' ', ' ')[1:-1]
        elif len(string_com) == 13:
            string_com = string_com.replace(' ', ' ')[1:-1]
        else:
            string_com = string_com.replace(' ', ' ')[40:-37]
        temp_list.append(string_com)
    if len(browser.find_by_xpath('//*[@id="main-content"]/div[2]/div/div/div[2]/p/a')) > 0:
        browser.find_by_xpath('//*[@id="main-content"]/div[2]/div/div/div[2]/p/a').click()
        time.sleep(2)
        if len(browser.windows) > 1:
            window = browser.windows[1]
            window.is_current = True

            sourcenew2 = browser.html
            soupnew2 = bs.BeautifulSoup(sourcenew2, 'lxml')



        oo = soupnew2.find_all('h3', class_ = 'n-h3')
        for o in oo:
            print(o)
            if """Select the category which best represents your primary activity.""" in o:
                t = o.find('img', class_='readradio')
                if t and '/Style/img/checkedradio.png' in t.get('src'):
                    content = o.find('span', class_='title')
                    temp_list.append(content.text.strip())
                    print(temp_list)

However, this produces no output. I would like the output to be:

    ["Insurance company"]

if the question has been answered, and

    [" "]

if it has not.

1 answer:

Answer 0 (score: 0)

You can achieve this with the following pattern:

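1) Iterate over each tag whose class is indent type_^ parent_S to get the sub-questions;

2) For each h3 (sub-question), look for:
    - a fake radio button (an img) whose source is /Style/img/checkedradio.png;
    - a real radio button with the checked attribute;

3) If either one is found, create a key-value pair and insert it into a previously created dict;

4) If neither is found, create a key-value pair with an empty value and insert it into the same dict;

5) Analyze the data.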
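The steps above can be sketched on a small inline document, so the logic can be checked without requesting the live page. The markup in SAMPLE_HTML is an assumption modeled on the class names and image path mentioned in this thread, not a copy of the real page, and selected_answers is a hypothetical helper name. For simplicity the sketch matches blocks on the single class parent_S and assumes one h3 per block.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, loosely modeled on the names used in this thread.
SAMPLE_HTML = """
<div class="indent type_^ parent_S">
  <h3 class="n-h3">01.1. Select the category which best represents your primary activity.</h3>
  <div><img class="readradio" src="/Style/img/uncheckedradio.png"><span class="title">Asset owner</span></div>
  <div><img class="readradio" src="/Style/img/checkedradio.png"><span class="title">Insurance company</span></div>
</div>
<div class="indent type_^ parent_S">
  <h3 class="n-h3">01.2. Additional information. [Optional]</h3>
</div>
<div class="indent type_^ parent_S">
  <h3 class="n-h3">01.3. Hypothetical yes/no question.</h3>
  <input type="radio" checked="checked" data-original="Yes">
</div>
"""

CHECKED_SRC = "/Style/img/checkedradio.png"


def selected_answers(html):
    """Map each sub-question heading to its selected answer, or "" if none."""
    soup = BeautifulSoup(html, "html.parser")
    answers = {}
    # Step 1: one block per sub-question.
    for block in soup.find_all("div", class_="parent_S"):
        header = block.find("h3")
        if header is None:
            continue
        key = header.get_text(strip=True)
        # Step 2: look for a fake (img) or a real (input) checked radio button.
        fake = block.find("img", src=CHECKED_SRC)
        real = block.find("input", checked=True)
        if fake is not None:
            # Step 3: the label is assumed to sit next to the image.
            label = fake.parent.find("span")
            answers[key] = label.get_text(strip=True) if label else ""
        elif real is not None:
            answers[key] = real.get("data-original", "")
        else:
            # Step 4: unanswered questions get an empty value.
            answers[key] = ""
    return answers


answers = selected_answers(SAMPLE_HTML)
print(answers)
```

To try this against the real report, pass the downloaded page text to selected_answers instead of SAMPLE_HTML and adjust the selectors if the markup differs.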

Here is a snippet that you can process further:

import re

import requests
from bs4 import BeautifulSoup

URL = "https://reporting.unpri.org/surveys/PRI-Reporting-Framework-2016/680d94eb-3777-49f7-a1c0-3f0ac42b8b5e/79894dbc337a40828d895f9402aa63de/html/2/?lang=&a=1"
soup = BeautifulSoup(requests.get(URL).text, "html.parser")

# Every sub-question block carries exactly this class attribute.
parent = soup.select('div[class="indent type_^ parent_S"]')
header_values = {}

for r in parent:
    # Both lookups search the whole block, so every h3 in the same block gets the same value.
    fake_radio_button = r.find("img", src="/Style/img/checkedradio.png")
    # select() always returns a list (possibly empty), never None.
    real_radio_button = r.select("input[checked='checked']")

    for header in r.find_all("h3"):
        key = re.sub(r'[\t\r\n]', '', header.get_text(strip=True))
        if fake_radio_button is not None:
            header_values[key] = fake_radio_button.parent.find("span").get_text(strip=True)
        elif real_radio_button:
            header_values[key] = real_radio_button[0].attrs["data-original"]
        else:
            header_values[key] = ""

print(header_values)

This outputs:

{'01.1. Select the category which best represents your primary activity.': 'Insurance company', '01.2. Additional information. [Optional]': 'Insurance company',....}
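As for why the loop in the question prints nothing, one likely cause is the test `"..." in o`. On a bs4 Tag, `in` checks membership among the tag's direct children rather than searching its text, so a substring of the heading never matches. The sketch below shows the difference on a hypothetical heading (the real markup may differ). Note also that the question calls o.find('img') on the h3 itself, which only finds descendants of the heading; the radio images probably live in the enclosing block, which is why this answer searches the parent div.

```python
from bs4 import BeautifulSoup

# Hypothetical heading markup; the real page may differ.
html = (
    '<h3 class="n-h3">01.1. Select the category which best '
    'represents your primary activity.</h3>'
)
heading = BeautifulSoup(html, "html.parser").h3
needle = "Select the category which best represents your primary activity."

# `in` on a Tag tests membership in its list of direct children,
# so a substring of the heading text is not found.
found_in_tag = needle in heading

# A substring search on the extracted text behaves as intended.
found_in_text = needle in heading.get_text(strip=True)

print(found_in_tag, found_in_text)
```

Replacing the condition in the question with a check against o.get_text(strip=True) should at least let the heading match.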