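I'm trying to scrape the checkboxes (or all of the information) for certain questions at the URL below.
For example, I want to find the information under the heading "01.1. Select the category which best represents your primary activity." If it doesn't exist, I want a blank value instead.
Here is my code so far: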
from splinter import *
import bs4 as bs
import os
import time
import csv
from selenium.common.exceptions import ElementNotVisibleException

path = os.getcwd()+'/chromedriver.exe'
executable_path = {'executable_path': path}
browser = Browser('chrome', **executable_path)

urls = ['https://www.unpri.org/organisation/folksam-143819']

for i in urls:
    browser.visit(i)
    window = browser.windows[0]
    window.is_current = True
    temp_list = []
    sourcenew = browser.html
    soupnew = bs.BeautifulSoup(sourcenew, 'lxml')
    temp_list.append(browser.url)
    for info in soupnew.find_all('span', class_ = 'org-type' ):
        string_com = str(info.text)
        if len(string_com) == 16:
            string_com = string_com.replace(' ', ' ')[1:-1]
        elif len(string_com) == 11:
            string_com = string_com.replace(' ', ' ')[1:-1]
        elif len(string_com) == 10:
            string_com = string_com.replace(' ', ' ')[1:-1]
        elif len(string_com) == 12:
            string_com = string_com.replace(' ', ' ')[1:-1]
        elif len(string_com) == 13:
            string_com = string_com.replace(' ', ' ')[1:-1]
        else:
            string_com = string_com.replace(' ', ' ')[40:-37]
        temp_list.append(string_com)
    if len(browser.find_by_xpath('//*[@id="main-content"]/div[2]/div/div/div[2]/p/a')) > 0:
        browser.find_by_xpath('//*[@id="main-content"]/div[2]/div/div/div[2]/p/a').click()
        time.sleep(2)
        if len(browser.windows) > 1:
            window = browser.windows[1]
            window.is_current = True
            sourcenew2 = browser.html
            soupnew2 = bs.BeautifulSoup(sourcenew2, 'lxml')
            oo = soupnew2.find_all('h3', class_ = 'n-h3')
            for o in oo:
                print(o)
                if """Select the category which best represents your primary activity.""" in o:
                    t = o.find('img', class_='readradio')
                    if t and '/Style/img/checkedradio.png' in t.get('src'):
                        content = o.find('span', class_='title')
                        temp_list.append(content.text.strip())
    print(temp_list)
However, this produces no output. I would like the output to be:
["Insurance company"]
if the question was answered, and
[" "]
if it was not.
Answer 0 (score: 0)
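You can achieve this with the following pattern:
1) Iterate over each tag with the class indent type_^ parent_S to get the sub-questions;
2) Iterate over each h3 (sub-question) and look for:
- a fake radio button (img) whose source is /Style/img/checkedradio.png;
- a real radio button that has the checked attribute;
3) If either one is found, create a key-value pair and insert it into a previously created dict;
4) If neither is found, create a key-value pair with an empty value and insert it into the same dict;
5) Analyze the data.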
Here is a snippet that you can build on:
import re

import requests
from bs4 import BeautifulSoup

url = "https://reporting.unpri.org/surveys/PRI-Reporting-Framework-2016/680d94eb-3777-49f7-a1c0-3f0ac42b8b5e/79894dbc337a40828d895f9402aa63de/html/2/?lang=&a=1"
soup = BeautifulSoup(requests.get(url).text, "html.parser")

# Each div with exactly this class string groups a set of sub-questions.
parent = soup.select('div[class="indent type_^ parent_S"]')
header_values = {}

for r in parent:
    for header in r.find_all("h3"):
        # Use the header text, with tabs and line breaks removed, as the key.
        key = re.sub(r'[\t\r\n]', '', header.get_text(strip=True))
        # Both lookups search the whole block r, not just this header,
        # so every h3 inside the same block receives the same value.
        fake_radio_button = r.find("img", src="/Style/img/checkedradio.png")
        real_radio_button = r.select("input[checked='checked']")
        if fake_radio_button is not None:
            header_values[key] = fake_radio_button.parent.find("span").get_text(strip=True)
        elif real_radio_button:
            header_values[key] = real_radio_button[0].attrs["data-original"]
        else:
            # Nothing is checked: store an empty value.
            header_values[key] = ""
This will output:
{'01.1. Select the category which best represents your primary activity.': 'Insurance company', '01.2. Additional information. [Optional]': 'Insurance company',....}
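In the output above, 01.2 gets the same value as 01.1 because the snippet searches the whole parent block for every header. If each sub-question should get only its own answer (and an empty string when nothing is checked), one option is to search just the elements between one h3 and the next. The following is a minimal sketch under assumptions: SAMPLE_HTML is a made-up fragment, not the real page markup, and answers_by_header is a hypothetical helper name, so the sibling layout may need adjusting for the actual site.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

CHECKED_SRC = "/Style/img/checkedradio.png"

# Hypothetical markup for illustration only; the real page may be laid out differently.
SAMPLE_HTML = """
<div class="indent type_^ parent_S">
  <h3 class="n-h3">01.1. Select the category which best represents your primary activity.</h3>
  <div class="answers">
    <div><img class="readradio" src="/Style/img/uncheckedradio.png"><span class="title">Bank</span></div>
    <div><img class="readradio" src="/Style/img/checkedradio.png"><span class="title">Insurance company</span></div>
  </div>
  <h3 class="n-h3">01.2. Additional information. [Optional]</h3>
  <div class="answers"></div>
</div>
"""


def answers_by_header(html):
    """Map each h3 text to the label of its checked fake radio button, or ""."""
    soup = BeautifulSoup(html, "html.parser")
    result = {}
    for header in soup.find_all("h3"):
        key = header.get_text(strip=True)
        result[key] = ""  # default: blank when nothing is checked
        for sibling in header.next_siblings:
            if not isinstance(sibling, Tag):
                continue  # skip whitespace text nodes between elements
            if sibling.name == "h3":
                break  # reached the next sub-question
            checked = sibling.find("img", src=CHECKED_SRC)
            if checked is not None:
                label = checked.parent.find("span")
                if label is not None:
                    result[key] = label.get_text(strip=True)
                break
    return result


print(answers_by_header(SAMPLE_HTML))
```

To pick out a single question, compare against the key text, for example `"Select the category" in key`. Testing `"..." in o` on the h3 tag itself, as the question's code does, checks the tag's child nodes rather than doing a substring search on its text.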