How do I get the text from checked boxes using bs4?

Time: 2017-06-09 20:18:40

Tags: python web-scraping beautifulsoup bs4

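I am trying to get all the labels (text) of the checked boxes (that is, the questions that were answered) on the following website.

However, I don't seem to get any text.

The way I want to do the scraping is to first collect all the links: on the right-hand side you can switch between pages. It looks like all the links for this list are 2 ...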

Here is my current code (the link is included in it):

import bs4 as bs
from splinter import Browser
import time

executable_path = {'executable_path': 'C:/users/chromedriver.exe'}
browser = Browser('chrome', **executable_path)

main_url = 'https://reporting.unpri.org/surveys/PRI-Reporting-Framework-2016/0ad07cdc-cfbc-4c5b-a79f-2b07e93d8521/79894dbc337a40828d895f9402aa63de/html/2/?lang=&a=1'
browser.visit(main_url)
source = browser.html
soup = bs.BeautifulSoup(source, 'lxml')
base_url = main_url[:-51]
urls = []
print(base_url)

for i in soup.find_all('div', class_='accordion-inner n-accordion-link'):
    for j in soup.find_all('a', class_='tooltiper'):
        urls.append(j['href'])

    print(urls)

result = []
for k in urls:
    ext = k[8:]
    browser.visit(base_url + ext)
    source1 = browser.html
    soup1 = bs.BeautifulSoup(source1, 'lxml')
    temp_list = []
    print(browser.url)
    for img in soup1.find_all('img', class_='readradio'):
        for t in img['src']:
            if t == '/Style/img/checkedradio.png':
                for x in soup1.find_all('span', class_='title'):
                    txt = str(x.string)
                    temp_list.append(txt)

    result.append(temp_list)
print(result)

I get the following output for the result list, which should contain the text:

[[], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], []]

Updated with the suggested code:

import bs4 as bs
from splinter import Browser
import time

executable_path = {'executable_path': '/users/nichlasrasmussen/documents/webdrivers/phantomjs'}
browser = Browser('phantomjs', **executable_path)

main_url = 'https://reporting.unpri.org/surveys/PRI-Reporting-Framework-2016/0ad07cdc-cfbc-4c5b-a79f-2b07e93d8521/79894dbc337a40828d895f9402aa63de/html/2/?lang=&a=1'
browser.visit(main_url)
source = browser.html
soup = bs.BeautifulSoup(source, 'lxml')
base_url = main_url[:-51]
urls = []
print(base_url)

# Collect the links to the individual pages.
for i in soup.find_all('div', class_='accordion-inner n-accordion-link'):
    for j in soup.find_all('a', class_='tooltiper'):
        urls.append(j['href'])

    print(urls)

result = []
for k in urls:
    ext = k[8:]
    browser.visit(base_url + ext)
    source1 = browser.html
    soup1 = bs.BeautifulSoup(source1, 'lxml')
    temp_list = []
    print(browser.url)
    # Start from each label and inspect the radio image inside it.
    for label in soup1.find_all('label', class_='radio'):
        t = label.find('img', class_='readradio')
        if t is not None and 'checkedradio' in t['src']:
            # Look up the title inside this label, not in the whole page.
            content = label.find('span', class_='title')
            temp_list.append(content.text)

    # One list of checked titles per visited page.
    result.append(temp_list)
print(result)


1 answer:

Answer 0: (score: 0)

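You basically just need to reference the parent of the img and span.title elements.

Start from that root (label.radio); there is no need for a big loop over the whole page.

Try this: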

soup1
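
A minimal sketch of the approach the answer describes, not the answerer's own code: iterate over each label.radio, then look up the img and the span.title inside that same label. The sample markup and the function name checked_titles are assumptions made for illustration, modelled on the class names in the question; the real page may be structured differently, and uncheckedradio.png is a hypothetical file name.

```python
from bs4 import BeautifulSoup

# Hypothetical markup based on the class names in the question.
SAMPLE_HTML = """
<div>
  <label class="radio">
    <img class="readradio" src="/Style/img/checkedradio.png" />
    <span class="title">Yes</span>
  </label>
  <label class="radio">
    <img class="readradio" src="/Style/img/uncheckedradio.png" />
    <span class="title">No</span>
  </label>
</div>
"""


def checked_titles(html):
    """Return the title text of every label whose radio image is checked."""
    soup = BeautifulSoup(html, "html.parser")
    titles = []
    for label in soup.find_all("label", class_="radio"):
        # Search inside this label only, so image and title belong together.
        img = label.find("img", class_="readradio")
        title = label.find("span", class_="title")
        if img is None or title is None:
            continue
        # Compare the file name exactly; a plain substring test would also
        # match a name such as the hypothetical "uncheckedradio.png".
        if img.get("src", "").endswith("/checkedradio.png"):
            titles.append(title.get_text(strip=True))
    return titles


print(checked_titles(SAMPLE_HTML))
```

In the original code, `for t in img['src']` iterates over the characters of the src string, so comparing each `t` with the full path never matches, which is consistent with the empty lists shown in the question.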