I'm trying to get all the labels (the text) of the checked boxes (i.e. the answered questions) on the website below. However, I don't seem to get any text back.
The way I want to scrape is to first collect all the links - on the right-hand side you can switch between pages, and all the links for this list seem to be 2 ...
Here is my current code (see the link referred to in it):

import bs4 as bs
from splinter import Browser
import time
executable_path = {'executable_path' :'C:/users/chromedriver.exe'}
browser = Browser('chrome', **executable_path)
main_url = 'https://reporting.unpri.org/surveys/PRI-Reporting-Framework-2016/0ad07cdc-cfbc-4c5b-a79f-2b07e93d8521/79894dbc337a40828d895f9402aa63de/html/2/?lang=&a=1'
browser.visit(main_url)
source = browser.html
soup = bs.BeautifulSoup(source, 'lxml')
base_url = main_url[:-51]
urls = []
print(base_url)
for i in soup.find_all('div', class_='accordion-inner n-accordion-link'):
    for j in soup.find_all('a', class_='tooltiper'):
        urls.append(j['href'])
print(urls)
result = []
for k in urls:
    ext = k[8:]
    browser.visit(base_url + ext)
    source1 = browser.html
    soup1 = bs.BeautifulSoup(source1, 'lxml')
    temp_list = []
    print(browser.url)
    for img in soup1.find_all('img', class_='readradio'):
        for t in img['src']:
            if t == '/Style/img/checkedradio.png':
                for x in soup1.find_all('span', class_='title'):
                    txt = str(x.string)
                    temp_list.append(txt)
    result.append(temp_list)
print(result)
I get the following output for the result list, which should contain the text:

[[], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], []]

Updated code, as suggested:
import bs4 as bs
from splinter import Browser
import time
executable_path = {'executable_path': '/users/nichlasrasmussen/documents/webdrivers/phantomjs'}
browser = Browser('phantomjs', **executable_path)
main_url = 'https://reporting.unpri.org/surveys/PRI-Reporting-Framework-2016/0ad07cdc-cfbc-4c5b-a79f-2b07e93d8521/79894dbc337a40828d895f9402aa63de/html/2/?lang=&a=1'
browser.visit(main_url)
source = browser.html
soup = bs.BeautifulSoup(source, 'lxml')
base_url = main_url[:-51]
urls = []
print(base_url)
for i in soup.find_all('div', class_='accordion-inner n-accordion-link'):
    for j in soup.find_all('a', class_='tooltiper'):
        urls.append(j['href'])
print(urls)
result = []
for k in urls:
    ext = k[8:]
    browser.visit(base_url + ext)
    source1 = browser.html
    soup1 = bs.BeautifulSoup(source1, 'lxml')
    temp_list = []
    print(browser.url)
    for label in soup1.find_all('label', class_='radio'):
        t = label.find('img', class_='readradio')
        if 'checkedradio' in t['src']:
            content = soup1.find('span', class_='title')
            temp_list.append(content.text)
    result.append(temp_list)
print(result)
Answer 0: (score: 0)
You basically just need to reference the img and span.title elements via their parent (label.radio); there's no need to search from the root (soup1). Try this: