I am trying to scrape some data from a website using BeautifulSoup. I can select the td tag, but it does not contain all the child tags I expect. My goal is to loop through the td tag with id="highlight_today" and retrieve all of today's events. The url I am trying to scrape is http://b-us.econoday.com/byweek.asp?containerId=eco-iframe-container, which is an iframe inside another page, http://www.bloomberg.com/markets/economic-calendar. I think this iframe may be the reason my for loop is not working and why I am not retrieving all the tags I expect inside the td, but my HTML experience is very limited, so I am not sure.
I expect to retrieve all the tags contained in the td, but my loop ends up stopping on this item in the html and never moves on to the next div in the td:
<span class="econoarticles"><a href="byshoweventfull.asp?fid=476382&cust=b-us&year=2016&lid=0&containerId=eco-iframe-container&prev=/byweek.asp#top">Daniel Tarullo Speaks<br></a></span>
There may well be a better way to accomplish this than my current code. I am still very much an amateur at this, so I am open to any and all suggestions on how to achieve my goal.
Answer 0 (score: 0)
soup.find() returns a single tag. Perhaps you meant to use find_all()?
Also, why do you expect to find multiple elements with a given id? HTML ids are (supposed to be) unique across the whole document.
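The difference is easy to see with a minimal self-contained example (the markup here is made up for illustration):

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: one td containing two spans.
html = '<td id="highlight_today"><span>a</span><span>b</span></td>'
soup = BeautifulSoup(html, "html.parser")

print(soup.find("span").text)      # find() -> only the first matching Tag: a
print(len(soup.find_all("span")))  # find_all() -> a list of every match: 2
```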
Answer 1 (score: 0)
Try this code, which uses the Selenium web driver.
from selenium import webdriver
import time

driver = webdriver.Firefox()  # optional argument; if not specified, Selenium searches the PATH
driver.get("http://b-us.econoday.com/byweek.asp?containerId=eco-iframe-container")
time.sleep(5)  # give the page time to load
table = driver.find_element_by_xpath("html/body/table/tbody/tr[4]/td/table[1]/tbody/tr/td/table[2]/tbody/tr[2]")
#table = driver.find_element_by_class_name('eventstable')
columns = table.find_elements_by_tag_name('td')
time.sleep(1)

# option 1: get the whole column at once
for col in columns:
    print(col.text)

# option 2: get the info row by row; the information is hidden in different classes
for col in columns:
    rows = col.find_elements_by_tag_name('div')
    for row in rows:
        print(row.text)
    rows = col.find_elements_by_tag_name('span')
    for row in rows:
        print(row.text)
The result for the news column above would be:
Market Focus »
Daniel Tarullo Speaks
10:15 AM ET
Baker-Hughes Rig Count
1:00 PM ET
John Williams Speaks
2:30 PM ET
You can then parse this string, using the different class names to search out the necessary information.
Answer 2 (score: 0)
There is a td with the id highlight_today, and all the children are contained inside that tag, so you just need to pull it; if you want to iterate over the children you can call find_all():
import requests
from bs4 import BeautifulSoup

url_to_scrape = 'http://b-us.econoday.com/byweek.asp?containerId=eco-iframe-container'
r = requests.get(url_to_scrape)
html = r.content
soup = BeautifulSoup(html, "html.parser")
event = soup.find('td', {'id': "highlight_today"})
for tag in event.find_all():
    print(tag)

Which would give you:

<div class="econoitems"><br/><span class="econoitems"><a href="byshoweventfull.asp?fid=476407&cust=b-us&year=2016&lid=0&containerId=eco-iframe-container&prev=/byweek.asp#top">Market Focus <span class="econo-item-arrow">»</span></a></span><br/></div>
<br/>
<span class="econoitems"><a href="byshoweventfull.asp?fid=476407&cust=b-us&year=2016&lid=0&containerId=eco-iframe-container&prev=/byweek.asp#top">Market Focus <span class="econo-item-arrow">»</span></a></span>
<a href="byshoweventfull.asp?fid=476407&cust=b-us&year=2016&lid=0&containerId=eco-iframe-container&prev=/byweek.asp#top">Market Focus <span class="econo-item-arrow">»</span></a>
<span class="econo-item-arrow">»</span>
<br/>
<br/>
<div class="itembreak"></div>
<br/>
<span class="econoarticles"><a href="byshoweventfull.asp?fid=476382&cust=b-us&year=2016&lid=0&containerId=eco-iframe-container&prev=/byweek.asp#top">Daniel Tarullo Speaks<br/></a></span>
<a href="byshoweventfull.asp?fid=476382&cust=b-us&year=2016&lid=0&containerId=eco-iframe-container&prev=/byweek.asp#top">Daniel Tarullo Speaks<br/></a>
<br/>

The html is actually broken, so you need lxml or html5lib to parse it. Then, to get what you want, you need to find the spans with the econoarticles class and do a little extra work to get the time:

url_to_scrape = 'http://b-us.econoday.com/byweek.asp?containerId=eco-iframe-container'
r = requests.get(url_to_scrape)
html = r.content
soup = BeautifulSoup(html, "lxml")
event = soup.find('td', {'id': "highlight_today"})
for span in event.select("span.econoarticles"):
    speaker, time, a = span.text, span.find_next_sibling(text=True), span.a["href"]
    print(speaker, time, a)

Which, if we run it, gives you:

In [2]: import requests
   ...: from bs4 import BeautifulSoup
   ...: url_to_scrape = 'http://b-us.econoday.com/byweek.asp?containerId=eco-iframe-container'
   ...: r = requests.get(url_to_scrape)
   ...: html = r.content
   ...: soup = BeautifulSoup(html, "lxml")
   ...: event = soup.find('td', {'id': "highlight_today"})
   ...: for span in event.select("span.econoarticles"):
   ...:     speaker, time, a = span.text, span.find_next_sibling(text=True), span.a["href"]
   ...:     print(speaker, time, a)
   ...:
Daniel Tarullo Speaks 10:15 AM ET byshoweventfull.asp?fid=476382&cust=b-us&year=2016&lid=0&containerId=eco-iframe-container&prev=/byweek.asp#top
John Williams Speaks 2:30 PM ET byshoweventfull.asp?fid=476390&cust=b-us&year=2016&lid=0&containerId=eco-iframe-container&prev=/byweek.asp#top

In [3]:

If you also want the Market Focus entry and its url, just add a matching lookup for the span.econoitems class.
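Putting the pieces together, here is a sketch that pulls both the econoitems link (Market Focus) and the econoarticles entries. The class names come from the output shown above; the network fetch is replaced with an inline snippet mirroring that output, so the example is self-contained (the live page's broken markup would still need lxml or html5lib).

```python
from bs4 import BeautifulSoup

# Inline stand-in for r.content, shaped like the td printed above.
html = '''<td id="highlight_today">
<div class="econoitems"><br/><span class="econoitems"><a href="byshoweventfull.asp?fid=476407">Market Focus <span class="econo-item-arrow">»</span></a></span><br/></div>
<span class="econoarticles"><a href="byshoweventfull.asp?fid=476382">Daniel Tarullo Speaks<br/></a></span>10:15 AM ET<br/>
</td>'''

# html.parser is enough for this well-formed snippet; use "lxml" on the live page.
soup = BeautifulSoup(html, "html.parser")
event = soup.find('td', {'id': "highlight_today"})

# The Market Focus link lives in a span with the econoitems class.
for span in event.select("span.econoitems"):
    print(span.a.text.strip(), span.a["href"])

# Each speaker span is followed by a bare text node holding the time.
for span in event.select("span.econoarticles"):
    speaker = span.text.strip()
    when = str(span.find_next_sibling(text=True)).strip()
    print(speaker, when, span.a["href"])
```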