I am trying to scrape some data from a website using BeautifulSoup. I can select the td tag, but it does not contain all the child tags I expect. My goal is to loop through the td tag with id="highlight_today" and retrieve all of today's events. The url I am trying to scrape is http://b-us.econoday.com/byweek.asp?containerId=eco-iframe-container, which is an iframe inside another page, http://www.bloomberg.com/markets/economic-calendar. I think this iframe may be the reason my for loop is not working and why I am not retrieving all the tags I expect inside the td, but my HTML experience is very limited, so I am not sure.
I expect to retrieve all the tags contained in the td, but my loop ends up stopping on this item in the html and never moves on to the next div in the td:
<span class="econoarticles"><a href="byshoweventfull.asp?fid=476382&cust=b-us&year=2016&lid=0&containerId=eco-iframe-container&prev=/byweek.asp#top">Daniel Tarullo Speaks<br></a></span>
There may well be a better way to accomplish this than my current code. I am still very much an amateur at this, so I am open to any and all suggestions on how to achieve my goal.
Answer 0 (score: 0)
soup.find() returns a single tag. Perhaps you meant to use find_all()?
Also, why do you expect to find multiple elements with a given id? HTML ids are (supposed to be) unique across the whole document.
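The difference is easy to see with a minimal self-contained example (the markup here is made up for illustration):

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: one td containing two spans.
html = '<td id="highlight_today"><span>a</span><span>b</span></td>'
soup = BeautifulSoup(html, "html.parser")

print(soup.find("span").text)      # find() -> only the first matching Tag: a
print(len(soup.find_all("span")))  # find_all() -> a list of every match: 2
```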
Answer 1 (score: 0)
Try this code, which uses the Selenium web driver.
from selenium import webdriver
import time

driver = webdriver.Firefox()  # optional argument; if not specified, Selenium searches the PATH
driver.get("http://b-us.econoday.com/byweek.asp?containerId=eco-iframe-container")
time.sleep(5)  # give the page time to load
table = driver.find_element_by_xpath("html/body/table/tbody/tr[4]/td/table[1]/tbody/tr/td/table[2]/tbody/tr[2]")
#table = driver.find_element_by_class_name('eventstable')
columns = table.find_elements_by_tag_name('td')
time.sleep(1)

# option 1: get the whole column at once
for col in columns:
    print(col.text)

# option 2: get the info row by row; the information is hidden in different classes
for col in columns:
    rows = col.find_elements_by_tag_name('div')
    for row in rows:
        print(row.text)
    rows = col.find_elements_by_tag_name('span')
    for row in rows:
        print(row.text)
The result for the news column above would be:
Market Focus »
Daniel Tarullo Speaks
10:15 AM ET
Baker-Hughes Rig Count
1:00 PM ET
John Williams Speaks
2:30 PM ET
You can then parse this string, using the different class names to search out the necessary information.
Answer 2 (score: 0)
There is a td with the id highlight_today, and all the children are contained inside that tag, so you just need to pull it; if you want to iterate over the children you can call find_all():
import requests
from bs4 import BeautifulSoup

url_to_scrape = 'http://b-us.econoday.com/byweek.asp?containerId=eco-iframe-container'
r = requests.get(url_to_scrape)
html = r.content
soup = BeautifulSoup(html, "html.parser")
event = soup.find('td', {'id': "highlight_today"})
for tag in event.find_all():
    print(tag)

Which would give you:

<div class="econoitems"><br/><span class="econoitems"><a href="byshoweventfull.asp?fid=476407&cust=b-us&year=2016&lid=0&containerId=eco-iframe-container&prev=/byweek.asp#top">Market Focus <span class="econo-item-arrow">»</span></a></span><br/></div>
<br/>
<span class="econoitems"><a href="byshoweventfull.asp?fid=476407&cust=b-us&year=2016&lid=0&containerId=eco-iframe-container&prev=/byweek.asp#top">Market Focus <span class="econo-item-arrow">»</span></a></span>
<a href="byshoweventfull.asp?fid=476407&cust=b-us&year=2016&lid=0&containerId=eco-iframe-container&prev=/byweek.asp#top">Market Focus <span class="econo-item-arrow">»</span></a>
<span class="econo-item-arrow">»</span>
<br/>
<br/>
<div class="itembreak"></div>
<br/>
<span class="econoarticles"><a href="byshoweventfull.asp?fid=476382&cust=b-us&year=2016&lid=0&containerId=eco-iframe-container&prev=/byweek.asp#top">Daniel Tarullo Speaks<br/></a></span>
<a href="byshoweventfull.asp?fid=476382&cust=b-us&year=2016&lid=0&containerId=eco-iframe-container&prev=/byweek.asp#top">Daniel Tarullo Speaks<br/></a>
<br/>

The html is actually broken, so you need lxml or html5lib to parse it. Then, to get what you want, you need to find the spans with the econoarticles class and do a little extra work to get the time:

url_to_scrape = 'http://b-us.econoday.com/byweek.asp?containerId=eco-iframe-container'
r = requests.get(url_to_scrape)
html = r.content
soup = BeautifulSoup(html, "lxml")
event = soup.find('td', {'id': "highlight_today"})
for span in event.select("span.econoarticles"):
    speaker, time, a = span.text, span.find_next_sibling(text=True), span.a["href"]
    print(speaker, time, a)

Which, if we run it, gives you:

In [2]: import requests
   ...: from bs4 import BeautifulSoup
   ...: url_to_scrape = 'http://b-us.econoday.com/byweek.asp?containerId=eco-iframe-container'
   ...: r = requests.get(url_to_scrape)
   ...: html = r.content
   ...: soup = BeautifulSoup(html, "lxml")
   ...: event = soup.find('td', {'id': "highlight_today"})
   ...: for span in event.select("span.econoarticles"):
   ...:     speaker, time, a = span.text, span.find_next_sibling(text=True), span.a["href"]
   ...:     print(speaker, time, a)
   ...:
Daniel Tarullo Speaks 10:15 AM ET byshoweventfull.asp?fid=476382&cust=b-us&year=2016&lid=0&containerId=eco-iframe-container&prev=/byweek.asp#top
John Williams Speaks 2:30 PM ET byshoweventfull.asp?fid=476390&cust=b-us&year=2016&lid=0&containerId=eco-iframe-container&prev=/byweek.asp#top

In [3]:

If you also want the Market Focus entry and its url, just add a matching lookup for the span.econoitems class.
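Putting the pieces together, here is a sketch that pulls both the econoitems link (Market Focus) and the econoarticles entries. The class names come from the output shown above; the network fetch is replaced with an inline snippet mirroring that output, so the example is self-contained (the live page's broken markup would still need lxml or html5lib).

```python
from bs4 import BeautifulSoup

# Inline stand-in for r.content, shaped like the td printed above.
html = '''<td id="highlight_today">
<div class="econoitems"><br/><span class="econoitems"><a href="byshoweventfull.asp?fid=476407">Market Focus <span class="econo-item-arrow">»</span></a></span><br/></div>
<span class="econoarticles"><a href="byshoweventfull.asp?fid=476382">Daniel Tarullo Speaks<br/></a></span>10:15 AM ET<br/>
</td>'''

# html.parser is enough for this well-formed snippet; use "lxml" on the live page.
soup = BeautifulSoup(html, "html.parser")
event = soup.find('td', {'id': "highlight_today"})

# The Market Focus link lives in a span with the econoitems class.
for span in event.select("span.econoitems"):
    print(span.a.text.strip(), span.a["href"])

# Each speaker span is followed by a bare text node holding the time.
for span in event.select("span.econoarticles"):
    speaker = span.text.strip()
    when = str(span.find_next_sibling(text=True)).strip()
    print(speaker, when, span.a["href"])
```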