I'm trying to scrape the ownership table from cnbc.com for a university project. I've tried different solutions, but it looks like the table is not contained in the HTML; it is only retrieved when I open the URL in a web browser. I don't know how to work around this.
Any help?
Here is my code:
from bs4 import BeautifulSoup
import requests
import urllib.request

url_to_scrape = 'http://data.cnbc.com/quotes/YHOO/tab/8'
response = requests.get(url_to_scrape).content
soup = BeautifulSoup(urllib.request.urlopen(url_to_scrape).read(), 'lxml')
for row in soup.find_all('table', {'class': 'shareholders dotsBelow'}):
    print(row.string)
I made some changes; here is the code:
from bs4 import BeautifulSoup
import requests

url = 'http://apps.cnbc.com/view.asp?country=US&uid=stocks/ownership&symbol=YHOO.O'
response = requests.get(url).content
soup = BeautifulSoup(response, 'lxml')
for tbody in soup.find_all('tbody', id="tBody_institutions"):
    tds = tbody.find_all('td')
    print(tds[0].text, tds[1].text, tds[2].text)
But I only get the first row of the table, namely:
Filo (David) 70.7M $2,351,860,831
How can I iterate over the whole table?
Answer 0 (score: 1)
Using the Developer Tools in Chrome, I found that your page loads the file
http://apps.cnbc.com/view.asp?country=US&uid=stocks/ownership&symbol=YHOO.O
which has the expected data:
from bs4 import BeautifulSoup
import requests

url = 'http://apps.cnbc.com/view.asp?country=US&uid=stocks/ownership&symbol=YHOO.O'
response = requests.get(url).content
soup = BeautifulSoup(response, 'lxml')
for row in soup.find_all('table', {'class': 'shareholders dotsBelow'}):
    print(row.text)
Result (the raw output also contains many blank lines, stripped here, because the HTML is full of "\n"):
Name                  Shares Held  Position Value  Change since 2/3/16  % Owned of Shares Outstanding  Turnover Rating
Filo (David)          70.7M        $2,351,860,831  +9%                  7.5%                           Low
The Vanguard ...      49.2M        $1,422,524,414  +6%                  5.2%                           Low
State Street ...      34.4M        $993,071,914    +5%                  3.6%                           Low
BlackRock ...         32.3M        $935,173,655    +4%                  3.4%                           Low
Fidelity ...          24.7M        $714,307,904    +3%                  2.6%                           Low
Goldman Sachs & ...   18.6M        $538,561,672    +2%                  2.0%                           Low
Mason Capital ...     16.4M        $472,832,995    +2%                  1.7%                           High
Capital Research ...  12.6M        $365,108,090    +2%                  1.3%                           Low
TIAA-CREF             10.9M        $315,255,311    +1%                  1.2%                           Low
T. Rowe Price ...     10.8M        $310,803,286    +1%                  1.1%                           Low

Name                  Shares Held  Position Value  Change since 2/3/16  % Owned of Shares Outstanding  Investment Style
Vanguard Total ...    15.6M        $518,104,623    +2%                  1.7%                           Index
Vanguard 500 ...      10.6M        $352,795,106    +1%                  1.1%                           Index
Vanguard ...          9.4M         $312,902,098    +1%                  1.0%                           Index
SPDR S&P 500 ETF      8.8M         $292,985,112    +1%                  0.9%                           Index
PowerShares QQQ ...   7.6M         $252,776,000    +1%                  0.8%                           Index
Statens ...           6.7M         $338,173,390    +1%                  0.7%                           Core Value
First Trust DJ ...    5.6M         $186,778,215    +1%                  0.6%                           Index
Janus Twenty Fund     5.2M         $150,966,054    +1%                  0.6%                           Growth
CREF Stock Account    5.0M         $195,517,452    +1%                  0.5%                           Core Growth
Vanguard Growth ...   4.8M         $159,879,157    +1%                  0.5%                           Index
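The blank lines in the raw output come from the literal newlines inside the table markup; `row.text` keeps them all. `get_text()` with a separator and `strip=True` avoids the problem. A minimal sketch on inline sample HTML (the class name is taken from the question; the live page may have changed):

```python
from bs4 import BeautifulSoup

# Inline sample mimicking the structure of the CNBC shareholders table.
html = """
<table class="shareholders dotsBelow">
  <tr><td>Filo (David)</td><td>70.7M</td><td>$2,351,860,831</td></tr>
  <tr><td>The Vanguard ...</td><td>49.2M</td><td>$1,422,524,414</td></tr>
</table>
"""

soup = BeautifulSoup(html, 'html.parser')
for table in soup.find_all('table', {'class': 'shareholders dotsBelow'}):
    for tr in table.find_all('tr'):
        # strip=True drops the surrounding whitespace/newlines per cell,
        # and the separator joins the cells into one clean line.
        print(tr.get_text(' | ', strip=True))
```

This prints one tidy line per row instead of one line per cell interleaved with blanks.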
EDIT: a better version
from bs4 import BeautifulSoup
import requests

url = 'http://apps.cnbc.com/view.asp?country=US&uid=stocks/ownership&symbol=YHOO.O'
response = requests.get(url).content
soup = BeautifulSoup(response, 'lxml')
for tbody in soup.find_all('tbody', id="tBody_institutions"):
    trs = tbody.find_all('tr')
    for tr in trs:
        tds = tr.find_all('td')
        print(tds[0].text, tds[1].text, tds[2].text)
And the result:
Filo (David) 70.7M $2,351,860,831
The Vanguard ... 49.2M $1,422,524,414
State Street ... 34.4M $993,071,914
BlackRock ... 32.3M $935,173,655
Fidelity ... 24.7M $714,307,904
Goldman Sachs & ... 18.6M $538,561,672
Mason Capital ... 16.4M $472,832,995
Capital Research ... 12.6M $365,108,090
TIAA-CREF 10.9M $315,255,311
T. Rowe Price ... 10.8M $310,803,286
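A natural extension of the row loop above is to capture all six columns and write them out as CSV. The sketch below runs on an inline sample with the same tbody id; the column order (name, shares, value, change, % owned, turnover) is assumed from the output shown earlier:

```python
import csv
import io

from bs4 import BeautifulSoup

# Inline sample with the tbody id the answer targets; the live page may differ.
html = """
<table><tbody id="tBody_institutions">
  <tr><td>Filo (David)</td><td>70.7M</td><td>$2,351,860,831</td>
      <td>+9%</td><td>7.5%</td><td>Low</td></tr>
  <tr><td>The Vanguard ...</td><td>49.2M</td><td>$1,422,524,414</td>
      <td>+6%</td><td>5.2%</td><td>Low</td></tr>
</tbody></table>
"""

soup = BeautifulSoup(html, 'html.parser')
out = io.StringIO()  # swap in open('holders.csv', 'w', newline='') to write a file
writer = csv.writer(out)
writer.writerow(['Name', 'Shares Held', 'Position Value',
                 'Change', '% Owned', 'Turnover'])
for tbody in soup.find_all('tbody', id='tBody_institutions'):
    for tr in tbody.find_all('tr'):
        writer.writerow([td.get_text(strip=True) for td in tr.find_all('td')])

print(out.getvalue())
```

The csv module also takes care of quoting the dollar values, which themselves contain commas.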
Answer 1 (score: 0)
I'm not sure why you are using requests. Also, the page you reference has no element with the class "shareholders". With those two issues removed, the following code prints all the tables in the HTML:
from bs4 import BeautifulSoup
import urllib.request

url_to_scrape = 'http://data.cnbc.com/quotes/YHOO/tab/8'
soup = BeautifulSoup(urllib.request.urlopen(url_to_scrape).read(), 'lxml')
for row in soup.find_all('table'):
    print(row)
Answer 2 (score: 0)
If you want to use requests, don't mix it with urllib, and change your code to something like the following, since there is no class 'shareholders dotsBelow':
from bs4 import BeautifulSoup
import requests

url_to_scrape = 'http://data.cnbc.com/quotes/YHOO/tab/8'
response = requests.get(url_to_scrape).content
soup = BeautifulSoup(response, 'lxml')
for row in soup.find_all('table'):
    print(row)
EDIT:
Your changed code works with a list of names:
from bs4 import BeautifulSoup
import requests

url = 'http://apps.cnbc.com/view.asp?country=US&uid=stocks/ownership&symbol=YHOO.O'
response = requests.get(url).content
soup = BeautifulSoup(response, 'lxml')
for tbody in soup.find_all('tbody', id="tBody_institutions"):
    tds = tbody.find_all('td')
    for zahl, stuff in enumerate(tds):
        if tds[zahl].text in ['Filo (David)', 'The Vanguard ...', 'State Street ...', 'T. Rowe Price ...', 'BlackRock ...', 'Fidelity ...', 'Goldman Sachs & ...', 'Mason Capital ...', 'Capital Research ...', 'TIAA-CREF']:
            print(tds[zahl].text, tds[zahl + 1].text, tds[zahl + 2].text)
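The hard-coded name list breaks as soon as a holder changes. A more robust sketch (on inline sample HTML, since the live page may have changed) keys each row by its first cell instead, so every holder is picked up automatically:

```python
from bs4 import BeautifulSoup

# Inline sample; the tbody id matches the one used in the answers above.
html = """
<table><tbody id="tBody_institutions">
  <tr><td>Filo (David)</td><td>70.7M</td><td>$2,351,860,831</td></tr>
  <tr><td>TIAA-CREF</td><td>10.9M</td><td>$315,255,311</td></tr>
</tbody></table>
"""

soup = BeautifulSoup(html, 'html.parser')

# Map holder name -> remaining cells, one dict entry per table row.
holdings = {}
for tr in soup.find('tbody', id='tBody_institutions').find_all('tr'):
    cells = [td.get_text(strip=True) for td in tr.find_all('td')]
    holdings[cells[0]] = tuple(cells[1:])

print(holdings['Filo (David)'])  # ('70.7M', '$2,351,860,831')
```

Iterating per `tr` (as in the accepted answer's edit) rather than per `td` also removes the need for the `zahl + 1`, `zahl + 2` index arithmetic.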