Can't scrape a table with Python

Time: 2016-02-03 16:52:47

Tags: python web-scraping

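I'm trying to get the ownership table from cnbc.com for a university project. I have tried different solutions, but it looks like the table is not contained in the HTML and is only retrieved when I open the URL in a web browser. I don't know how to solve this.

Any help?

Here is my code: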

from bs4 import BeautifulSoup
import requests
import urllib.request


url_to_scrape = 'http://data.cnbc.com/quotes/YHOO/tab/8'
response = requests.get(url_to_scrape).content
soup = BeautifulSoup(urllib.request.urlopen(url_to_scrape).read(), 'lxml')

for row in soup.find_all('table', {'class': 'shareholders dotsBelow'} ):
    print(row.string)

I made some changes; here is the code:

from bs4 import BeautifulSoup
import requests

url = 'http://apps.cnbc.com/view.asp?country=US&uid=stocks/ownership&symbol=YHOO.O'

response = requests.get(url).content
soup = BeautifulSoup(response, 'lxml')

for tbody in soup.find_all('tbody', id="tBody_institutions"):
    tds = tbody.find_all('td')
    print(tds[0].text, tds[1].text, tds[2].text)

But I only get the first row of the table, which is this one:

Filo (David)  70.7M $2,351,860,831

How can I iterate over the whole table?

3 answers:

Answer 0 (score: 1)

Using Developer Tools in Chrome, I found that your page loads the file

http://apps.cnbc.com/view.asp?country=US&uid=stocks/ownership&symbol=YHOO.O

which contains the expected data:

from bs4 import BeautifulSoup
import requests

url = 'http://apps.cnbc.com/view.asp?country=US&uid=stocks/ownership&symbol=YHOO.O'

response = requests.get(url).content
soup = BeautifulSoup(response, 'lxml')

for row in soup.find_all('table', {'class': 'shareholders dotsBelow'} ):
    print(row.text)

Result (it returns many blank lines because the HTML contains many "\n" characters):

Name






Shares Held





Position Value






Percentage ofTotal Holdings
since 2/3/16







% Ownedof SharesOutstanding





TurnoverRating







Filo (David)
 70.7M
$2,351,860,831
+9%
7.5%
Low


The Vanguard ...
 49.2M
$1,422,524,414
+6%
5.2%
Low


State Street ...
 34.4M
$993,071,914
+5%
3.6%
Low


BlackRock ...
 32.3M
$935,173,655
+4%
3.4%
Low


Fidelity ...
 24.7M
$714,307,904
+3%
2.6%
Low


Goldman Sachs & ...
 18.6M
$538,561,672
+2%
2.0%
Low


Mason Capital ...
 16.4M
$472,832,995
+2%
1.7%
High


Capital Research ...
 12.6M
$365,108,090
+2%
1.3%
Low


TIAA-CREF
 10.9M
$315,255,311
+1%
1.2%
Low


T. Rowe Price ...
 10.8M
$310,803,286
+1%
1.1%
Low
















Name






Shares Held





Position Value






Percentage ofTotal Holdings
since 2/3/16







% Ownedof SharesOutstanding





InvestmentStyle







Vanguard Total ...
 15.6M
$518,104,623
+2%
1.7%
Index


Vanguard 500 ...
 10.6M
$352,795,106
+1%
1.1%
Index


Vanguard ...
 9.4M
$312,902,098
+1%
1.0%
Index


SPDR S&P 500 ETF
 8.8M
$292,985,112
+1%
0.9%
Index


PowerShares QQQ ...
 7.6M
$252,776,000
+1%
0.8%
Index


Statens ...
 6.7M
$338,173,390
+1%
0.7%
Core Value


First Trust DJ ...
 5.6M
$186,778,215
+1%
0.6%
Index


Janus Twenty Fund
 5.2M
$150,966,054
+1%
0.6%
Growth


CREF Stock Account
 5.0M
$195,517,452
+1%
0.5%
Core Growth


Vanguard Growth ...
 4.8M
$159,879,157
+1%
0.5%
Index
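
The blank lines in the output above come from whitespace-only text nodes in the table. A minimal sketch of one way to drop them, using Beautiful Soup's `stripped_strings`. The `SAMPLE_HTML` snippet and the `table_lines` helper are hypothetical stand-ins so the example runs without a network request, and `html.parser` is used only to avoid depending on lxml; the real page markup may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the fetched page; the real markup may differ.
SAMPLE_HTML = """
<table class="shareholders dotsBelow">
  <tbody id="tBody_institutions">
    <tr>
      <td>Filo (David)</td>
      <td> 70.7M</td>
      <td>$2,351,860,831</td>
    </tr>
    <tr>
      <td>The Vanguard ...</td>
      <td> 49.2M</td>
      <td>$1,422,524,414</td>
    </tr>
  </tbody>
</table>
"""


def table_lines(html):
    """Return the non-empty text fragments of every matching table."""
    soup = BeautifulSoup(html, 'html.parser')
    lines = []
    for table in soup.find_all('table', {'class': 'shareholders dotsBelow'}):
        # stripped_strings yields each text node with surrounding
        # whitespace removed and skips whitespace-only nodes.
        lines.extend(table.stripped_strings)
    return lines


if __name__ == '__main__':
    for line in table_lines(SAMPLE_HTML):
        print(line)
```

With the real page, the same loop body could be applied to the `soup` built from `requests.get(url).content`.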

EDIT: a better version

from bs4 import BeautifulSoup
import requests

url = 'http://apps.cnbc.com/view.asp?country=US&uid=stocks/ownership&symbol=YHOO.O'

response = requests.get(url).content
soup = BeautifulSoup(response, 'lxml')

for tbody in soup.find_all('tbody', id="tBody_institutions"):
    trs = tbody.find_all('tr')
    for tr in trs:
        tds = tr.find_all('td')
        print(tds[0].text, tds[1].text, tds[2].text)

And the result:

Filo (David)  70.7M $2,351,860,831
The Vanguard ...  49.2M $1,422,524,414
State Street ...  34.4M $993,071,914
BlackRock ...  32.3M $935,173,655
Fidelity ...  24.7M $714,307,904
Goldman Sachs & ...  18.6M $538,561,672
Mason Capital ...  16.4M $472,832,995
Capital Research ...  12.6M $365,108,090
TIAA-CREF  10.9M $315,255,311
T. Rowe Price ...  10.8M $310,803,286
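
If the rows are needed as data rather than printed text, the per-row loop above can collect them instead. The following is only a sketch under assumptions: `SAMPLE_HTML` and `ownership_rows` are hypothetical, the column names are taken from the printed header earlier in this answer, and `html.parser` is used so the example does not require lxml.

```python
from bs4 import BeautifulSoup

# Column names as they appear in the printed header above.
COLUMNS = ['Name', 'Shares Held', 'Position Value']

# Hypothetical stand-in for the fetched page; the real markup may differ.
SAMPLE_HTML = """
<table>
  <tbody id="tBody_institutions">
    <tr><td>Filo (David)</td><td> 70.7M</td><td>$2,351,860,831</td></tr>
    <tr><td>The Vanguard ...</td><td> 49.2M</td><td>$1,422,524,414</td></tr>
    <tr><td>incomplete row</td></tr>
  </tbody>
</table>
"""


def ownership_rows(html, tbody_id='tBody_institutions'):
    """Return one dict per complete table row, keyed by COLUMNS."""
    soup = BeautifulSoup(html, 'html.parser')
    rows = []
    for tbody in soup.find_all('tbody', id=tbody_id):
        for tr in tbody.find_all('tr'):
            cells = [td.get_text(strip=True) for td in tr.find_all('td')]
            # Skip rows with too few cells instead of raising IndexError.
            if len(cells) < len(COLUMNS):
                continue
            # zip stops at the shorter sequence, so extra cells are ignored.
            rows.append(dict(zip(COLUMNS, cells)))
    return rows


if __name__ == '__main__':
    for row in ownership_rows(SAMPLE_HTML):
        print(row)
```

Checking the cell count before indexing keeps the loop from failing on spacer or header rows, should the real table contain any.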

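Answer 1 (score: 0)

I'm not sure why you are using requests. Also, the page you reference has no element with the class "shareholders".

If you fix those two problems, the following code prints all the tables in the HTML: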

from bs4 import BeautifulSoup
import urllib.request

url_to_scrape = 'http://data.cnbc.com/quotes/YHOO/tab/8'
soup = BeautifulSoup(urllib.request.urlopen(url_to_scrape).read(), 'lxml')

for row in soup.find_all('table'):
    print(row)
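
Before filtering on a class, it can help to see which classes the fetched tables actually carry. A small sketch, assuming nothing about the real page: `SAMPLE_HTML` and `table_classes` are hypothetical, and `html.parser` is used only so the example does not depend on lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for a fetched page; the real markup may differ.
SAMPLE_HTML = """
<table class="shareholders dotsBelow"><tr><td>a</td></tr></table>
<table><tr><td>b</td></tr></table>
"""


def table_classes(html):
    """Return the list of class names for each <table> in the HTML."""
    soup = BeautifulSoup(html, 'html.parser')
    # class is a multi-valued attribute, so Beautiful Soup returns it as
    # a list; tables without a class attribute get an empty list.
    return [table.get('class', []) for table in soup.find_all('table')]


if __name__ == '__main__':
    print(table_classes(SAMPLE_HTML))
```

If the class being searched for never shows up in that list, the table is probably loaded from a different URL than the one being fetched.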

Answer 2 (score: 0)

If you want to use requests, don't mix it with urllib. Change your code to the following, because the page has no class 'shareholders dotsBelow':

from bs4 import BeautifulSoup
import requests


url_to_scrape = 'http://data.cnbc.com/quotes/YHOO/tab/8'
response = requests.get(url_to_scrape).content
soup = BeautifulSoup(response, 'lxml')

for row in soup.find_all('table'):
    print(row)

EDIT:

Your changed code can work with a list of names:

from bs4 import BeautifulSoup
import requests

url = 'http://apps.cnbc.com/view.asp?country=US&uid=stocks/ownership&symbol=YHOO.O'

response = requests.get(url).content
soup = BeautifulSoup(response, 'lxml')

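# Names that mark the first cell of each row we want to print.
names = ['Filo (David)', 'The Vanguard ...', 'State Street ...', 'T. Rowe Price ...', 'BlackRock ...', 'Fidelity ...', 'Goldman Sachs & ...', 'Mason Capital ...', 'Capital Research ...', 'TIAA-CREF']

for tbody in soup.find_all('tbody', id="tBody_institutions"):
    # find_all('td') returns every cell of the table body as one flat list.
    tds = tbody.find_all('td')
    for i, td in enumerate(tds):
        # A matching name is followed by its shares and position value cells.
        if td.text in names:
            print(td.text, tds[i + 1].text, tds[i + 2].text)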
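
Because `tbody.find_all('td')` returns one flat list of cells, an alternative to hardcoding the names is to slice that list by the number of columns. This is only a sketch: `chunk_rows` is a hypothetical helper, the width of six is an assumption read off the printed output earlier in this thread, and it only holds if every row has the same number of cells. The `CELLS` list stands in for `[td.text for td in tds]`.

```python
def chunk_rows(cells, width):
    """Split a flat list of cell texts into rows of `width` cells each."""
    if width <= 0:
        raise ValueError('width must be positive')
    return [cells[i:i + width] for i in range(0, len(cells), width)]


# Stand-in for [td.text for td in tds], using values from the output
# shown earlier in this thread.
CELLS = [
    'Filo (David)', ' 70.7M', '$2,351,860,831', '+9%', '7.5%', 'Low',
    'The Vanguard ...', ' 49.2M', '$1,422,524,414', '+6%', '5.2%', 'Low',
]

if __name__ == '__main__':
    for row in chunk_rows(CELLS, 6):
        # Print the same three columns as the code above.
        print(row[0], row[1], row[2])
```

Iterating over the `tr` elements is usually the more robust choice, since it does not rely on a fixed column count.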