无法使用beautifulsoup提取表数据

时间:2016-06-05 09:10:26

标签: python web-scraping beautifulsoup

我一直试图从这个表中提取数据并且遇到很多困难。任何提示/帮助表示赞赏 http://finance.yahoo.com/options/lists/?mod_id=mediaquotesoptions&tab=tab2&rcnt=50

我的代码如下

import urllib
import urllib.request

from bs4 import BeautifulSoup
import pandas as pd
def make_soup(url):
    thepage=urllib.request.urlopen(url)
    soupdata=BeautifulSoup(thepage, "html.parser")
    return soupdata

soup=make_soup('http://finance.yahoo.com/options/lists/?mod_id=mediaquotesoptions&tab=tab2&rcnt=50')
soup.findAll('td')

由于

2 个答案:

答案 0 :(得分:1)

内容位于脚本标记内部,因此在您获取的源代码中没有表,如果您查看实际源代码,而不是开发人员工具或firebug中的脚本标记,则会动态创建该表:

<script>YMedia.use("media-rmp", "media-viewport-loader", function(Y){Y.Global.Media.ViewportLoader.addContainers([{selector: "#mediaquotesoptions_container",callback: 

您将看到所有数据都在其中。

您可以尝试解析包含数据的脚本标记,或者更可靠的方法是使用selenium,您可以将其与PhantomJS结合使用以进行无头浏览:

from selenium import webdriver
dr = webdriver.PhantomJS()
dr.get("http://finance.yahoo.com/options/lists/?mod_id=mediaquotesoptions&tab=tab2&rcnt=50")
table = dr.find_element_by_css_selector("div.yfi-panel-container.tabview-content")
headers = ",".join([th.text for th in table.find_elements_by_xpath(".//th/span")])
print(headers)
for row in table.find_elements_by_xpath(".//tr"):
    print(",".join([td.text.replace("\n","") for td in row.find_elements_by_xpath(".//td")]))

如果我们运行会给出:

In [22]: print(headers)
TICKER,OPTION SYMBOL,OPTION,CLOSE,CHANGE,VOLUME,VOLUMEN CHANGE,OPEN INTEREST,OPEN INTEREST CHANGE

In [23]: for row in table.find_elements_by_xpath(".//tr"):
   ....:         print(",".join([td.text.replace("\n","") for td in row.find_elements_by_xpath(".//td")]))
   ....:     

XLF,XLF160819P00023000,AUG 16 23 Put,0.60,+0.13(+27.70%),184,769,+184,719(+369438.00%),29,504,+24(+0.80%)
EMC,EMC160715P00023000,JUL 16 23 Put,0.25,-0.10(-28.60%),13,003,+12,997(+216616.70%),256,844,00.00%
KATE,KATE161021C00021000,OCT 16 21 Call,2.35,-0.30(-11.30%),7,012,+7,006(+116766.70%),334,+6(+22.20%)
EFA,EFA160617P00057000,JUN 16 57 Put,0.16,-0.11(-40.70%),10,565,+10,555(+105550.00%),197,134,+50.00%
CIEN,CIEN161021C00023000,OCT 16 23 Call,1.21,+0.38(+45.80%),39,838,+39,799(+102048.70%),884,+13(+17.30%)
GLD,GLD160715P00109000,JUL 16 109 Put,0.17,-0.22(-56.40%),6,484,+6,477(+92528.60%),14,014,+5(+0.40%)
XME,XME160617P00020500,JUN 16 20.5 Put,0.13,-0.10(-43.50%),10,054,+10,041(+77238.50%),102,974,00.00%
ATVI,ATVI160617C00040000,JUN 16 40 Call,0.19,-0.19(-50.00%),20,429,+20,401(+72860.70%),75,224,+7(+0.10%)
GDX,GDX160617C00021500,JUN 16 21.5 Call,3.95,+2.30(+139.40%),6,046,+6,037(+67077.80%),142,944,+10.00%
ALXN,ALXN160617C00160000,JUN 16 160 Call,2.20,-1.00(-31.30%),3,728,+3,722(+62033.30%),4,734,-5(-1.00%)
XOP,XOP160715P00034000,JUL 16 34 Put,1.09,+0.11(+11.20%),4,081,+4,074(+58200.00%),83,414,+6(+0.10%)
ABX,ABX160819C00020000,AUG 16 20 Call,1.49,+0.87(+140.30%),3,047,+3,041(+50683.30%),974,+6(+6.60%)
ETE,ETE160715C00016000,JUL 16 16 Call,0.75,-0.10(-11.80%),10,137,+10,116(+48171.40%),58,684,+20(+0.30%)
SLV,SLV160916C00018000,SEP 16 18 Call,0.28,+0.07(+33.30%),4,529,+4,519(+45190.00%),16,014,00.00%
AAL,AAL160617P00031000,JUN 16 31 Put,0.88,+0.37(+72.50%),13,147,+13,117(+43723.30%),10,204,-4(-0.40%)
FXE,FXE160715C00111000,JUL 16 111 Call,1.25,+0.73(+140.40%),3,058,+3,051(+43585.70%),9,914,-1(-0.10%)
XLU,XLU160617C00050000,JUN 16 50 Call,0.54,+0.29(+116.00%),15,493,+15,456(+41773.00%),185,764,-60.00%
GM,GM160715P00027000,JUL 16 27 Put,0.30,+0.08(+36.40%),6,675,+6,657(+36983.30%),1,534,+18(+13.30%)
XBI,XBI160617P00057500,JUN 16 57.5 Put,1.30,+0.55(+73.30%),2,181,+2,175(+36250.00%),16,234,+1,265(+353.40%)
HAL,HAL170120P00035000,JAN 17 35 Put,1.40,-0.14(-9.10%),2,136,+2,130(+35500.00%),78,844,00.00%
SLCA,SLCA160617C00031000,JUN 16 31 Call,1.50,+0.85(+130.80%),7,057,+7,037(+35185.00%),1,094,+20(+22.50%)
DD,DD160617C00070000,JUN 16 70 Call,0.56,+0.13(+30.20%),3,838,+3,825(+29423.10%),44,324,-3(-0.10%)
FXI,FXI160715C00034000,JUL 16 34 Call,0.82,-0.01(-1.20%),20,356,+20,279(+26336.40%),135,884,+42(+0.30%)
ALK,ALK160715P00060000,JUL 16 60 Put,0.95,+0.20(+26.70%),1,553,+1,547(+25783.30%),4,564,00.00%
SRPT,SRPT160617C00026000,JUN 16 26 Call,1.95,0.000.00%,3,073,+3,061(+25508.30%),1,354,-5(-3.60%)
NUE,NUE160617P00048000,JUN 16 48 Put,0.26,-0.24(-48.00%),6,040,+6,016(+25066.70%),9,604,-20(-2.00%)
VLO,VLO160715P00050000,JUL 16 50 Put,0.58,+0.08(+16.00%),6,023,+5,999(+24995.80%),4,124,+12(+3.00%)
EA,EA170120P00062500,JAN 17 62.5 Put,2.57,+0.07(+2.80%),1,505,+1,499(+24983.30%),2,344,+6(+2.60%)
JOY,JOY160715C00025000,JUL 16 25 Call,0.33,+0.01(+3.10%),2,436,+2,426(+24260.00%),1,664,+3(+1.80%)
DXJ,DXJ160715C00044780,JUL 16 44.78 Call,0.38,-0.16(-29.60%),6,676,+6,647(+22920.70%),347,114,+28(+0.10%)
AVGO,AVGO170120C00170000,JAN 17 170 Call,13.80,+3.00(+27.80%),1,594,+1,587(+22671.40%),41,784,+5(+0.10%)
LVS,LVS160715P00046000,JUL 16 46 Put,1.72,-0.03(-1.70%),3,392,+3,377(+22513.30%),12,214,+10(+0.80%)
T,T160916C00039000,SEP 16 39 Call,1.15,+0.25(+27.80%),4,179,+4,160(+21894.70%),1,174,+18(+18.20%)
EWZ,EWZ160916P00027000,SEP 16 27 Put,2.20,-0.46(-17.30%),2,463,+2,451(+20425.00%),249,804,00.00%
CF,CF160617P00030000,JUN 16 30 Put,1.85,-0.53(-22.30%),2,003,+1,993(+19930.00%),62,384,00.00%
NEM,NEM160715C00035000,JUL 16 35 Call,2.42,+1.38(+132.70%),1,286,+1,279(+18271.40%),24,514,00.00%
EBAY,EBAY160617P00024500,JUN 16 24.5 Put,0.72,+0.14(+24.10%),7,010,+6,971(+17874.40%),624,+29(+87.90%)
WWAV,WWAV160617C00047500,JUN 16 47.5 Call,0.45,+0.05(+12.50%),1,760,+1,750(+17500.00%),19,654,+10(+0.50%)
UUP,UUP160916C00025000,SEP 16 25 Call,0.19,-0.14(-42.40%),3,609,+3,588(+17085.70%),160,754,+13(+0.10%)
AEM,AEM160617P00047000,JUN 16 47 Put,0.53,-2.17(-80.40%),3,426,+3,406(+17030.00%),3,364,00.00%
IWM,IWM160617C00120000,JUN 16 120 Call,0.10,-0.11(-52.40%),15,260,+15,170(+16855.60%),624,424,+480(+0.80%)
PSX,PSX170120P00065000,JAN 17 65 Put,2.10,+0.15(+7.70%),1,339,+1,331(+16637.50%),10,084,+4(+0.40%)
LLY,LLY160715P00072500,JUL 16 72.5 Put,1.47,+0.16(+12.20%),8,799,+8,745(+16194.40%),103,054,+18(+0.20%)
GDXJ,GDXJ170120C00045000,JAN 17 45 Call,3.40,+1.35(+65.90%),1,607,+1,597(+15970.00%),40,244,-10(-0.20%)
YHOO,YHOO170120P00028000,JAN 17 28 Put,0.86,+0.04(+4.90%),3,528,+3,505(+15239.10%),181,584,+15(+0.10%)
SO,SO160617C00050000,JUN 16 50 Call,0.65,+0.34(+109.70%),1,486,+1,476(+14760.00%),46,594,-8(-0.20%)
SH,SH160819C00021000,AUG 16 21 Call,0.30,0.000.00%,2,228,+2,213(+14753.30%),13,964,+15(+1.10%)
GS,GS160617P00145000,JUN 16 145 Put,0.32,+0.08(+33.30%),7,086,+7,038(+14662.50%),24,794,-43(-1.70%)
JD,JD160715P00021000,JUL 16 21 Put,0.50,+0.10(+25.00%),1,916,+1,903(+14638.50%),1,074,+9(+9.20%)
DAL,DAL160617C00042000,JUN 16 42 Call,0.73,-0.52(-41.60%),17,138,+17,021(+14547.90%),191,164,-80.00%

答案 1 :(得分:0)

为了完成这类任务,总是需要进行一些html代码探索。我使用Chrome的元素检查器。要下载源代码,您有很多选择。最简单的方法是使用常规浏览器手动下载(如果你不需要很多htmls,我建议你这样做)。如果你想从Python下载它我推荐使用Selenium,默认情况下等待直到执行网页的脚本。

import bs4
import pandas as pd

def cleanList(lst):
    #This function removes breakline elements from list lst
    return [i for i in lst if i !='\n']
url='http://finance.yahoo.com/options/lists/?mod_id=mediaquotesoptions&tab=tab2&rcnt=50'
#driver_path is the directory where you have your driver
browser=webdriver.Chrome(driver_path)       
browser.get(url)
soup = bs4.BeautifulSoup(browser.page_source, 'lxml')
#The target table is the third returned by find_all('table')
rows=soup.find_all('table')[3].find_all('tr')
#Fields are in the first row
fields=cleanList(rows[0].find_all(text=True))
#We will store each row as a dict in a list
elements=[]
for row in rows[1:]:
    values=cleanList(row.find_all(text=True))
    #Some elements that belong to the same cell are separated so the code gets a bit uglier
    values[4]='|'.join(values[4:6])
    del values[5]
    values[6]='|'.join(values[7:9])
    del values[7]
    values[8]='|'.join(values[8:10])
    del values[9]
    element=dict((f,str(v)) for f,v in zip(fields,values))
    elements.append(element)
#concat results
df=pd.DataFrame(elements)

这是生成的DataFrame:table_yahoo.csv