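I am trying to extract a table from a website using the code below (courtesy: Padraic). When I run it, it keeps executing and never finishes or returns anything until I terminate it.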
from selenium import webdriver
import pandas as pd
dr = webdriver.PhantomJS(r'C:\Users\Admin\Anaconda3\phantomjs-2.1.1-windows\bin\phantomjs.exe')
elements=[]
url='http://www.moneycontrol.com/stocks/fno/marketstats/options/active_calls/index.php'
dr.get(url)
table = dr.find_element_by_css_selector("div.MT15")
for row in table.find_elements_by_xpath(".//tr"):
    elem = ":".join([td.text.replace("\n", "") for td in
                     row.find_elements_by_xpath(".//td")])
    element = elem.split(":")
    elements.append(element)
print(elements)
Answer 0 (score: 2)
If you add print(row) inside the loop, you can see output like the following:
<selenium.webdriver.remote.webelement.WebElement (session="10676710-2e8c-11e6-b13a-473272d23fd8", element=":wdc:1465509003017")>
<selenium.webdriver.remote.webelement.WebElement (session="10676710-2e8c-11e6-b13a-473272d23fd8", element=":wdc:1465509003018")>
<selenium.webdriver.remote.webelement.WebElement (session="10676710-2e8c-11e6-b13a-473272d23fd8", element=":wdc:1465509003019")>
<selenium.webdriver.remote.webelement.WebElement (session="10676710-2e8c-11e6-b13a-473272d23fd8", element=":wdc:1465509003020")>
<selenium.webdriver.remote.webelement.WebElement (session="10676710-2e8c-11e6-b13a-473272d23fd8", element=":wdc:1465509003021")>
<selenium.webdriver.remote.webelement.WebElement (session="10676710-2e8c-11e6-b13a-473272d23fd8", element=":wdc:1465509003022")>
<selenium.webdriver.remote.webelement.WebElement (session="10676710-2e8c-11e6-b13a-473272d23fd8", element=":wdc:1465509003023")>
<selenium.webdriver.remote.webelement.WebElement (session="10676710-2e8c-11e6-b13a-473272d23fd8", element=":wdc:1465509003024")>
<selenium.webdriver.remote.webelement.WebElement (session="10676710-2e8c-11e6-b13a-473272d23fd8", element=":wdc:1465509003025")>
<selenium.webdriver.remote.webelement.WebElement (session="10676710-2e8c-11e6-b13a-473272d23fd8", element=":wdc:1465509003026")>
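There are ~1600 tr tags in the source, most of them inside the div you are searching, which is why it appears to loop for a long time. The code is working; it just takes a while to finish.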
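Part of the cost is likely that every row.find_elements_by_xpath call and every td.text lookup is a separate command sent to the browser driver, so ~1600 rows means many thousands of round trips. One way around that, if you want to keep Selenium for loading the page, is to fetch the HTML once and parse it locally. The following is only a minimal sketch using the standard library: TableRowParser, rows_from_html and the sample markup are hypothetical names and data made up for illustration, and it assumes well-formed rows with explicit closing tags.

```python
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Collect the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # cells of the <tr> currently open, or None
        self._cell = None   # text fragments of the <td> currently open, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside an open <td>.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).replace("\n", "").strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:   # skip rows without <td> cells, e.g. header rows
                self.rows.append(self._row)
            self._row = None
            self._cell = None


def rows_from_html(html):
    """Return a list of rows, each a list of <td> texts, from an HTML string."""
    parser = TableRowParser()
    parser.feed(html)
    parser.close()
    return parser.rows


# Hypothetical sample markup; with Selenium you would pass dr.page_source instead.
sample = """
<table>
  <tr><th>Symbol</th><th>Strike Price</th></tr>
  <tr><td>IFCI</td><td>27.50</td></tr>
  <tr><td>EICHERMOT</td><td>20,800.00</td></tr>
</table>
"""
print(rows_from_html(sample))
```

With the driver from the question, the call would be rows_from_html(dr.page_source), which needs a single round trip for the whole page instead of one per cell.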
You may also find that the following runs in a fraction of the time; it finished in about a second on my laptop:
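import requests
from bs4 import BeautifulSoup, Tag
url = 'http://www.moneycontrol.com/stocks/fno/marketstats/options/active_calls/index.php'
r = requests.get(url)
soup = BeautifulSoup(r.content, "lxml")
table = soup.select_one("table.tblList")
# Header cells come from the first row; isinstance(..., Tag) skips the whitespace strings between tags.
cols = [th.text.strip() for th in table.select_one("tr") if isinstance(th, Tag)]
print(cols)
# "tr + tr" matches every row that directly follows another row, i.e. everything after the header.
elems = [[td.text.strip() for td in row if isinstance(td, Tag)] for row in table.select("tr + tr")]
print(elems)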
If we run the code:
In [13]: import requests
In [14]: from bs4 import BeautifulSoup, Tag
In [15]: url = 'http://www.moneycontrol.com/stocks/fno/marketstats/options/active_calls/index.php'
In [16]: r = requests.get(url)
In [17]: soup = BeautifulSoup(r.content, "lxml")
In [18]: table = soup.select_one("table.tblList")
In [19]: cols = [th.text.strip() for th in table.select_one("tr") if isinstance(th, Tag)]
In [20]: print(cols)
[u'Symbol', u'Expiry\n Date', u'Option Type', u'Strike Price', u'LastPrice', u'Change\n \t\t\t\t\t\t\t\tChg%', u'High\n Low', u'Shares', u'Contracts', u'Value (Rs. Lakh)', u'Open Interest', u'Open Int Chg']
In [21]: elems = [[td.text.strip() for td in row if isinstance(td, Tag)] for row in table.select("tr + tr")]
In [22]: print(elems[0])
[u'IFCI', u'30-Jun-16', u'CE', u'27.50', u'0.50', u'0.25100.00%', u'0.650.20', u'18,760,000', u'938', u'90.05', u'6,000,000', u'2,520,00072.41%']
In [23]: print(elems[-1])
[u'EICHERMOT', u'30-Jun-16', u'CE', u'20,800.00', u'30.00', u'-30.00-50.00%', u'30.0030.00', u'25', u'1', u'0.01', u'225', u'00.00%']
In [24]: len(elems)
Out[24]: 1585
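You can see that the table has 1585 rows. I have only printed the first and last rows because there is too much data to post, but the code gives you the complete table.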
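Some of the header strings in the output above still contain embedded newlines and tabs (for example u'Expiry\n Date'). If you want tidier column names, or rows keyed by column, something along these lines may help. It is only a sketch: normalize is a hypothetical helper, and the values are copied (shortened) from the session output rather than fetched live.

```python
def normalize(text):
    """Collapse any run of whitespace (spaces, tabs, newlines) into one space."""
    return " ".join(text.split())


# Header and first-row values copied from the session output above (shortened).
cols = ['Symbol', 'Expiry\n Date', 'Option Type', 'Strike Price',
        'Change\n \t\t\t\t\t\t\t\tChg%']
first_row = ['IFCI', '30-Jun-16', 'CE', '27.50', '0.25100.00%']

clean_cols = [normalize(c) for c in cols]
# Pair each cleaned column name with the matching cell of the row.
record = dict(zip(clean_cols, first_row))

print(clean_cols)
print(record)
```

Note that this only cleans the headers. Cells such as u'0.25100.00%' hold two values that were run together when the text was extracted, and splitting them would have to happen at extraction time.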