Getting values from an HTML table with Python

Date: 2017-05-01 11:21:43

Tags: python html web-scraping beautifulsoup pyqt4

I am trying to scrape the trading value (取引値) from the URL http://nextfunds.jp/lineup/1357/detail.html. If I use Inspect Element, I can see the value 1,875 (you can Ctrl+F for 取引値 or 1,875 to see which value I need). However, I do not see these values in the page source. My intent is to scrape them with Python. I tried:

import requests
from bs4 import BeautifulSoup

url = 'http://nextfunds.jp/lineup/1357/detail.html'
response = requests.get(url)
html = response.content
print(html)
soup = BeautifulSoup(html, 'html.parser')

Since neither 1,875 nor 取引値 appears in the HTML source, is there still a way to scrape these values? Thanks.

Update 1: tried lxml

import requests
from lxml import html

url = 'http://nextfunds.jp/lineup/1357/detail.html'
page = requests.get(url)
tree = html.fromstring(page.content)
# copied XPath using Chrome's Inspect Element
val = tree.xpath('//*[@id="include"]/div[1]/div[2]/table/tbody/tr[1]/td')
val
[]
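A side note on why XPaths copied from Chrome often come back empty even on static tables: the browser inserts a `<tbody>` element into the DOM that may not exist in the raw HTML, so a copied path containing `/tbody/` can fail against the downloaded source. (On this particular page the values are also JS-populated, so this alone does not explain the failure.) A minimal sketch with a made-up HTML snippet:

```python
from lxml import html

# Hypothetical raw HTML as a server might send it: no <tbody> tag.
raw = """
<table>
  <tr><td>取引値</td><td>1,875</td></tr>
</table>
"""
tree = html.fromstring(raw)

# The browser-copied path expects a <tbody> that is not in the source.
with_tbody = tree.xpath('//table/tbody/tr/td/text()')
# Dropping the tbody step (or using //) matches the actual markup.
without_tbody = tree.xpath('//table//tr/td/text()')

print(with_tbody)     # []
print(without_tbody)  # ['取引値', '1,875']
```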

Update 2: followed this guide: https://impythonist.wordpress.com/2015/01/06/ultimate-guide-for-scraping-javascript-rendered-web-pages/

Tried WebKit (very close to solving it):

import sys  
from PyQt4.QtGui import *  
from PyQt4.QtCore import *  
from PyQt4.QtWebKit import *  
from lxml import html 

# Take this class for granted; just use the result of the rendering.
class Render(QWebPage):  
  def __init__(self, url):  
    self.app = QApplication(sys.argv)  
    QWebPage.__init__(self)  
    self.loadFinished.connect(self._loadFinished)  
    self.mainFrame().load(QUrl(url))  
    self.app.exec_()  

  def _loadFinished(self, result):  
    self.frame = self.mainFrame()  
    self.app.quit()  

url = 'http://nextfunds.jp/lineup/1357/detail.html'  
r = Render(url)  
result = r.frame.toHtml()
# now write the result to a file and open it in a browser to copy the
# XPath of the desired table data
# but somehow some table values are missing (I thought it was a website issue, but it is not!)

Update 3 (got the values! Now stuck on selecting the table)

>>> import dryscrape
>>> from bs4 import BeautifulSoup
>>> url = 'http://nextfunds.jp/lineup/1357/detail.html'
>>> session = dryscrape.Session()
>>> session.visit(url)
>>> response = session.body()
>>> soup = BeautifulSoup(response, 'html.parser')
>>> html = soup.prettify("utf-8")
>>> f1 = open('page.html', 'wb')
>>> f1.write(html)
# Now I do see the required table values, but BeautifulSoup does not support XPath; I just need to select the table and save it as CSV.
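Although BeautifulSoup has no XPath support, a table can still be selected with its tag-searching API and written out with the standard csv module. A minimal sketch on a made-up HTML snippet (the table index and cell layout of the real page are assumptions):

```python
import csv
from bs4 import BeautifulSoup

# Hypothetical snippet standing in for the rendered page source.
rendered = """
<table><tr><td>取引値</td><td>1,875</td></tr>
<tr><td>出来高</td><td>184,101</td></tr></table>
"""
soup = BeautifulSoup(rendered, 'html.parser')

# Pick the table (index 0 here; the real page may need a different index).
table = soup.find_all('table')[0]
rows = [[td.get_text(strip=True) for td in tr.find_all('td')]
        for tr in table.find_all('tr')]

# Write each table row as one CSV line.
with open('table.csv', 'w', newline='', encoding='utf-8') as f:
    csv.writer(f).writerows(rows)

print(rows)  # [['取引値', '1,875'], ['出来高', '184,101']]
```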

Update 4: I found that the HTML I am interested in is referenced in the page source of the URL. I just need to search the page source for the pattern src="http://nam.qri.jp/cgi-bin/nextfunds/json?SRC=nextfunds/lineup&code=1570&auth= to get the link, and then use the code given in the answer section. So this is now more of a regex problem. I could do it with curl and grep, but I want to do it in Python only.
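Extracting that link with the standard re module could look like the sketch below; the page-source string is a shortened stand-in, and the exact URL parameters are assumptions:

```python
import re

# Shortened stand-in for the real page source.
page_source = '''
<script src="/js/common.js"></script>
<script src="http://nam.qri.jp/cgi-bin/nextfunds/json?SRC=nextfunds/lineup&code=1357&auth=abc123"></script>
'''

# Grab the value of the first src attribute whose URL contains "json?".
match = re.search(r'src="([^"]*json\?[^"]*)"', page_source)
if match:
    json_url = match.group(1)
    print(json_url)
```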

1 Answer:

Answer 0 (score: 1)

The site populates these values via JS. You can mimic those requests and get the data in JSON format. To get the values from the second table, you can use the following code:

import requests
from lxml import html

def parse(link='http://nextfunds.jp/lineup/1357/detail.html'):
    source = requests.get(link)
    t = html.fromstring(source.content)
    # get a url to json page from startpage source
    url = t.xpath('//@src[contains(.,"json?")]')[0]
    # request to json page
    req = requests.get(url)
    tree = html.fromstring(req.content)
    # parse json page and get value
    data = tree.xpath('(//table)[2]//td/text()')
    for item in data[::2]:
        print(item.encode('ascii', 'ignore').decode())

parse()
# 1,847
# 184,101
# 2,112.71
# 1,209
# Also works with a different URL
parse('http://nextfunds.jp/lineup/1627/detail.html')
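If the goal is still a CSV file, the parsed cells could be written out with the standard csv module. The sketch below pairs hypothetical labels with values; that the cell list alternates this way is an assumption suggested by the data[::2] slicing in the answer, not something the JSON page is known to guarantee:

```python
import csv

# Hypothetical scraped cells, assumed to alternate label and value.
data = ['取引値', '1,847', '出来高', '184,101']

# Pair every label with the value that follows it.
rows = list(zip(data[::2], data[1::2]))
with open('values.csv', 'w', newline='', encoding='utf-8') as f:
    csv.writer(f).writerows(rows)

print(rows)  # [('取引値', '1,847'), ('出来高', '184,101')]
```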