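I am trying to scrape the trade price (取引値) from http://nextfunds.jp/lineup/1357/detail.html. With Inspect Element I can see the value 1,875 (you can Ctrl+F for 取引値 or 1,875 to see which value I need), but these values do not appear in the page source.
I want to do the scraping in Python. I tried: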
import requests
from bs4 import BeautifulSoup

url = 'http://nextfunds.jp/lineup/1357/detail.html'
response = requests.get(url)
html = response.content
print(html)
soup = BeautifulSoup(html, 'html.parser')
Since neither 1,875 nor 取引値 is in the HTML source, is there still a way to scrape these values? Thanks.
Update 1: tried lxml
from lxml import html
page = requests.get(url)
tree=html.fromstring(page.content)
#copied xpath using chrome inspect element
val= tree.xpath('//*[@id="include"]/div[1]/div[2]/table/tbody/tr[1]/td')
val
[]
Tried WebKit (very close to a solution)
import sys
from PyQt4.QtGui import *
from PyQt4.QtCore import *
from PyQt4.QtWebKit import *
from lxml import html

# Take this class for granted. Just use the result of rendering.
class Render(QWebPage):
    def __init__(self, url):
        self.app = QApplication(sys.argv)
        QWebPage.__init__(self)
        self.loadFinished.connect(self._loadFinished)
        self.mainFrame().load(QUrl(url))
        self.app.exec_()

    def _loadFinished(self, result):
        self.frame = self.mainFrame()
        self.app.quit()

url = 'http://nextfunds.jp/lineup/1357/detail.html'
r = Render(url)
result = r.frame.toHtml()
# Now write result to a file and open it in a browser to copy the XPath of the desired table data,
# but somehow some table values are missing (I thought it was a website issue, but no!)
Update 3 (got the values! Now stuck on selecting the table)
>>> import dryscrape
>>> from bs4 import BeautifulSoup
>>> session = dryscrape.Session()
>>> session.visit(url)
>>> response = session.body()
>>> soup = BeautifulSoup(response)
>>> html = soup.prettify("utf-8")
>>> f1.write(html)
# Now I can see the table values I need, but BeautifulSoup does not support XPath. I just need to select the table and save it as CSV.
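One way to select a table without XPath is BeautifulSoup's own `find_all`, combined with the standard `csv` module. The following is only a minimal sketch: the helper names `table_to_rows` and `rows_to_csv`, the `table_index` parameter, and the sample markup are all made up for illustration, and the real page may need a different table index or extra handling for nested tables.

```python
import csv
import io

from bs4 import BeautifulSoup


def table_to_rows(page_html, table_index=0):
    """Return the cell texts of one <table> as a list of rows (hypothetical helper)."""
    soup = BeautifulSoup(page_html, "html.parser")
    table = soup.find_all("table")[table_index]
    rows = []
    for tr in table.find_all("tr"):
        # Collect header and data cells in document order.
        cells = [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
        if cells:
            rows.append(cells)
    return rows


def rows_to_csv(rows):
    """Serialize rows to CSV text (hypothetical helper)."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


# Made-up markup standing in for the rendered page returned by session.body().
sample_html = """
<table><tr><td>ignored</td></tr></table>
<table>
  <tr><th>取引値</th><td>1,875</td></tr>
  <tr><th>other</th><td>184,101</td></tr>
</table>
"""
rows = table_to_rows(sample_html, table_index=1)
print(rows_to_csv(rows))
```

In the session above, `response` would take the place of `sample_html`, and the CSV text could be written to a file opened with `encoding="utf-8"`.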
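Update 4
I found that the HTML I am interested in is referenced in the page source of the URL. I just need to search the page source for the pattern src="http://nam.qri.jp/cgi-bin/nextfunds/json?SRC=nextfunds/lineup&code=1570&auth=
to get the link, and then use the code given in the answer section. So this is now more of a regular-expression question. I could do it with curl and grep, but I want to do it in Python.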
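The pattern search described above could be done with Python's `re` module along these lines. This is a sketch under assumptions: the function name `find_json_url` is hypothetical, the `auth` value in the sample is a placeholder rather than a real token, and the raw source may or may not escape `&` as `&amp;`, so the code handles both.

```python
import re

# Assumed shape of the attribute: src="http://nam.qri.jp/cgi-bin/nextfunds/json?..."
JSON_SRC_RE = re.compile(r'src="(http://nam\.qri\.jp/cgi-bin/nextfunds/json\?[^"]*)"')


def find_json_url(page_source):
    """Return the first matching json URL in the page source, or None (hypothetical helper)."""
    match = JSON_SRC_RE.search(page_source)
    if match is None:
        return None
    # Raw HTML may write "&" as "&amp;" inside attributes; undo that.
    return match.group(1).replace("&amp;", "&")


# Made-up snippet standing in for requests.get(url).text; the auth value is a placeholder.
sample_source = (
    '<script src="http://nam.qri.jp/cgi-bin/nextfunds/json'
    '?SRC=nextfunds/lineup&amp;code=1357&amp;auth=PLACEHOLDER"></script>'
)
print(find_json_url(sample_source))
```

The returned URL can then be passed to `requests.get` in the same way as in the answer's code.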
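Answer 0 (score: 1)
The site populates these values via JavaScript. You can replicate those requests and fetch that data from the json page. To get the values from the second table, you can use the following code: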
import requests
from lxml import html

def parse(link='http://nextfunds.jp/lineup/1357/detail.html'):
    source = requests.get(link)
    t = html.fromstring(source.content)
    # get the URL of the json page from the start page source
    url = t.xpath('//@src[contains(.,"json?")]')[0]
    # request the json page
    req = requests.get(url)
    tree = html.fromstring(req.content)
    # parse the json page and get the values
    data = tree.xpath('(//table)[2]//td/text()')
    for item in data[::2]:
        print(item.encode('ascii', 'ignore').decode())

parse()
# 1,847
# 184,101
# 2,112.71
# 1,209

# Can work with a different URL
parse('http://nextfunds.jp/lineup/1627/detail.html')