Web scraping with Python: information incomplete, hidden by the togostanza framework

Date: 2017-12-03 03:07:46

Tags: javascript python web-scraping

Thanks in advance, everyone. I am new to web scraping and to Stack Overflow. I am trying to scrape some biological data from https://glytoucan.org/Structures/Glycans/G00055MO.

The link I want to extract comes from a table on the page:

[screenshot: the table on the page containing the link]

The outerHTML of the link is:

<a href="http://identifiers.org/pubmed/7503987" target="_blank">7503987</a>

It looks like it is embedded in a "togostanza" framework.
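As an aside, once the fully rendered HTML is actually available, pulling the href out of an anchor like the one above is straightforward with the standard library alone. A minimal sketch using `html.parser` (the anchor string below is hard-coded from the snippet shown above):

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect (href, text) pairs for every <a> tag encountered."""
    def __init__(self):
        super().__init__()
        self.links = []           # list of (href, link text) tuples
        self._current_href = None

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._current_href = dict(attrs).get("href")

    def handle_data(self, data):
        # Record the text that follows an opening <a> tag, then reset.
        if self._current_href is not None:
            self.links.append((self._current_href, data.strip()))
            self._current_href = None

snippet = '<a href="http://identifiers.org/pubmed/7503987" target="_blank">7503987</a>'
parser = LinkExtractor()
parser.feed(snippet)
print(parser.links)  # [('http://identifiers.org/pubmed/7503987', '7503987')]
```

The hard part in this question is not the parsing but getting the rendered HTML in the first place, since the element lives inside the togostanza component.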

[screenshot: page source showing the togostanza element]

I tried two different approaches to find the link, but the HTML I get back is incomplete. The code I tried is attached here:

Approach 1:

import sys
from PyQt4.QtGui import *
from PyQt4.QtCore import *
from PyQt4.QtWebKit import *
from lxml import html

class Render(QWebPage):

    def __init__(self, url):
        self.app = QApplication(sys.argv)
        QWebPage.__init__(self)
        self.loadFinished.connect(self._loadFinished)
        self.mainFrame().load(QUrl(url))
        self.app.exec_()

    def _loadFinished(self, result):
        self.frame = self.mainFrame()
        self.app.quit()
a=r"https://glytoucan.org/Structures/Glycans/G00055MO"

_glycan1= Render(a)
_result_glycan = _glycan1.frame.toHtml()
# print(_result_glycan)
_formatted_result = str(_result_glycan.encode('utf-8'))
# print(_formatted_result)
_tree = html.fromstring(_formatted_result)
# print(_tree)
_archive_links = _tree.xpath('//a/@href')
print(_archive_links)

The list of links returned by this approach does not contain the link I am looking for.

Approach 2:

from selenium import webdriver
from selenium.webdriver.common.keys import Keys

driver = webdriver.Chrome()
driver.get("https://glytoucan.org/Structures/Glycans/G00055MO")
elem = driver.find_element_by_xpath("//*[@id='literature']/togostanza-literature//main/ul/li/ul/li[1]")

This approach cannot find the element with the XPath I entered.

Could someone help me find an alternative way to get the data? I would appreciate it.

Thanks, Bokan

---- Closing ---- Thank you all for helping me reformat the question. This is my first post on Stack Overflow!

I tried the second approach with both PhantomJS and the Firefox driver. In the end, the Firefox webdriver worked.

1 Answer:

Answer 0 (score: 0):

It seems the JS is calling this internal API. The input parameter is a urlencoded query like the following:

PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX bibo: <http://purl.org/ontology/bibo/>
PREFIX glycan: <http://purl.jp/bio/12/glyco/glycan#>
PREFIX glytoucan: <http://www.glytoucan.org/glyco/owl/glytoucan#>

SELECT DISTINCT ?from ?partner_url ?description ?pubmed_id ?pubmed_url
WHERE{
    VALUES ?accNum {"G00055MO"}
    ?saccharide  glytoucan:has_primary_id ?accNum .

    GRAPH ?graph {
        ?saccharide dcterms:references ?article .
        ?article a bibo:Article .
        ?article dcterms:identifier ?pubmed_id .
        ?article rdfs:seeAlso ?pubmed_url .
    }
    ?graph rdfs:label ?from .
    OPTIONAL {?graph rdfs:seeAlso ?partner_url.}
    ?graph dcterms:description ?description.
} ORDER by ?from
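"urlencoded" here just means the query text goes percent-encoded into the `query` parameter of the request URL. A minimal sketch with the standard library (the endpoint URL matches the one used in the code below; the query is shortened to a stand-in for the full one above):

```python
from urllib.parse import urlencode

# Stand-in for the full SPARQL query shown above
query = 'SELECT DISTINCT ?s WHERE { ?s ?p ?o } LIMIT 1'

# urlencode percent-escapes the ?, {, } characters for use in a URL
params = urlencode({'query': query})
url = 'https://ts.glytoucan.org/sparql?' + params
print(url)
```

The `requests` call below does exactly this encoding internally via its `params` argument.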

Using the following will fetch your links:

import requests

query = """
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX bibo: <http://purl.org/ontology/bibo/>
PREFIX glycan: <http://purl.jp/bio/12/glyco/glycan#>
PREFIX glytoucan: <http://www.glytoucan.org/glyco/owl/glytoucan#>

SELECT DISTINCT ?from ?partner_url ?description ?pubmed_id ?pubmed_url
WHERE{
    VALUES ?accNum {"G00055MO"}
    ?saccharide  glytoucan:has_primary_id ?accNum .

    GRAPH ?graph {
        ?saccharide dcterms:references ?article .
        ?article a bibo:Article .
        ?article dcterms:identifier ?pubmed_id .
        ?article rdfs:seeAlso ?pubmed_url .
    }
    ?graph rdfs:label ?from .
    OPTIONAL {?graph rdfs:seeAlso ?partner_url.}
    ?graph dcterms:description ?description.
} ORDER by ?from
"""

# Ask the endpoint for JSON results rather than the default XML
headers = {'Accept': 'application/sparql-results+json'}
payload = {'query': query}

r = requests.get('https://ts.glytoucan.org/sparql', params=payload, headers=headers)

print(r.status_code)
data = r.json()
# Each binding is one result row; pull out the pubmed_url values
links = [t["pubmed_url"]["value"] for t in data["results"]["bindings"]]
print(links)
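If you also want the bare PubMed IDs rather than the full URLs, they can be recovered from the last path segment, assuming every URL follows the identifiers.org pattern shown in the question:

```python
# Example value in the same shape as the links list returned above
links = ['http://identifiers.org/pubmed/7503987']

# The PubMed ID is the final path segment of each identifiers.org URL
pubmed_ids = [url.rstrip('/').rsplit('/', 1)[-1] for url in links]
print(pubmed_ids)  # ['7503987']
```

Note that the query already binds `?pubmed_id` directly, so you could equally read `t["pubmed_id"]["value"]` from each binding instead.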