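Thanks in advance, everyone. I'm new to web scraping and to Stack Overflow. I'm trying to scrape some biological data from https://glytoucan.org/Structures/Glycans/G00055MO.
The links I want come from a table on that page.
The outerHTML of one of them is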
<a href="http://identifiers.org/pubmed/7503987" target="_blank">7503987</a>
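It appears to be embedded in the "togostanza" framework.
I tried two different methods to find the links, but the HTML I get back is incomplete. The methods I tried are shown here:
Method 1: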
import sys
from PyQt4.QtGui import *
from PyQt4.QtCore import *
from PyQt4.QtWebKit import *
from lxml import html

class Render(QWebPage):
    def __init__(self, url):
        self.app = QApplication(sys.argv)
        QWebPage.__init__(self)
        self.loadFinished.connect(self._loadFinished)
        self.mainFrame().load(QUrl(url))
        self.app.exec_()

    def _loadFinished(self, result):
        self.frame = self.mainFrame()
        self.app.quit()

a = r"https://glytoucan.org/Structures/Glycans/G00055MO"
_glycan1 = Render(a)
_result_glycan = _glycan1.frame.toHtml()
# print(_result_glycan)
_formatted_result = str(_result_glycan.encode('utf-8'))
# print(_formatted_result)
_tree = html.fromstring(_formatted_result)
# print(_tree)
_archive_links = _tree.xpath('//a/@href')
print(_archive_links)
This method returns a list of links, but the link I'm looking for is not in it.
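One way to confirm that the rendered HTML really lacks the target link is to filter the hrefs by the prefix seen in the anchor above. This is only a minimal sketch using the standard library's html.parser; the names HrefCollector and pubmed_links are made up for illustration and the prefix is an assumption based on the single anchor quoted in the question.

```python
from html.parser import HTMLParser

# Prefix taken from the anchor quoted in the question (assumed to be typical).
PUBMED_PREFIX = "http://identifiers.org/pubmed/"


class HrefCollector(HTMLParser):
    """Collect the href attribute of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def pubmed_links(html_text):
    """Return the hrefs in html_text that start with PUBMED_PREFIX."""
    collector = HrefCollector()
    collector.feed(html_text)
    return [h for h in collector.hrefs if h.startswith(PUBMED_PREFIX)]


sample = '<a href="http://identifiers.org/pubmed/7503987" target="_blank">7503987</a>'
print(pubmed_links(sample))
```

Running pubmed_links on the HTML string returned by the renderer and getting an empty list would suggest the table is filled in by JavaScript after the point at which the HTML was captured.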
Method 2:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
driver = webdriver.Chrome()
driver.get("https://glytoucan.org/Structures/Glycans/G00055MO")
elem = driver.find_element_by_xpath("//*[@id='literature']/togostanza-literature//main/ul/li/ul/li[1]")
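This method cannot find the XPath I entered.
Could someone help me find an alternative way to get the data? I'd appreciate it.
Thanks, Bokan
---- Closing ---- Thank you all for helping me reformat the question. This is my first post on Stack Overflow!
I tried the second method with the PhantomJS and Firefox drivers. In the end, the Firefox webdriver worked.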
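Answer 0 (score: 0)
It seems the JS is calling an internal API (the SPARQL endpoint at https://ts.glytoucan.org/sparql). The input parameter is a URL-encoded SPARQL query like this one: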
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX bibo: <http://purl.org/ontology/bibo/>
PREFIX glycan: <http://purl.jp/bio/12/glyco/glycan#>
PREFIX glytoucan: <http://www.glytoucan.org/glyco/owl/glytoucan#>
SELECT DISTINCT ?from ?partner_url ?description ?pubmed_id ?pubmed_url
WHERE{
VALUES ?accNum {"G00055MO"}
?saccharide glytoucan:has_primary_id ?accNum .
GRAPH ?graph {
?saccharide dcterms:references ?article .
?article a bibo:Article .
?article dcterms:identifier ?pubmed_id .
?article rdfs:seeAlso ?pubmed_url .
}
?graph rdfs:label ?from .
OPTIONAL {?graph rdfs:seeAlso ?partner_url.}
?graph dcterms:description ?description.
} ORDER by ?from
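To make "URL-encoded" concrete, here is a rough sketch of how a query string ends up in the request URL, using only urllib.parse from the standard library. The helper name build_sparql_url and the short placeholder query are assumptions for illustration; the endpoint is the one used in this answer.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

ENDPOINT = "https://ts.glytoucan.org/sparql"  # endpoint used in this answer


def build_sparql_url(endpoint, query):
    """Return endpoint plus the query passed as a URL-encoded 'query' parameter."""
    return endpoint + "?" + urlencode({"query": query})


# Placeholder query for illustration only, not the one the page sends.
short_query = 'SELECT ?s WHERE { ?s ?p "G00055MO" } LIMIT 1'
url = build_sparql_url(ENDPOINT, short_query)
print(url)

# Decoding the URL again recovers the original query text.
decoded = parse_qs(urlsplit(url).query)["query"][0]
print(decoded == short_query)
```

When requests.get is given params=..., it performs this kind of encoding for you, so the manual version is mainly useful for reading the request in the browser's network tab.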
The following will fetch your links:
import requests
query = """
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX bibo: <http://purl.org/ontology/bibo/>
PREFIX glycan: <http://purl.jp/bio/12/glyco/glycan#>
PREFIX glytoucan: <http://www.glytoucan.org/glyco/owl/glytoucan#>
SELECT DISTINCT ?from ?partner_url ?description ?pubmed_id ?pubmed_url
WHERE{
VALUES ?accNum {"G00055MO"}
?saccharide glytoucan:has_primary_id ?accNum .
GRAPH ?graph {
?saccharide dcterms:references ?article .
?article a bibo:Article .
?article dcterms:identifier ?pubmed_id .
?article rdfs:seeAlso ?pubmed_url .
}
?graph rdfs:label ?from .
OPTIONAL {?graph rdfs:seeAlso ?partner_url.}
?graph dcterms:description ?description.
} ORDER by ?from
"""
headers = {'Accept': 'application/sparql-results+json'}
payload = {'query': query}
r = requests.get('https://ts.glytoucan.org/sparql', params=payload, headers=headers)
print(r.status_code)
data = r.json()
links = [ t["pubmed_url"]["value"] for t in data["results"]["bindings"] ]
print(links)
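The list comprehension above assumes every row binds pubmed_url. If you also want a variable that sits inside an OPTIONAL clause (such as partner_url), or want duplicates removed, a slightly more defensive helper may be useful. This is a sketch only: extract_values is a made-up name, and the sample dictionary is hand-written in the same shape the code above indexes, not real output from the endpoint.

```python
def extract_values(results, variable):
    """Return the distinct values bound to `variable`, in first-seen order.

    `results` is expected to have the layout indexed in the answer:
    results -> bindings -> <variable> -> value.
    Rows where the variable is unbound (for example from OPTIONAL) are skipped.
    """
    seen = []
    for row in results.get("results", {}).get("bindings", []):
        cell = row.get(variable)
        value = cell.get("value") if cell else None
        if value is not None and value not in seen:
            seen.append(value)
    return seen


# Hand-written sample in the expected shape; not real endpoint output.
sample = {
    "head": {"vars": ["pubmed_id", "pubmed_url", "partner_url"]},
    "results": {
        "bindings": [
            {"pubmed_id": {"type": "literal", "value": "7503987"},
             "pubmed_url": {"type": "uri",
                            "value": "http://identifiers.org/pubmed/7503987"}},
            {"pubmed_id": {"type": "literal", "value": "7503987"},
             "pubmed_url": {"type": "uri",
                            "value": "http://identifiers.org/pubmed/7503987"}},
        ]
    },
}

print(extract_values(sample, "pubmed_url"))
print(extract_values(sample, "partner_url"))
```

With the real response you would call extract_values(r.json(), "pubmed_url") in place of the list comprehension.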