Question

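I want to download the first PDB file from the search results (the name and download link are given below). I am using Python, Selenium, and BeautifulSoup. This is the code I have so far.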

import urllib2
from BeautifulSoup import BeautifulSoup
from selenium import webdriver


uni_id = "P22216"

# set parameters
download_dir = "/home/home/Desktop/"
url = "http://www.rcsb.org/pdb/search/smart.do?smartComparator=and&smartSearchSubtype_0=UpAccessionIdQuery&target=Current&accessionIdList_0=%s" % uni_id

print "url - ", url

# the browser was never created in the original snippet
browser = webdriver.Firefox()

# opening the url
text = urllib2.urlopen(url).read()

#print "text : ", text
soup = BeautifulSoup(text)
#print soup
print

table = soup.find("table", {"class": "queryBlue"})
#print "table : ", table

status = 0
rows = table.findAll('tr')
for tr in rows:
    try:
        cols = tr.findAll('td')
        if cols:
            link = cols[1].find('a').get('href')
            print "link : ", link
            if link:
                if status == 1:
                    main_url = "http://www.rcsb.org" + link
                    print "main_url-----", main_url
                    status = False
                    # WebDriver has no click(url) method; navigate with get()
                    browser.get(main_url)
        status += 1

    except:
        pass

My table comes back as None. How can I download the first file in the search list? (In this case it is 2YGV.)

The download link is: /pdb/protein/P32447

Answer 1

I'm not sure exactly what you want to download, but here is an example of how to download the 2YGV file:

import urllib
import urllib2
from bs4 import BeautifulSoup

uni_id = "P22216"
url = "http://www.rcsb.org/pdb/search/smart.do?smartComparator=and&smartSearchSubtype_0=UpAccessionIdQuery&target=Current&accessionIdList_0=%s" % uni_id

# fetch and parse the search results page
text = urllib2.urlopen(url).read()
soup = BeautifulSoup(text)

# the download icon is a <span> inside the <a> that carries the href
link = soup.find("span", {"class": "iconSet-main icon-download"}).parent.get("href")

# save the file under the ID that follows the last "=" in the link
urllib.urlretrieve("http://www.rcsb.org/" + str(link), str(link.split("=")[-1]) + ".pdb")

This script downloads the file from the link on the page. It does not need Selenium; I used urllib to retrieve the file. For details on downloading files with urllib, read this post.
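The URL and file name handling in that last line can be looked at on its own. Below is a minimal Python 3 sketch (the answer's code is Python 2) that builds the absolute URL with `urljoin` and takes the file name from the query string. The function name `download_target`, the `structureId` parameter name, and the example href are assumptions made for illustration; they are not confirmed by the page.

```python
from urllib.parse import urljoin, urlsplit, parse_qs

BASE_URL = "http://www.rcsb.org/"


def download_target(href, base_url=BASE_URL):
    """Return (absolute_url, filename) for a relative download href."""
    # urljoin avoids the double slash produced by "http://www.rcsb.org/" + "/pdb/..."
    absolute_url = urljoin(base_url, href)
    query = parse_qs(urlsplit(absolute_url).query)
    # Assumption: the ID is in a "structureId" parameter. If it is not,
    # fall back to the answer's approach of taking the text after the last "=".
    structure_id = query.get("structureId", [href.split("=")[-1]])[0]
    return absolute_url, structure_id + ".pdb"


# Hypothetical href, shaped like the links the answer's code expects.
example_href = "/pdb/download/downloadFile.do?fileFormat=pdb&compression=NO&structureId=2YGV"
url, filename = download_target(example_href)
print(url)
print(filename)
```

The returned pair can be passed straight to `urllib.request.urlretrieve(url, filename)` in Python 3.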

Edit

Or use this code to find the download link (it all depends on which URL you want to download the file from):

import urllib
import urllib2
from bs4 import BeautifulSoup

uni_id = "P22216"
url = "http://www.rcsb.org/pdb/search/smart.do?smartComparator=and&smartSearchSubtype_0=UpAccessionIdQuery&target=Current&accessionIdList_0=%s" % uni_id
text = urllib2.urlopen(url).read()
soup = BeautifulSoup(text)
table = soup.find("table", {"class": "queryBlue"})
link = table.find("a", {"class": "tooltip"}).get("href")
urllib.urlretrieve("http://www.rcsb.org/" + str(link), str(link.split("=")[-1]) + ".pdb")
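The question reports that the table lookup returns None, and the code above would fail the same way if the page has no `queryBlue` table. One way to handle that is to check for the missing table before using it. The sketch below is Python 3 and deliberately swaps BeautifulSoup for the standard library's `html.parser` so that it runs without extra packages; it is not the answer's method. The class `FirstTableLink`, the function `first_table_link`, and both sample HTML strings are hypothetical.

```python
from html.parser import HTMLParser


class FirstTableLink(HTMLParser):
    """Record the href of the first <a> inside <table class="queryBlue">."""

    def __init__(self):
        super().__init__()
        self.table_depth = 0  # greater than 0 while inside the target table
        self.href = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "table":
            classes = (attrs.get("class") or "").split()
            # Count nested tables too, so the matching </table> is tracked.
            if self.table_depth or "queryBlue" in classes:
                self.table_depth += 1
        elif tag == "a" and self.table_depth and self.href is None:
            self.href = attrs.get("href")

    def handle_endtag(self, tag):
        if tag == "table" and self.table_depth:
            self.table_depth -= 1


def first_table_link(html):
    """Return the first link in the results table, or None if there is none."""
    parser = FirstTableLink()
    parser.feed(html)
    return parser.href


# Made-up pages: one with a results table, one without.
SAMPLE_WITH_TABLE = (
    '<table class="queryBlue"><tr><td>1</td>'
    '<td><a class="tooltip" href="/pdb/explore.do?structureId=2YGV">2YGV</a></td>'
    "</tr></table>"
)
SAMPLE_WITHOUT_TABLE = "<p>No results</p>"

for page in (SAMPLE_WITH_TABLE, SAMPLE_WITHOUT_TABLE):
    href = first_table_link(page)
    if href is None:
        print("no result table or link found")
    else:
        print("first link:", href)
```

With BeautifulSoup the equivalent check is testing the result of `soup.find(...)` for None before calling `.find` or `.findAll` on it.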

Here is an example of how to do what you asked in the comments:

import mechanize
from bs4 import BeautifulSoup

SEARCH_URL = "http://www.rcsb.org/pdb/home/home.do"
l = ["YGL130W", "YDL159W", "YOR181W"]
browser = mechanize.Browser()

for item in l:
    browser.open(SEARCH_URL)
    browser.select_form(nr=0)
    browser["q"] = item
    html = browser.submit()
    soup = BeautifulSoup(html)
    table = soup.find("table", {"class": "queryBlue"})
    if table:
        link = table.find("a", {"class": "tooltip"}).get("href")
        browser.retrieve("http://www.rcsb.org/" + str(link), str(link.split("=")[-1]) + ".pdb")[0]
        print "Downloaded " + item + " as " + str(link.split("=")[-1]) + ".pdb"
    else:
        print item + " was not found"

Output:

Downloaded YGL130W as 3KYH.pdb
Downloaded YDL159W as 3FWB.pdb
YOR181W was not found
