I've been using Beautiful Soup to extract information from the website http://slc.bioparadigms.org, but I'm only interested in the disease and the OMIM number. So for every SLC transporter that I already have in a list, I want to extract those two attributes. The problem is that both are tied to the class prt_col2, so if I search for that class I get a lot of hits. How can I get just the diseases? Also, sometimes there is no disease associated with an SLC transporter, or sometimes there is no OMIM number. How do I extract the information in those cases? I've put some screenshots below to show you what it looks like. Any help would be greatly appreciated! This is my first post here, so please forgive any mistakes or missing information. Thanks!
http://imgur.com/aTiGi84 and the other one is /L65HSym
Ideally, the output would be something like:
Transporter: SLC1A1
Disease: epilepsy OMIM: 12345

EDIT: my code so far:
import os
import re
import sys
import time

import requests
from bs4 import BeautifulSoup as BS

def hasNumbers(inputString):
    # True if the string contains at least one digit (transporter names do)
    return any(char.isdigit() for char in inputString)

def get_list(file):
    # Build a list of transporter names (e.g. SLC1A1) from a text file
    transporter_list = []
    lines = [line.rstrip('\n') for line in open(file)]
    for line in lines:
        if 'SLC' in line and hasNumbers(line):
            get_SLC = line.split()
            if 'SLC' in get_SLC[0]:
                transporter_list.append(get_SLC[0])
    return transporter_list

def get_transporter_webinfo(transporter_list):
    # Fetch the page of every transporter and dump the raw HTML to one file
    output_Website = open("output_website.txt", "w")
    for transporter in transporter_list:
        text = requests.get('http://slc.bioparadigms.org/protein?GeneName=' + transporter).text
        output_Website.write(text)  # output from the SLC tables website
        soup = BS(text, "lxml")
        disease = soup(text=re.compile('Disease'))  # every text node mentioning 'Disease'
        characteristics = soup.find_all("span", class_="prt_col2")
        memo = soup.find_all("span", class_='expandable prt_col2')
        print(transporter, disease, characteristics[6], memo)
    output_Website.close()

def convert(html_file):
    # Strip the HTML tags from the dumped pages
    file2 = open(html_file, 'r')
    clean_file = open('text_format_SLC', 'w')
    soup = BS(file2, 'lxml')
    clean_file.write(soup.get_text())
    clean_file.close()

def main():
    start_time = time.time()
    os.chdir('/home/Programming/Fun stuff')
    sys.stdout = open("output_SLC.txt", "w")
    SLC_list = get_list("SLC.txt")
    get_transporter_webinfo(SLC_list)  # already have the website content, so a little redundant
    print("this took", time.time() - start_time, "seconds to run")
    sys.stdout.flush()  # flush before convert() reads the same file back
    convert("output_SLC.txt")
    sys.stdout.close()

if __name__ == "__main__":
    main()
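Since the disease text and the OMIM number both live in spans of class prt_col2, searching by class alone cannot tell them apart; anchoring on the field's label text can. A minimal sketch, assuming the detail page carries literal label strings such as 'Disease: ' and 'OMIM:' (the answer below confirms this for this site; the helper name get_field is mine):

def get_field(soup, label):
    # Find the literal label text; the value sits in the element that
    # follows the label's parent. Returns None when the label is absent.
    hit = soup.find(string=label)
    if hit:
        return hit.findParent().findNextSibling().text.strip()
    return None

With that, get_field(soup, 'Disease: ') returns only the disease list, and it returns None for transporters that have no disease entry, which covers the missing-field cases mentioned above.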
Answer 0 (score: 0)
No offense, but I didn't want to read through the large amount of code in your question. I'd say it can be simplified.

You can get the complete list of links to the SLCs in the line that begins with SLCs =. The line after it shows how many there are, and the line after that shows, as an example, the href attribute contained in the last link.

In each SLC's page I look for the string 'Disease: ' and, if it's there, I navigate to the nearby element. I find the OMIM number in a similar way.

Note that I process only the first SLC.
>>> import requests
>>> import bs4
>>> main_url = 'http://slc.bioparadigms.org/'
>>> main_page = requests.get(main_url).content
>>> main_soup = bs4.BeautifulSoup(main_page, 'lxml')
>>> stem_url = 'http://slc.bioparadigms.org/protein?GeneName=SLC1A1'
>>> SLCs = main_soup.select('td.slct.tbl_cell.tbl_col1 a')
>>> len(SLCs)
418
>>> SLCs[-1].attrs['href']
'protein?GeneName=SLC52A3'
>>> stem_url = 'http://slc.bioparadigms.org/'
>>> for SLC in SLCs:
...     SLC_page = requests.get(stem_url + SLC.attrs['href']).content
...     SLC_soup = bs4.BeautifulSoup(SLC_page, 'lxml')
...     disease = SLC_soup.find_all(string='Disease: ')
...     if disease:
...         disease = disease[0]
...         diseases = disease.findParent().findNextSibling().text.strip()
...     else:
...         diseases = 'No diseases'
...     OMIM = SLC_soup.find_all(string='OMIM:')
...     if OMIM:
...         OMIM = OMIM[0]
...         number = OMIM.findParent().findNextSibling().text.strip()
...     else:
...         OMIM = 'No OMIM'
...         number = -1
...     print((SLC.text, number, diseases))
...     break
...
('SLC1A1', '133550', "Huntington's disease, epilepsy, ischemia, Alzheimer's disease, Niemann-Pick disease, obsessive-compulsive disorder")
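Since the session above stops after the first SLC (the break), a natural next step is to wrap the lookup in a function, drop the break, and print in the format the question asked for. A sketch along the same lines (the helper name get_labelled_value and its structure are mine; the label strings 'Disease: ' and 'OMIM:' and the CSS selector come from the session above):

import requests
import bs4

stem_url = 'http://slc.bioparadigms.org/'

def get_labelled_value(soup, label, default):
    # The value sits in the element following the parent of the label text.
    hit = soup.find(string=label)
    return hit.findParent().findNextSibling().text.strip() if hit else default

main_soup = bs4.BeautifulSoup(requests.get(stem_url).content, 'lxml')
for SLC in main_soup.select('td.slct.tbl_cell.tbl_col1 a'):
    page = requests.get(stem_url + SLC.attrs['href']).content
    soup = bs4.BeautifulSoup(page, 'lxml')
    print('Transporter:', SLC.text)
    print('Disease:', get_labelled_value(soup, 'Disease: ', 'No diseases'),
          'OMIM:', get_labelled_value(soup, 'OMIM:', -1))

Note that this hits the server once per transporter (418 requests), so a short time.sleep between iterations would be polite.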