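So I wrote a Python program that reads a CSV file of my gene accession numbers and tries to pull 335 URLs, one request per accession number, but I get:
InvalidSchema: No connection adapters were found for '[the urls ...]'
My code is: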
import urllib.request as urllib
from bs4 import BeautifulSoup

def fresh_soup(url):
    '''
    Collects and parses the page source from a given url, returns the parsed page source
    - url : the url you wish to scrape
    '''
    hdr = {'User-Agent': 'Mozilla/5.0'}
    req = urllib.Request(url,headers=hdr)
    source = urllib.urlopen(req,timeout=10).read()
    soup = BeautifulSoup(source,"lxml")
    return soup

###
import csv
result = []
for line in open("C:/Projects/NCBI Scraper project/geneAccNumbers.txt"):
    result.append(line.split('/t'))

csv = open("C:/Projects/NCBI Scraper project/geneAccNumbers.txt", 'r')
for gene in csv.readline().split('/t'):
    url = 'https://www.ncbi.nlm.nih.gov/nuccore/' + gene + '.1?report=fasta'

def build_url(gene):
    return 'https://www.ncbi.nlm.nih.gov/nuccore/' + gene + '.1?report=fasta'

genes_urls = [build_url(gene) for gene in csv]
print(genes_urls)

import requests
r = requests.get(genes_urls)
Is there anything I can do to make it request each URL correctly?
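A minimal sketch of one possible approach, not a definitive fix: `requests.get` takes a single URL string, so passing the whole list is likely what triggers `InvalidSchema`. Looping over the list and requesting one URL at a time avoids that. The helper name `fetch_each` and its `fetch` parameter are hypothetical, introduced here only for illustration; the default fetcher assumes the `requests` package is installed.

```python
def fetch_each(urls, fetch=None):
    """Request every URL separately and return {url: page_text}.

    `fetch` is a hypothetical hook: any callable that takes one URL string
    and returns the page text. When it is omitted, requests.get is used.
    """
    if fetch is None:
        # Imported lazily so the loop itself can be tried without the network.
        import requests

        def fetch(url):
            response = requests.get(url, timeout=10)  # one URL string per call
            response.raise_for_status()  # raise on 4xx/5xx instead of continuing silently
            return response.text

    pages = {}
    for url in urls:
        pages[url] = fetch(url)
    return pages
```

Calling `fetch_each(genes_urls)` would then return a dict mapping each URL to the text of its page.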
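On a side note: I think some of the generated URLs contain both backslashes and forward slashes, but when I paste them into my browser manually it responds as if that isn't a problem and I can still reach the page I want. Should I still try to use only one kind of slash?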
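A guess at the slash question, sketched under assumptions: the backslashes in the printed list are probably `\n` escapes from line endings left on each accession number (iterating over a file yields lines with their trailing newline), and `'/t'` is a literal slash followed by `t`, whereas a tab is `'\t'`. Stripping whitespace before building the URL would remove them. `read_accessions` is a hypothetical helper and the accession strings are made-up placeholders, not real records; `build_url` is the function from the question.

```python
def read_accessions(lines):
    """Split each line on real tabs and drop surrounding whitespace/newlines."""
    accessions = []
    for line in lines:
        for token in line.split("\t"):  # "\t" is a tab; "/t" is just slash + t
            token = token.strip()       # removes the trailing "\n" and stray spaces
            if token:                   # skip empty fields and blank lines
                accessions.append(token)
    return accessions


def build_url(gene):
    return "https://www.ncbi.nlm.nih.gov/nuccore/" + gene + ".1?report=fasta"


# Placeholder input standing in for the lines of the accession file.
sample_lines = ["ACC0001\tACC0002\n", "ACC0003\n", "\n"]
genes_urls = [build_url(gene) for gene in read_accessions(sample_lines)]
```

With an open file, `read_accessions(open(path))` would work the same way, since a file object yields its lines one by one.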