Question

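I wrote a Python program that reads a csv file of my gene accession numbers and tries to fetch 335 URLs, one request per accession number, but I get:

InvalidSchema: No connection adapters were found for '[the urls ...]'

My code is: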

import urllib.request as urllib

from bs4 import BeautifulSoup

def fresh_soup(url):
    '''
    Collects and parses the page source from a given url, returns the parsed page source
    - url : the url you wish to scrape
    '''
    hdr = {'User-Agent': 'Mozilla/5.0'}
    req = urllib.Request(url, headers=hdr)
    source = urllib.urlopen(req, timeout=10).read()
    soup = BeautifulSoup(source, "lxml")

    return soup
###


import csv

result = []
for line in open("C:/Projects/NCBI Scraper project/geneAccNumbers.txt"):
    result.append(line.split('/t'))

csv = open("C:/Projects/NCBI Scraper project/geneAccNumbers.txt", 'r')
for gene in csv.readline().split('/t'):
    url = 'https://www.ncbi.nlm.nih.gov/nuccore/' + gene + '.1?report=fasta'


def build_url(gene):
    return 'https://www.ncbi.nlm.nih.gov/nuccore/' + gene + '.1?report=fasta'

genes_urls = [build_url(gene) for gene in csv]


print(genes_urls)

import requests

r = requests.get(genes_urls)

Is there anything I can do to make it request each URL correctly?
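
For reference, requests.get expects a single URL string, and the error above appears to come from passing it the whole list at once. A minimal sketch of requesting the URLs one at a time might look like the following. The helper name fetch_fasta_pages and its get parameter are hypothetical, introduced only so the loop can be exercised with a stand-in instead of a live request.

```python
def fetch_fasta_pages(urls, get=None, timeout=10):
    """Request each URL separately and return a dict of url -> response text.

    `get` is a hypothetical hook: any callable with the same shape as
    requests.get(url, timeout=...). It defaults to requests.get.
    """
    if get is None:
        import requests  # imported lazily so a stand-in can be used instead
        get = requests.get

    pages = {}
    for url in urls:
        # One request per URL: requests.get takes a string, not a list.
        response = get(url, timeout=timeout)
        response.raise_for_status()  # raise on 4xx/5xx instead of storing an error page
        pages[url] = response.text
    return pages
```

With a list such as genes_urls, the call would be pages = fetch_fasta_pages(genes_urls). For 335 requests it may also be worth pausing briefly between iterations so the server is not hit too quickly.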

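As a side note: I think some of the generated URLs contain both backslashes and forward slashes, but when I copy one into a browser by hand it behaves as though that is not a problem and I still reach the page I want. Should I still try to use only one kind of slash?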
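
One possible source of the stray backslashes: iterating over a file yields lines that still end in a newline, and printing a list shows that newline as \n inside each string. Also, '/t' is a literal slash followed by t, whereas a tab is written '\t'. A minimal sketch of cleaning the fields before building the URLs, assuming the file is tab-separated; parse_accessions is a hypothetical helper and the accession values in the usage note are made up for illustration.

```python
def parse_accessions(lines):
    """Split tab-separated lines into clean accession strings.

    `lines` can be an open file object or any iterable of strings.
    """
    accessions = []
    for line in lines:
        for field in line.split('\t'):  # '\t' is a tab; '/t' would not match one
            field = field.strip()       # drop the trailing newline and stray spaces
            if field:                   # skip empty fields and blank lines
                accessions.append(field)
    return accessions


def build_url(gene):
    return 'https://www.ncbi.nlm.nih.gov/nuccore/' + gene + '.1?report=fasta'
```

For example, parse_accessions(["AB000001\tAB000002\n", "\n"]) gives ['AB000001', 'AB000002'], and passing each item to build_url then produces URLs with forward slashes only.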

Parsing requests generated from a csv?

0 answers: