Parsing requests generated from a CSV?

Time: 2018-08-05 19:11:29

Tags: python web-scraping beautifulsoup python-requests

I wrote a Python program that reads a CSV file of my gene accession numbers and tries to request 335 URLs, one built from each accession number, but I get:

InvalidSchema: No connection adapters were found for '[the urls ...]'

My code is:

import urllib.request as urllib

from bs4 import BeautifulSoup

def fresh_soup(url):
    '''
    Collects and parses the page source from a given url, returns the parsed page source
    - url : the url you wish to scrape
    '''
    hdr = {'User-Agent': 'Mozilla/5.0'}
    req = urllib.Request(url, headers=hdr)
    source = urllib.urlopen(req, timeout=10).read()
    soup = BeautifulSoup(source, "lxml")

    return soup
###


import csv

result = []
for line in open("C:/Projects/NCBI Scraper project/geneAccNumbers.txt"):
    result.append(line.split('/t'))

csv = open("C:/Projects/NCBI Scraper project/geneAccNumbers.txt", 'r')
for gene in csv.readline().split('/t'):
    url = 'https://www.ncbi.nlm.nih.gov/nuccore/' + gene + '.1?report=fasta'


def build_url(gene):
    return 'https://www.ncbi.nlm.nih.gov/nuccore/' + gene + '.1?report=fasta'

genes_urls = [build_url(gene) for gene in csv]


print(genes_urls)

import requests

r = requests.get(genes_urls)

Is there anything I can do so that it correctly requests each URL?
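One likely reading of the error, offered as a sketch rather than a confirmed diagnosis: requests.get expects a single URL string, while genes_urls is a list. The list is probably turned into one long string beginning with "[", which matches no http:// or https:// adapter, hence InvalidSchema. Looping over the list and requesting one URL at a time avoids that. The helper name fetch_all and its get parameter are hypothetical, introduced only so the loop can be exercised without network access; in real use you would pass requests.get.

```python
from typing import Callable, Dict, Iterable


def fetch_all(urls: Iterable[str], get: Callable[..., object],
              timeout: float = 10) -> Dict[str, object]:
    """Call `get` once per URL and collect the responses keyed by URL.

    `get` is assumed to have the same calling shape as requests.get,
    i.e. get(url, timeout=...).
    """
    responses = {}
    for url in urls:
        # One request per URL: a single string is passed, never the whole list.
        responses[url] = get(url, timeout=timeout)
    return responses


def main() -> None:
    # Third-party import kept local so fetch_all itself has no dependencies.
    import requests

    # Placeholder accession numbers (assumptions, not real data).
    genes_urls = [
        "https://www.ncbi.nlm.nih.gov/nuccore/" + gene + ".1?report=fasta"
        for gene in ["ACCESSION_A", "ACCESSION_B"]
    ]
    for url, r in fetch_all(genes_urls, requests.get).items():
        print(url, r.status_code)
```

Calling main() would issue the real requests. With 335 URLs it may be worth adding a short pause between calls and checking each status code before parsing the body.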

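As a side note: I think some of the generated URLs contain both backslashes and forward slashes, but when I copy them into a browser manually it behaves as if that is not a problem and I can still reach the page I want. Should I still try to use only one kind of slash?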
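On the slashes, a guess based only on the code shown: the split uses '/t' (a slash followed by the letter t) rather than the tab escape '\t', and iterating over the open file yields whole lines with their trailing newline. The backslashes in the printed list may therefore be \t and \n escape sequences inside the accession strings, not path separators. A minimal sketch of reading the file with the standard csv module and stripping whitespace before building each URL follows; read_accessions and the sample accession values are hypothetical, and the file is assumed to be tab-delimited.

```python
import csv
import io
from typing import Iterable, List

BASE = "https://www.ncbi.nlm.nih.gov/nuccore/"


def read_accessions(handle: Iterable[str]) -> List[str]:
    """Return every non-empty field from a tab-delimited file, stripped of whitespace."""
    reader = csv.reader(handle, delimiter="\t")
    return [field.strip() for row in reader for field in row if field.strip()]


def build_url(gene: str) -> str:
    """Build the FASTA report URL for one accession number."""
    return BASE + gene + ".1?report=fasta"


# In-memory stand-in for geneAccNumbers.txt (placeholder values).
sample = io.StringIO("GENE_A\tGENE_B\nGENE_C\n")
genes_urls = [build_url(gene) for gene in read_accessions(sample)]
print(genes_urls)
```

With a real file, open(path, newline="") can be passed to read_accessions in place of the StringIO object.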

0 answers:

No answers