Reading data from a bucket in Google ML Engine (TensorFlow)

Time: 2017-09-19 20:21:17

Tags: python tensorflow google-cloud-ml-engine

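I am having trouble reading data from a bucket hosted by Google. I have a bucket containing ~1000 files that I need to access, stored at (for example) gs://my-bucket/data

Using gsutil from the command line, or one of Google's other Python API clients, I can access the data in the bucket, but importing those APIs is not supported by default on google-cloud-ml-engine.

I need a way to access both the data and the names of the files, either with a default Python library (i.e. os) or with TensorFlow. I know TensorFlow has this functionality built in somewhere, but I have had a hard time finding it.

Ideally I am looking for a replacement for one command such as os.listdir() and another for open().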

In my code, read_training_data uses a TensorFlow reader object.

Thanks for any help! (Also, P.S., my data is binary.)

2 Answers:

Answer 0 (score: 3)

If you just want to read the data into memory, then this answer has the details you need, namely, use the file_io module.
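As a rough illustration of the in-memory approach: TensorFlow's file I/O wrappers accept gs:// paths as well as local ones, so they can stand in for os.listdir() and open(). The sketch below uses tf.io.gfile, which is the public name in newer TensorFlow releases; in the 1.x releases current when this was asked, the equivalents were tf.gfile.ListDirectory and tf.gfile.Open, and file_io itself lives at tensorflow.python.lib.io.file_io. The helper name read_all is made up for illustration, and the plain-Python fallback is an assumption added only so the snippet runs where TensorFlow is not installed; the fallback cannot read from a bucket.

```python
import os

try:
    import tensorflow as tf

    # TensorFlow's file wrappers understand gs:// paths as well as local ones.
    _listdir = tf.io.gfile.listdir
    _isdir = tf.io.gfile.isdir
    _open = tf.io.gfile.GFile
except (ImportError, AttributeError):
    # Assumption for illustration only: local-only fallback without TensorFlow.
    _listdir = os.listdir
    _isdir = os.path.isdir
    _open = open


def read_all(directory):
    """Return {file name: raw bytes} for every file directly under `directory`."""
    contents = {}
    for name in _listdir(directory):
        path = directory.rstrip("/") + "/" + name
        if _isdir(path):
            continue  # this sketch does not recurse into sub-directories
        with _open(path, "rb") as f:  # "rb" because the data is binary
            contents[name] = f.read()
    return contents
```

Calling read_all('gs://my-bucket/data') would then give both the file names and their contents, provided the job has read access to the bucket.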

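That said, you may want to consider using TensorFlow's built-in reading mechanisms, since they can be more performant.

Information on reading can be found here. The latest and greatest (but not yet part of official "core" TensorFlow) is the Dataset API (more info here).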

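Some things to keep in mind:

  • Are you using a format TensorFlow can read? Can it be converted to such a format?
  • Is the overhead of "feeding" high enough to affect training performance?
  • Is the training set too large to fit in memory?

If the answer to one or more of these questions is yes, especially the latter two, consider using readers.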

Answer 1 (score: 1)

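For what it's worth, I also had problems reading files, in particular binary files from Google Cloud Storage inside a Datalab notebook. The first way I managed to do it was to copy the files to my local filesystem with gsutil and then read them with TensorFlow as usual. That is what the local-read cell here shows, once the file copy has been done.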
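The copy step itself is not shown in this answer. A minimal sketch of it, assuming gsutil is installed and authenticated on the machine, could look like the following. The function names and the example paths are made up for illustration; the flags used are gsutil's -m (run operations in parallel) and cp -r (recursive copy).

```python
import subprocess


def build_gsutil_copy_command(source, destination, recursive=True, parallel=True):
    """Build the argument list for copying from Cloud Storage with gsutil."""
    command = ["gsutil"]
    if parallel:
        command.append("-m")  # top-level flag: run operations in parallel
    command.append("cp")
    if recursive:
        command.append("-r")  # copy a whole directory tree
    command.extend([source, destination])
    return command


def copy_from_bucket(source, destination):
    """Run the copy; raises CalledProcessError if gsutil exits with an error."""
    subprocess.check_call(build_gsutil_copy_command(source, destination))


# Hypothetical usage (needs gsutil and access to the bucket):
# copy_from_bucket("gs://my-bucket/data", "./data")
```

In a notebook the same command can also be run directly in a cell with a leading "!", which avoids the subprocess wrapper.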

Here is my setup cell:

import math
import shutil
import numpy as np
import pandas as pd
import tensorflow as tf

tf.logging.set_verbosity(tf.logging.INFO)
pd.options.display.max_rows = 10
pd.options.display.float_format = '{:.1f}'.format

Here is a cell that reads the file locally as a sanity check.

# this works for reading local file
audio_binary_local = tf.read_file("100852.mp3")
waveform = tf.contrib.ffmpeg.decode_audio(
    audio_binary_local, file_format='mp3',
    samples_per_second=44100, channel_count=2)
# this will show that it has two channels of data
with tf.Session() as sess:
    result = sess.run(waveform)
    print(result)

And here is reading the binary file directly from a gs: path.

# this works for remote files in gs:
gsfilename = 'gs://proj-getting-started/UrbanSound/data/air_conditioner/100852.mp3'
# python 2
#audio_binary_remote = tf.gfile.Open(gsfilename).read()
# python 3
audio_binary_remote = tf.gfile.Open(gsfilename, 'rb').read()
waveform = tf.contrib.ffmpeg.decode_audio(
    audio_binary_remote, file_format='mp3',
    samples_per_second=44100, channel_count=2)
# this will show that it has two channels of data
with tf.Session() as sess:
    result = sess.run(waveform)
    print(result)