Question

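Edit: Thanks to those who have responded so far. I'm a beginner in R and have just taken on a large project for my Master's dissertation, so I'm a bit overwhelmed by the initial processing. The data I'm using looks like this (from publicly available WMO rainfall data):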


120 6272100 KHARTOUM                   15.60   32.55  382 1899 1989  0.0
1899   0.03   0.03   0.03   0.03   0.03   1.03  13.03  12.03   9999   6.03   0.03   0.03
1900   0.03   0.03   0.03   0.03   0.03  23.03  80.03  47.03  23.03   8.03   0.03   0.03
1901   0.03   0.03   0.03   0.03   0.03  17.03  23.03  17.03   0.03   8.03   0.03   0.03
(...)
120 6272101 JEBEL AULIA                15.20   32.50  380 1920 1988  0.0
1920   0.03   0.03   0.03   0.00   0.03   6.90  20.00 108.80  47.30   1.00   0.01   0.03
1921   0.03   0.03   0.03   0.00   0.03   0.00  88.00  57.00  35.00  18.50   0.01   0.03
1922   0.03   0.03   0.03   0.00   0.03   0.00  87.50 102.30  10.40  15.20   0.01   0.03
(...)

There are ~100 observation stations that I'm interested in, each of which has a varying start and end date for rainfall measurements. They're formatted as above in a single data file, with stations separated by "120 (station number) (station name)".



I need first to separate this file by station, then to extract March, April, May and June for each year, then take a total of these months for each year. So far I'm messing around with loops (as below), but I understand this isn't the right way to go about it and would rather learn some better technique.
Thanks again for the help!

(Original question:)
I've got a large data set containing rainfall by season for ~100 years over 100+ locations. I'm trying to separate this data into more manageable arrays, and in particular I want to retrieve the sum of the rainfall for March, April, May and June for each station for each year.
The following is a simplified version of my code so far:

a <- array(1,dim=c(10,12))
for (i in 1:5) {

  #all data:
  assign(paste("station_",i,sep=""), a)

  #march - june data:
  assign(paste("station_",i,"_mamj",sep=""), a[,4:7])
}

Any help is much appreciated!

Answer 1

This is crying out for a data frame, and then it's just this one-liner using power tools like ddply (very powerful):

tot_mamj <- ddply(rain[rain$month %in% 3:6,-2], 'year', colwise(sum))

which gives the aggregate of the M/A/M/J totals by year:

   year station_1 station_2 station_3 station_4 station_5 ...
1  1972  8.618960  5.697739 10.083192  9.264512 11.152378 ...
2  1973 18.571748 18.903280 11.832462 18.262272 10.509621 ...
3  1974 22.415201 22.670821 32.850745 31.634717 20.523778 ...
4  1975 16.773286 17.683704 18.259066 14.996550 19.007762 ...
...

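Here is the fully working code. We create a data frame whose col.names are 'station_n', plus extra columns for year and month (a factor, or an integer if you're lazy; see the footnote). Now you can do arbitrary analysis by month or year (using plyr's split-apply-combine paradigm):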

require(plyr) # for d*ply, summarise
#require(reshape) # for melt

# Parameterize everything here, it's crucial for testing/debugging
all_years <- c(1970:2011)
nYears <- length(all_years)  
nStations <- 101
# We want station names as vector of chr (as opposed to simple indices)
station_names <- paste ('station_', 1:nStations, sep='')

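# One row per (year, month) pair; 'each' keeps the 12 months of a year together
rain <- data.frame(
  year  = rep(all_years, each = 12),
  month = rep(1:12, times = nYears)
)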
# Fill in NAs for all data
rain[,station_names] <- as.numeric(NA)
# Make 'month' a factor, to prevent any numerical funny stuff e.g accidentally 'aggregating' it
rain$month <- factor(rain$month)

# For convenience, store the row indices for all years, M/A/M/J
I.mamj <- which(rain$month %in% 3:6)

# Insert made-up seasonal data for M/A/M/J for testing... leave everything else NA intentionally
rain[I.mamj,station_names] <- c(3,5,9,6) * runif(4*nYears*nStations)

# Get our aggregate of MAMJ totals, by year
# The '-2' column index means: "exclude month, to prevent it also getting 'aggregated'"
excludeMonthCol = -2
tot_mamj <- ddply(rain[rain$month %in% 3:6, excludeMonthCol], 'year', colwise(sum))

# voila!!
#    year station_1 station_2 station_3 station_4 station_5
# 1  1972  8.618960  5.697739 10.083192  9.264512 11.152378
# 2  1973 18.571748 18.903280 11.832462 18.262272 10.509621
# 3  1974 22.415201 22.670821 32.850745 31.634717 20.523778
# 4  1975 16.773286 17.683704 18.259066 14.996550 19.007762

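As a footnote, before I converted month from numeric to factor, it was silently getting 'aggregated' as well (until I put in the '-2' column exclusion). Better still, once you make it a factor, it refuses to be aggregated and throws an error (which is what you want for debugging):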

 ddply(rain[rain$month %in% 3:6, ], 'year', colwise(sum))
Error in Summary.factor(c(3L, 3L, 3L, 3L, 3L, 3L), na.rm = FALSE) : 
  sum not meaningful for factors
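
To get from the raw text layout in the question to per-station March to June totals, a minimal base R sketch could look like the following. It rests on several assumptions that the question does not confirm: header lines start with "120 ", the second header field is the station number, each data line holds a year followed by 12 monthly values (January to December), and 9999 marks a missing value. The sample lines are copied from the question, and parse_station is a hypothetical helper name.

```r
# Sample lines in the layout shown in the question
# (with a real file you would use readLines() instead)
raw <- c(
  "120 6272100 KHARTOUM                   15.60   32.55  382 1899 1989  0.0",
  "1899   0.03   0.03   0.03   0.03   0.03   1.03  13.03  12.03   9999   6.03   0.03   0.03",
  "1900   0.03   0.03   0.03   0.03   0.03  23.03  80.03  47.03  23.03   8.03   0.03   0.03",
  "120 6272101 JEBEL AULIA                15.20   32.50  380 1920 1988  0.0",
  "1920   0.03   0.03   0.03   0.00   0.03   6.90  20.00 108.80  47.30   1.00   0.01   0.03",
  "1921   0.03   0.03   0.03   0.00   0.03   0.00  88.00  57.00  35.00  18.50   0.01   0.03"
)

# Assumption: every station block begins with a line starting "120 "
is_header  <- grepl("^120 ", raw)
station_id <- cumsum(is_header)   # running block number for every line

# Hypothetical helper: turn one station block into year / MAMJ-total rows
parse_station <- function(lines) {
  header <- strsplit(trimws(lines[1]), "\\s+")[[1]]
  vals <- do.call(rbind, lapply(strsplit(trimws(lines[-1]), "\\s+"), as.numeric))
  vals[vals == 9999] <- NA        # assumption: 9999 is the missing-value code
  data.frame(
    station = header[2],          # assumption: second field is the station number
    year    = vals[, 1],
    # column 1 is the year, so columns 4:7 are March..June
    mamj    = rowSums(vals[, 4:7, drop = FALSE]),
    stringsAsFactors = FALSE
  )
}

mamj_totals <- do.call(rbind, lapply(split(raw, station_id), parse_station))
print(mamj_totals)
```

The result is a long data frame with one row per station and year, which can be passed to ddply or aggregate for further summaries. A total comes out as NA whenever one of the four months is missing, which may or may not be what you want.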

Answer 2

For your original question, use get():

i <- 10
var <- paste("test", i, sep="_")
assign(var, 10)  # assign(name, value): the name comes first
get(var)

As David said, this is probably not the best path to take, but it can be useful at times (and IMO the assign/get construct is far better than eval(parse)).
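
Applied to the loop in the question, the assign/get pattern might look like the sketch below. It assumes the 12 columns of the placeholder array are January to December, so March to June are columns 3:6; adjust the index if your first column holds the year.

```r
a <- array(1, dim = c(10, 12))   # placeholder: 10 years x 12 months

for (i in 1:5) {
  name <- paste("station", i, sep = "_")
  assign(name, a)                      # creates station_1 ... station_5
  mamj <- get(name)[, 3:6]             # look the object up again by its name
  assign(paste(name, "mamj", sep = "_"), mamj)
}

dim(station_3_mamj)
```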

Answer 3

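Why use assign to create variables like station1, station2, station_3_mamj, and so on? It would be much easier and more intuitive to store them in a list, e.g. stations[[1]], stations[[2]], stations_mamj[[3]], and so on. Each one can then be accessed by its index.

Since the data for every station looks like a matrix of the same size, you could even handle them all as a single three-dimensional array.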
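
A rough sketch of both ideas, using made-up placeholder matrices (station i is filled with the value i so the results are easy to check) and assuming the 12 columns are January to December:

```r
n_stations <- 5
a <- array(1, dim = c(10, 12))   # placeholder: 10 years x 12 months

# A named list of station matrices instead of separate variables
stations <- lapply(seq_len(n_stations), function(i) a * i)
names(stations) <- paste("station", seq_len(n_stations), sep = "_")

# March-June total per year, for every station (columns 3:6)
mamj_totals <- lapply(stations, function(m) rowSums(m[, 3:6]))

# The same data as one 3-D array: year x month x station
rain3d <- array(unlist(stations), dim = c(10, 12, n_stations))
mamj3d <- apply(rain3d[, 3:6, ], c(1, 3), sum)   # year x station matrix

dim(mamj3d)
```

With the list you can loop or lapply over stations without building variable names from strings. The 3-D array only works if every station covers the same years, which the real data in the question does not, so the list is probably the safer choice there.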

ETA: By the way, if you really do want to solve the problem this way, you would do:

eval(parse(text=paste("station", i, "mamj", sep="_")))

However, using eval is almost always bad practice, and it makes even simple operations on your data difficult.

Referencing an object using a variable string in R