Merging a replicated element onto an existing vector of strings

Time: 2015-09-02 05:28:20

Tags: r

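I am restructuring a large weather dataset. I am trying to append a replicated string to a list, so that the repeated string is attached to every element of the list.

For example, imagine a table holding monthly temperature and precipitation (nedbor) for two different cities (K and S). It is currently structured so that each row represents a year from 2000 to 2015, and there is one column for each weather variable in each month. This makes for a very wide table (which is what I want).

The problem is that the data frame is built from 12 .csv files, each read into a data frame named after the month whose data it holds, plus two separate vectors describing a different variable (NAO) across the years.

The output of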
> Weather<-data.frame(Jan,Feb,Mar,Apr,May,Jun,Jul,Aug,Sep,Oct,Nov,Dec,NAO,NAOPrevYr)

produces a table with 16 rows (one for each year 2000-2015) and 170 columns, structured so that these columns:

(Year, Month, S.HighTemp, S.LowTemp, S.MeanTemp, S.Nedbor, S.Nedbordage, K.Year, K.Month, K.HighTemp, K.LowTemp, K.MeanTemp,K.Nedbor,K.Nedbordage)

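are associated with each month (14 * 12 = 168), with two more columns (NAO and NAOPrevYr) at the end. The entries in the "Month" columns are obviously repeated within their respective months. However, because every source file contains the same column names, the column names in the Weather data frame are followed by ".1" for the February block, ".2" for March, and so on.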
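A minimal sketch of where those numeric suffixes come from, using two made-up two-column data frames as stand-ins for the real monthly files (the values are invented for illustration only). By default, data.frame() makes duplicated column names unique by appending ".1", ".2", and so on:

```r
# Hypothetical stand-ins for two of the monthly data frames
Jan <- data.frame(Year = 2000:2001, S.HighTemp = c(3.1, 2.4))
Feb <- data.frame(Year = 2000:2001, S.HighTemp = c(4.0, 3.3))

# Combining them column-wise: duplicated names are made unique
combined <- data.frame(Jan, Feb)
combined_names <- names(combined)
print(combined_names)
```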

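I would like to rename the columns so that the generic descriptor (e.g. "S.HighTemp") is followed by a period and then the month it belongs to. The desired output is still a table with 16 rows and 170 columns, except that instead of the August block of columns reading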
(Year.7, Month.7, S.HighTemp.7, S.LowTemp.7, S.MeanTemp.7, S.Nedbor.7, S.Nedbordage.7, K.Year.7, K.Month.7, K.HighTemp.7, K.LowTemp.7, K.MeanTemp.7,K.Nedbor.7,K.Nedbordage.7)

I would like it to read

(Year.Aug, Month.Aug, S.HighTemp.Aug, S.LowTemp.Aug, S.MeanTemp.Aug, S.Nedbor.Aug, S.Nedbordage.Aug, K.Year.Aug, K.Month.Aug, K.HighTemp.Aug, K.LowTemp.Aug, K.MeanTemp.Aug,K.Nedbor.Aug,K.Nedbordage.Aug)

and for each 14-variable monthly block to behave the same way.

What I have tried:

names(Weather)<-c(c("Year","Month","S.HighTemp","S.LowTemp","S.MeanTemp",
                    "S.Nedbor","S.Nedbordage","K.Year","K.Month",
                    "K.HighTemp","K.LowTemp","K.MeanTemp","K.Nedbor",
                    "K.Nedbordage")+c(rep(".Jan",times=14)),
                    c("Year","Month","S.HighTemp","S.LowTemp","S.MeanTemp",
                      "S.Nedbor","S.Nedbordage","K.Year","K.Month",
                      "K.HighTemp","K.LowTemp","K.MeanTemp","K.Nedbor",
                      "K.Nedbordage")+c(rep(".Feb",times=14)),
                    c("Year","Month","S.HighTemp","S.LowTemp","S.MeanTemp",
                      "S.Nedbor","S.Nedbordage","K.Year","K.Month",
                      "K.HighTemp","K.LowTemp","K.MeanTemp","K.Nedbor",
                      "K.Nedbordage")+c(rep(".Mar",times=14)),
                    c("Year","Month","S.HighTemp","S.LowTemp","S.MeanTemp",
                      "S.Nedbor","S.Nedbordage","K.Year","K.Month",
                      "K.HighTemp","K.LowTemp","K.MeanTemp","K.Nedbor",
                      "K.Nedbordage")+c(rep(".Apr",times=14)),
                    c("Year","Month","S.HighTemp","S.LowTemp","S.MeanTemp",
                      "S.Nedbor","S.Nedbordage","K.Year","K.Month",
                      "K.HighTemp","K.LowTemp","K.MeanTemp","K.Nedbor",
                      "K.Nedbordage")+c(rep(".May",times=14)),
                    c("Year","Month","S.HighTemp","S.LowTemp","S.MeanTemp",
                      "S.Nedbor","S.Nedbordage","K.Year","K.Month",
                      "K.HighTemp","K.LowTemp","K.MeanTemp","K.Nedbor",
                      "K.Nedbordage")+c(rep(".Jun",times=14)),
                    c("Year","Month","S.HighTemp","S.LowTemp","S.MeanTemp",
                      "S.Nedbor","S.Nedbordage","K.Year","K.Month",
                      "K.HighTemp","K.LowTemp","K.MeanTemp","K.Nedbor",
                      "K.Nedbordage")+c(rep(".Jul",times=14)),
                    c("Year","Month","S.HighTemp","S.LowTemp","S.MeanTemp",
                      "S.Nedbor","S.Nedbordage","K.Year","K.Month",
                      "K.HighTemp","K.LowTemp","K.MeanTemp","K.Nedbor",
                      "K.Nedbordage")+c(rep(".Aug",times=14)),
                    c("Year","Month","S.HighTemp","S.LowTemp","S.MeanTemp",
                      "S.Nedbor","S.Nedbordage","K.Year","K.Month",
                      "K.HighTemp","K.LowTemp","K.MeanTemp","K.Nedbor",
                      "K.Nedbordage")+c(rep(".Sep",times=14)),
                    c("Year","Month","S.HighTemp","S.LowTemp","S.MeanTemp",
                      "S.Nedbor","S.Nedbordage","K.Year","K.Month",
                      "K.HighTemp","K.LowTemp","K.MeanTemp","K.Nedbor",
                      "K.Nedbordage")+c(rep(".Oct",times=14)),
                    c("Year","Month","S.HighTemp","S.LowTemp","S.MeanTemp",
                      "S.Nedbor","S.Nedbordage","K.Year","K.Month",
                      "K.HighTemp","K.LowTemp","K.MeanTemp","K.Nedbor",
                      "K.Nedbordage")+c(rep(".Nov",times=14)),
                    c("Year","Month","S.HighTemp","S.LowTemp","S.MeanTemp",
                      "S.Nedbor","S.Nedbordage","K.Year","K.Month",
                      "K.HighTemp","K.LowTemp","K.MeanTemp","K.Nedbor",
                      "K.Nedbordage")+c(rep(".Dec",times=14)),
                  NAO, NAOPrevYr)

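Unfortunately, this gives me an error saying that I am trying to apply a non-numeric argument to a binary operator. I assume this is because I am combining "+" with character vectors.

I have searched for information on merging strings. The relevant material I found online is too linear in design for what I am trying to do.

For example,

R Programming: Automating Merge of Character Strings adds strings together into a vector of strings. But I want to merge strings across vectors, almost like taking two adjacent columns of variables and months and then removing the cell divisions between them (the list would then be in a top-to-bottom order). Merging vectors of strings in a list in R is really just a rearrangement of the entries within a vector. And
How to merge vectors into a list in R? still claims to merge vectors, but really seems to just append them.

Basically, I am very new to this and still figuring out the whole R thing. If you have any ideas about what I could look up, please let me know. There has to be a better way to do this...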

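1 Answer:

Answer 0 (score: 2)

Indeed, the + operator (which is for numeric data) should not be used when you want to combine strings. Instead, you can use the paste function (type ?paste in R for more information).

Here is an example: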

# The first part of your column names
base_names = c("Year","Month","S.HighTemp","S.LowTemp","S.MeanTemp",
    "S.Nedbor","S.Nedbordage","K.Year","K.Month",
    "K.HighTemp","K.LowTemp","K.MeanTemp","K.Nedbor",
    "K.Nedbordage")

# Paste a month
paste0(base_names, ".Jan")

This returns a vector like this:

[1] "Year.Jan"         "Month.Jan"        "S.HighTemp.Jan"   "S.LowTemp.Jan"    "S.MeanTemp.Jan"   "S.Nedbor.Jan"     "S.Nedbordage.Jan"
 [8] "K.Year.Jan"       "K.Month.Jan"      "K.HighTemp.Jan"   "K.LowTemp.Jan"    "K.MeanTemp.Jan"   "K.Nedbor.Jan"     "K.Nedbordage.Jan"

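To do this for all of your months, you do not need to build the vector of names "by hand" (as you tried in your example). You can automate it in some way. Here are a few different solutions.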

# Create a vector with months
months = c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")

1) Using a for loop

# Create an empty vector to store the new column names
new_names = c()

# Paste each month to the base_names and add it to the new_names vector
for(month in months){
    new_names = c(new_names, paste0(base_names, ".", month))
}

2) Using the sapply function

# This creates a matrix with each base_name and month pasted together
new_names = sapply(months, function(month, base_names){
    paste0(base_names, ".", month)
}, base_names = base_names)

# Convert the result to a vector
new_names = as.vector(new_names)

3) Using expand.grid

# This creates a table with all combinations of base_names and months
new_names = expand.grid(base_names, months)

# Paste the two columns together to return a vector
new_names = paste0(new_names[,1], ".", new_names[,2])
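Whichever solution is used, the resulting 168 names still have to be combined with the two trailing column names and assigned to the data frame. Here is a minimal sketch of that last step, under a few assumptions: an empty 16 x 170 placeholder stands in for the real Weather table, the built-in constant month.abb replaces the hand-typed months vector, and rep() is used as one more way to build the names.

```r
base_names <- c("Year", "Month", "S.HighTemp", "S.LowTemp", "S.MeanTemp",
                "S.Nedbor", "S.Nedbordage", "K.Year", "K.Month",
                "K.HighTemp", "K.LowTemp", "K.MeanTemp", "K.Nedbor",
                "K.Nedbordage")

# month.abb is a base R constant: "Jan", "Feb", ..., "Dec"
months <- month.abb

# Repeat the whole block of base names once per month, and repeat each
# month once per base name, then paste the two aligned vectors together
new_names <- paste0(rep(base_names, times = length(months)), ".",
                    rep(months, each = length(base_names)))

# Placeholder for the real data: 16 rows (years) by 170 columns
Weather <- as.data.frame(matrix(NA_real_, nrow = 16, ncol = 170))

# The last two columns keep their own names, given as strings
names(Weather) <- c(new_names, "NAO", "NAOPrevYr")
```

Note that the two extra names are quoted strings here; passing the NAO and NAOPrevYr objects themselves would insert their values rather than their names.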

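Edit:

To answer the OP's questions in the comments, I have added some (hopefully clear) explanations of how the solutions above work.

Question 1)

In the for loop, the variable month takes each value of the vector months, one at a time. So on each iteration of the loop, month holds a different value. You can try this out by simply printing the variable month: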
for(month in months){ print(month) }

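You can also build an "iterator" variable and then access the i-th element of the months vector. In this case, I create a variable i that takes the values 1 to 12 (the length of months). This approach works, but is unnecessary in your case:

for(i in 1:length(months)){
    print(months[i])
}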

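Question 2)

This is one of the nice things about vector operations in R. Indeed, paste() will "recycle" a vector if it is shorter than the other vectors being pasted. To understand this, look at what happens when you paste two vectors of the same length: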

paste(c("A", "B", "C", "D", "E"), 1:5)
[1] "A 1" "B 2" "C 3" "D 4" "E 5"

Now with vectors of different lengths:

paste(c("A", "B", "C", "D", "E"), 1:2)
[1] "A 1" "B 2" "C 1" "D 2" "E 1"

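See how the values of the second vector are reused until all elements of the first vector have been handled. So if you use just a single value for the second vector, paste() will repeat that value as many times as needed: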

paste(c("A", "B", "C", "D", "E"), 1)
[1] "A 1" "B 1" "C 1" "D 1" "E 1"
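Applied to the column names, recycling means the rep(".Jan", times=14) from the question is not needed. A short sketch with a shortened base_names vector (three names instead of fourteen, purely for illustration) shows that the recycled and the explicitly repeated versions give the same result:

```r
# Shortened for illustration; the real vector has 14 names
base_names <- c("Year", "Month", "S.HighTemp")

# The single string ".Jan" is recycled across all of base_names
recycled <- paste0(base_names, ".Jan")

# Same thing with the repetition written out explicitly
explicit <- paste0(base_names, rep(".Jan", times = length(base_names)))

print(identical(recycled, explicit))
```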

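Question 3)

Basically, the apply() family of functions works a bit like a for loop, so the answer is similar to that of Question 1. sapply() iterates over each element of the months vector and passes it as the first argument to the function (which I called month). As with the for loop, you could use an index instead, but that is not necessary in this case.

It is worth noting that using the apply() functions is often considered the "R" way of looping, since for loops are often slower.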
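To make the sapply() behaviour concrete, here is a small sketch using shortened vectors (three base names and two months, chosen only to keep the output readable). It shows the matrix that sapply() returns, the index-based variant mentioned above, and how as.vector() flattens the matrix one column (month) at a time:

```r
# Shortened for illustration
base_names <- c("Year", "Month", "S.HighTemp")
months <- c("Jan", "Feb")

# One column per month, one row per base name
name_matrix <- sapply(months, function(month) paste0(base_names, ".", month))
print(dim(name_matrix))

# Index-based variant: iterate over positions instead of values
by_index <- sapply(seq_along(months), function(i) paste0(base_names, ".", months[i]))

# Flattening goes down each column in turn, so months stay in blocks
new_names <- as.vector(name_matrix)
print(new_names)
```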