Question

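I'm new to R, so please forgive the basic question.

Here is a Dropbox link to a .csv of my data.

I have country data from 1990 to 2010. My data is wide: each country is one row, and each year has two columns corresponding to two data sources. However, the data is incomplete for some countries. For example, a country row may have NA values in the 1990-1995 columns.

I want to create two columns where, for each country row, the value is the earliest non-NA value of each of the two data types.

I also want to create two more columns where, for each country row, the value is the year of the earliest non-NA value for each of the two data types.

So the final four columns would look like this: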

1990, 12, 1990, 87
1990, 7, 1990, 132
1996, 22, 1996, 173
1994, 14, 1994, 124

Here is my rough, semi-pseudocode attempt at what I imagine the nested for loop would look like:

for i in (number of rows){
  for j in names(df){
    if(is.na(df$j) == FALSE)  df$earliest_year = j
  }
}

How can I generate these four columns? Thanks!

Answer 1

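You mentioned loops, so I tried writing this with a for loop. Later on you may want to try other R functions, such as apply. The code is a bit verbose, but I hope it helps: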

# read data; I'm assuming the first column holds row names and is not important
df <- read.csv("wb_wide.csv", row.names = 1)

# get the column names for the two data sources
# here I use grep to find column names matching the NY and SP patterns;
# if the columns consistently alternate between the two sources,
# you could use a sequence of column numbers instead
dataSourceA <- grep("NY", names(df), value = TRUE)
dataSourceB <- grep("SP", names(df), value = TRUE)

# create new columns in the data set
# if I understand correctly: the first non-NA value from source A
# and from source B, plus the year of each of those values
# (assuming the values are numeric; the years are kept as text)
df$sourceA <- NA_real_
df$yearA <- NA_character_
df$sourceB <- NA_real_
df$yearB <- NA_character_

# loop over the rows
for (i in seq_len(nrow(df))) {

  # position of the first non-NA value among the source A columns
  # (NA if this row has no source A data at all)
  firstA <- which(!is.na(df[i, dataSourceA]))[1]

  # store that value in the sourceA column
  df$sourceA[i] <- if (is.na(firstA)) NA else df[i, dataSourceA[firstA]]

  # gsub strips everything up to the "X" in the column name so only the year is left;
  # you can also skip this and clean the names afterward
  df$yearA[i] <- if (is.na(firstA)) NA else
    gsub(x = dataSourceA[firstA], pattern = "^.*X", replacement = "")

  # same steps for source B
  firstB <- which(!is.na(df[i, dataSourceB]))[1]
  df$sourceB[i] <- if (is.na(firstB)) NA else df[i, dataSourceB[firstB]]
  df$yearB[i] <- if (is.na(firstB)) NA else
    gsub(x = dataSourceB[firstB], pattern = "^.*X", replacement = "")

}

I tried to make the code specific to your example and your desired output.
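If you want to try the apply route mentioned earlier, here is a minimal sketch of the same idea without an explicit row loop. It runs on a small made-up data frame, not your file: the column names (`NY_X1990`, `SP_X1990`, ...), the values, and the helper `first_non_na` are all assumptions for illustration, so adjust the patterns to whatever your real column names look like.

```r
# Toy data; column names and values are made up for illustration
df <- data.frame(
  NY_X1990 = c(12, NA, NA),
  SP_X1990 = c(87, 132, NA),
  NY_X1991 = c(13, 7, NA),
  SP_X1991 = c(90, 140, NA),
  NY_X1992 = c(15, 9, 22),
  SP_X1992 = c(95, 150, 173),
  row.names = c("A", "B", "C")
)

# Hypothetical helper: for one set of columns, return the first
# non-NA value in each row together with the year taken from its name
first_non_na <- function(data, cols) {
  m <- as.matrix(data[, cols, drop = FALSE])
  # position of the first non-NA column per row (NA if the row is all NA)
  idx <- unname(apply(!is.na(m), 1, function(ok) which(ok)[1]))
  data.frame(
    # drop everything up to the "X" so only the year remains
    year  = as.integer(sub("^.*X", "", cols[idx])),
    # pick one cell per row using (row, column) index pairs
    value = m[cbind(seq_len(nrow(m)), idx)]
  )
}

# Select the columns of each source before adding any new columns
colsA <- grep("NY", names(df), value = TRUE)
colsB <- grep("SP", names(df), value = TRUE)

resA <- first_non_na(df, colsA)
resB <- first_non_na(df, colsB)

df$yearA   <- resA$year
df$sourceA <- resA$value
df$yearB   <- resB$year
df$sourceB <- resB$value

print(df[, c("yearA", "sourceA", "yearB", "sourceB")])
```

The logic is the same as in the loop: find the first non-NA position in each row, then use that position once to pick the value and once to pick the column name. It only relies on the columns being ordered by year within each source.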
