How can I manipulate a data.frame to get the desired result? Is it some kind of lag?

Time: 2019-04-08 22:46:27

Tags: r dataframe

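I am trying to manipulate a data table in a way that is hard to describe. My gut tells me this is some kind of lag operation, but I'm not sure, so I'll illustrate with an example.

I could do this easily in Excel, but my dataset is too large for Excel to handle efficiently.

Initial data: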

Column1 <- c("A", "A", "A", "A", "B", "B", "B", "C", "C")
Column2 <- c(201801, 201802, 201803, 201804, 201803, 201804, 201805, 201803, 201804)
Column3 <- c("Active", "Active", "Active", "Closed", "Active", "Active", "CO", "Active", "BK")
Column4 <- c(100, 97, 95, 0, 50, 45, 45, 100, 90)
(dat <- dplyr::tibble(Column1, Column2, Column3, Column4))

# A tibble: 9 x 4
  Column1 Column2 Column3 Column4
  <chr>     <dbl> <chr>     <dbl>
1 A        201801 Active      100
2 A        201802 Active       97
3 A        201803 Active       95
4 A        201804 Closed        0
5 B        201803 Active       50
6 B        201804 Active       45
7 B        201805 CO           45
8 C        201803 Active      100
9 C        201804 BK           90

Desired output:

Col1 <- c("A", "A", "A", "B", "B", "C")
Col2 <- c(201801, 201802, 201803, 201803, 201804, 201803)
Col3 <- c(201802, 201803, 201804, 201804, 201805, 201804)
Col4 <- c("Active", "Active", "Active", "Active", "Active", "Active")
Col5 <- c("Active", "Active", "Closed", "Active", "CO", "BK")
Col6 <- c(100, 97, 95, 50, 45, 100)
Col7 <- c(97, 95, 0, 45, 45, 90)
(dat_desired <- dplyr::tibble(Col1, Col2, Col3, Col4, Col5, Col6, Col7))

# A tibble: 6 x 7
  Col1    Col2   Col3 Col4   Col5    Col6  Col7
  <chr>  <dbl>  <dbl> <chr>  <chr>  <dbl> <dbl>
1 A     201801 201802 Active Active   100    97
2 A     201802 201803 Active Active    97    95
3 A     201803 201804 Active Closed    95     0
4 B     201803 201804 Active Active    50    45
5 B     201804 201805 Active CO        45    45
6 C     201803 201804 Active BK       100    90

By the way, based on some of the suggestions below, I tried the following (but it produced the error shown at the end):

library(readxl)
library(dplyr)
library(zoo)

R <- read_excel("H:/R Programs/R_Data.xlsx")

Column1 = as.vector(R[,1])
Column2 = as.vector(R[,2])
Column3 = as.vector(R[,3])
Column4 = as.vector(R[,4])
Column5 = as.vector(R[,5])

(dat <- dplyr::tibble(Column1, Column2, Column3, Column4, Column5))

# A tibble: 415,533 x 5
Column1$Loan_Key Column2$File_Run_Date Column3$Status Column4$Days Column5$Bal
<dbl>                 <dbl> <chr>                 <dbl>       <dbl>
1         11111111              20180201 ACTIVE                  -19      24472.
2         11111111              20180301 ACTIVE                  -19      24264.
3         11111111              20180401 ACTIVE                  -19      23991.
4         11111111              20180501 BK                      -49      23350.
5         11111111              20180601 BK                      -19      23488.
6         11111111              20180701 BK                      -19      23169.
7         11111111              20180801 BK                      -19      23008.
8         11111111              20180901 BK                      -19      22693.
9         11111111              20181001 BK                      -19      22378.
10        11111111              20181101 BK                      -19      22192.

# ... with 415,523 more rows

(data.frame(rollapply(data = dat, 2, c)) %>% filter(X1 == X2) %>% 
    select(-X2) %>% setNames(paste0("Col", 1:9)))

Error in ncol(xj) : object 'xj' not found
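
The printed header above (Column1$Loan_Key, Column2$File_Run_Date, ...) suggests that each ColumnN is still a one-column tibble rather than a plain vector, so dat ends up with nested data-frame columns. That is probably what rollapply() trips over, although the exact cause of the error is not confirmed here. A minimal sketch of the difference, using a small made-up table in place of the Excel file (the values are hypothetical):

```r
library(dplyr)

# Hypothetical stand-in for the data read from Excel.
R <- tibble(Loan_Key = c(1, 1, 2), Status = c("ACTIVE", "BK", "ACTIVE"))

# Single-bracket indexing on a tibble returns a one-column tibble,
# so each column of `nested` is itself a data frame.
nested <- tibble(Column1 = R[, 1], Column2 = R[, 2])

# Double-bracket indexing extracts the underlying vector.
flat <- tibble(Column1 = R[[1]], Column2 = R[[2]])

is.data.frame(nested$Column1)
is.data.frame(flat$Column1)
```

Passing the data frame returned by read_excel() straight to rollapply(), as Answer 2 does, sidesteps the issue entirely.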

3 Answers:

Answer 0: (score: 2)

Isn't this basically a kind of self-join (i.e., a left_join of dat with dat on "Column1")?
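
The self-join idea can be sketched as follows. This is one possible reading, not the answerer's code: a plain join on "Column1" alone would pair every row with every other row in the same group, so the sketch adds a hypothetical helper column pos (the row position within each group) and joins row k to row k + 1. It uses inner_join rather than left_join so that the last row of each group, which has no successor, is dropped. The suffixes _from and _to are illustrative names.

```r
library(dplyr)

dat <- tibble(
  Column1 = c("A", "A", "A", "A", "B", "B", "B", "C", "C"),
  Column2 = c(201801, 201802, 201803, 201804, 201803, 201804, 201805, 201803, 201804),
  Column3 = c("Active", "Active", "Active", "Closed", "Active", "Active", "CO", "Active", "BK"),
  Column4 = c(100, 97, 95, 0, 50, 45, 45, 100, 90)
)

# Number the rows within each Column1 group
# (assumes the rows are already in date order).
indexed <- dat %>%
  group_by(Column1) %>%
  mutate(pos = row_number()) %>%
  ungroup()

# Shift the position on the right-hand copy so that row k joins row k + 1.
pairs <- inner_join(
  indexed,
  mutate(indexed, pos = pos - 1L),
  by = c("Column1", "pos"),
  suffix = c("_from", "_to")
) %>%
  select(-pos)

pairs
```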

Answer 1: (score: 1)

You can do this with zoo's rollapply() function:

library(plyr)
library(dplyr)
library(zoo)

rollapply(data = dat, 2, c) %>% # returns a character matrix
  data.frame() %>%
  colwise(type.convert, as.is = TRUE)(.) %>% # guesses column classes
  filter(X1 == X2) %>% # only reports the same Column1 values
  select(-X2) %>%
  setNames(paste0("Col", 1:7)) %>%
  as_tibble() # optional

# A tibble: 6 x 7
  Col1    Col2   Col3 Col4   Col5    Col6  Col7
  <chr>  <int>  <int> <chr>  <chr>  <int> <int>
1 A     201801 201802 Active Active   100    97
2 A     201802 201803 Active Active    97    95
3 A     201803 201804 Active Closed    95     0
4 B     201803 201804 Active Active    50    45
5 B     201804 201805 Active CO        45    45
6 C     201803 201804 Active BK       100    90

This assumes the rows are already in the correct order.
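
If the row order cannot be relied on, or if the "lag" intuition from the question is preferred, a grouped dplyr::lead() after an explicit sort is one alternative. This is a sketch that is not taken from any of the answers; it assumes that sorting by Column2 within Column1 gives the intended sequence, and it reuses the Col1 to Col7 names from the desired output.

```r
library(dplyr)

dat <- tibble(
  Column1 = c("A", "A", "A", "A", "B", "B", "B", "C", "C"),
  Column2 = c(201801, 201802, 201803, 201804, 201803, 201804, 201805, 201803, 201804),
  Column3 = c("Active", "Active", "Active", "Closed", "Active", "Active", "CO", "Active", "BK"),
  Column4 = c(100, 97, 95, 0, 50, 45, 45, 100, 90)
)

transitions <- dat %>%
  arrange(Column1, Column2) %>%   # make the ordering explicit
  group_by(Column1) %>%
  mutate(
    Col3 = lead(Column2),         # next period within the same Column1
    Col5 = lead(Column3),         # next status
    Col7 = lead(Column4)          # next balance
  ) %>%
  ungroup() %>%
  filter(!is.na(Col3)) %>%        # the last row of each group has no successor
  select(Col1 = Column1, Col2 = Column2, Col3,
         Col4 = Column3, Col5, Col6 = Column4, Col7)

transitions
```

Because lead() is applied per group, no comparison of neighbouring Column1 values is needed, and the column types are preserved without a type.convert step.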

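Answer 2: (score: 0)

Thanks to Paul (and the rest of you), I think this got me where I needed to go. It took me quite a while to arrive at what ended up being one line of code... :(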

library(readxl)   # read_excel()
library(zoo)      # rollapply()
library(dplyr)    # %>%, filter(), select()
library(writexl)  # write_xlsx()
R <- read_excel("H:/R Programs/R_Data.xlsx")
x <- data.frame(rollapply(data = R, 2, c))  # this was the key step
Transition <- x %>% filter(as.character(X1) == as.character(X2)) %>% select(-X2)  # as.character() was needed because X1 and X2 had different factor levels (for some reason)
write_xlsx(x = Transition, path = "C:/Transition_Matrix_Data.xlsx")