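Reshaping with tidyr

Time: 2017-03-24 18:56:56

Tags: r tidyr

I have data stored as one long string per row. Fields are separated by semicolons, and each column name is separated from its answer by =. I am trying to do the following:

Current structure: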

 Row1: “Column1 = blah1;Column2 = blah2;Column3 = blah3;Column4 = blah4” 
 Row2: “Column1 = blah1;Column2 = blah2;Column3 = blah3;Column4 = blah4”

Convert to ->

Column1|Column2|Column3|Column4
blah1|blah2|blah3|blah4
blah1|blah2|blah3|blah4

I believe the tidyr package in R is the way to go, but I haven't been able to figure it out.

I have tried tidyr, but I keep getting an error:

# CREATE TEST DATA
mydata <- as.data.frame(c("Column1 = blah1; Column2 =  blah2; Column3 = blah3; Column4 = blah4","Column1 = blah1; Column2 =  blah2; Column3 = blah3; Column4 = blah4","Column1 = blah1; Column2 =  blah2; Column3 = blah3; Column4 = blah4"))
names(mydata) <- "TEST"

# Create dummy vector
x <- vector(mode="numeric", length=0)

# Separate by ;
x <- separate(mydata, TEST, x, sep = ";" )

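Any help is greatly appreciated.

2 answers:

Answer 0 (score: 2)

I'll use dplyr pipes to show how to do this step by step, printing the output after each step so you can see how the data structure evolves.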

mydata <- as.data.frame(c("Column1 = blah1; Column2 = blah2; Column3 = blah3; Column4 = blah4","Column1 = blah1; Column2 = blah2; Column3 = blah3; Column4 = blah4","Column1 = blah1; Column2 = blah2; Column3 = blah3; Column4 = blah4")) 
names(mydata) <- "TEST"

This is what it looks like:

> mydata
                                                                TEST
1 Column1 = blah1; Column2 = blah2; Column3 = blah3; Column4 = blah4
2 Column1 = blah1; Column2 = blah2; Column3 = blah3; Column4 = blah4
3 Column1 = blah1; Column2 = blah2; Column3 = blah3; Column4 = blah4

Here are the steps to transform it:

library(dplyr)
library(tidyr)

1) Separate into variables

mydata %>% 
  separate(TEST, into=paste0("Column", 1:4), sep=";")

Output:

          Column1          Column2          Column3          Column4
1 Column1 = blah1  Column2 = blah2  Column3 = blah3  Column4 = blah4
2 Column1 = blah1  Column2 = blah2  Column3 = blah3  Column4 = blah4
3 Column1 = blah1  Column2 = blah2  Column3 = blah3  Column4 = blah4

2) Add a row identifier

mydata %>% 
  separate(TEST, into=paste0("Column", 1:4), sep=";") %>% 
  mutate(row=row.names(mydata))

Output:

          Column1          Column2          Column3          Column4 row
1 Column1 = blah1  Column2 = blah2  Column3 = blah3  Column4 = blah4   1
2 Column1 = blah1  Column2 = blah2  Column3 = blah3  Column4 = blah4   2
3 Column1 = blah1  Column2 = blah2  Column3 = blah3  Column4 = blah4   3

3) Reshape to long format

mydata %>% 
  separate(TEST, into=paste0("Column", 1:4), sep=";") %>% 
  mutate(row=row.names(mydata)) %>% 
  gather("key", "value", -row)

Output:

   row     key            value
1    1 Column1  Column1 = blah1
2    2 Column1  Column1 = blah1
3    3 Column1  Column1 = blah1
4    1 Column2  Column2 = blah2
5    2 Column2  Column2 = blah2
6    3 Column2  Column2 = blah2
7    1 Column3  Column3 = blah3
8    2 Column3  Column3 = blah3
9    3 Column3  Column3 = blah3
10   1 Column4  Column4 = blah4
11   2 Column4  Column4 = blah4
12   3 Column4  Column4 = blah4
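In tidyr 1.0.0 and later, gather() is superseded by pivot_longer(). As a minimal sketch of the same step, assuming a recent tidyr is installed, the long reshape could be written as below. The name `long` is only illustrative, and the rows come out grouped by `row` rather than by `key`, so the ordering differs from the output shown above.

```r
library(dplyr)
library(tidyr)

# Same test data as in this answer
mydata <- data.frame(
  TEST = rep("Column1 = blah1; Column2 = blah2; Column3 = blah3; Column4 = blah4", 3),
  stringsAsFactors = FALSE
)

# Steps 1-3, with pivot_longer() standing in for gather()
long <- mydata %>%
  separate(TEST, into = paste0("Column", 1:4), sep = ";") %>%
  mutate(row = row.names(mydata)) %>%
  pivot_longer(-row, names_to = "key", values_to = "value")
```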

4) Then extract the values

mydata %>% 
  separate(TEST, into=paste0("Column", 1:4), sep=";") %>% 
  mutate(row=row.names(mydata)) %>% 
  gather("key", "value", -row) %>% 
  extract(value, into="value", regex=".* = (.*)$")

Output:

   row     key value
1    1 Column1 blah1
2    2 Column1 blah1
3    3 Column1 blah1
4    1 Column2 blah2
5    2 Column2 blah2
6    3 Column2 blah2
7    1 Column3 blah3
8    2 Column3 blah3
9    3 Column3 blah3
10   1 Column4 blah4
11   2 Column4 blah4
12   3 Column4 blah4
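A side note on the regular expression in step 4: it expects exactly one space on each side of the = sign. The test data in the question has two spaces after some = signs, in which case the captured value would keep a leading space. The sketch below uses base R's sub() on two illustrative strings to show the difference; the `tolerant` pattern is an assumed alternative, not part of the original answer.

```r
# Illustrative strings mirroring one field of the data
single_space <- " Column2 = blah2"
double_space <- " Column2 =  blah2"

# The pattern used in step 4
strict <- ".* = (.*)$"
# Assumed alternative: allow any amount of whitespace after "="
tolerant <- ".*=\\s*(.*)$"

# Replace the whole match with the captured group
res_strict_single   <- sub(strict, "\\1", single_space)   # "blah2"
res_strict_double   <- sub(strict, "\\1", double_space)   # " blah2" (leading space kept)
res_tolerant_double <- sub(tolerant, "\\1", double_space) # "blah2"
```

The tolerant pattern could be passed as the regex argument of extract() in the same way.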

5) If needed, spread it back to wide format

mydata %>% 
  separate(TEST, into=paste0("Column", 1:4), sep=";") %>% 
  mutate(row=row.names(mydata)) %>% 
  gather("key", "value", -row) %>% 
  extract(value, into="value", regex=".* = (.*)$") %>% 
  spread(key, value)

Output:

  row Column1 Column2 Column3 Column4
1   1   blah1   blah2   blah3   blah4
2   2   blah1   blah2   blah3   blah4
3   3   blah1   blah2   blah3   blah4

6) If needed, drop the row identifier

mydata %>% 
  separate(TEST, into=paste0("Column", 1:4), sep=";") %>% 
  mutate(row=row.names(mydata)) %>% 
  gather("key", "value", -row) %>% 
  extract(value, into="value", regex=".* = (.*)$") %>% 
  spread(key, value) %>% 
  select(-row)

Output:

  Column1 Column2 Column3 Column4
1   blah1   blah2   blah3   blah4
2   blah1   blah2   blah3   blah4
3   blah1   blah2   blah3   blah4
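The steps above hard-code the four column names in `into`. As a possible variation, here is a sketch that reads the names from the strings themselves by first splitting each row into one key/value pair per line. It assumes tidyr 1.0.0 or later (for pivot_wider()) and that every field has the form `name = value`; the name `result` is illustrative.

```r
library(dplyr)
library(tidyr)

# Same test data as in this answer
mydata <- data.frame(
  TEST = rep("Column1 = blah1; Column2 = blah2; Column3 = blah3; Column4 = blah4", 3),
  stringsAsFactors = FALSE
)

result <- mydata %>%
  # Keep track of which input row each pair came from
  mutate(row = row_number()) %>%
  # One "name = value" pair per row
  separate_rows(TEST, sep = ";\\s*") %>%
  # Split each pair into key and value, ignoring whitespace around "="
  separate(TEST, into = c("key", "value"), sep = "\\s*=\\s*") %>%
  # Back to one row per input row, one column per key
  pivot_wider(names_from = key, values_from = value) %>%
  select(-row)
```

Because the separators are regular expressions that absorb surrounding whitespace, this variant should also cope with the double spaces in the question's data.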

Answer 1 (score: 1)

Here is a base R attempt:

# Example data provided
data <- data.frame(
 string=c(
  "Column1 = blah1; Column2 = blah2; Column3 = blah3; Column4 = blah4",
  "Column1 = blah1; Column2 = blah2; Column3 = blah3; Column4 = blah4"))

# Helper functions to pick odd and even positions
odd <- function(x) x %% 2 != 0
even <- function(x) x %% 2 == 0

# Split each string on "= " or ";" and remove all extra whitespace
s <- gsub("[[:space:]]", "", unlist(strsplit(as.character(data$string), '= |;')))

# Even positions hold the values: fill one row per input string, no factors.
# This assumes every row has the same columns in the same order.
values <- matrix(s[even(seq_along(s))], nrow = nrow(data), byrow = TRUE)
data <- data.frame(values, stringsAsFactors = FALSE)

# Odd positions hold the column names; take the distinct ones
colnames(data) <- unique(s[odd(seq_along(s))])

data