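I have data stored in one long string (per row). Basically, the fields are delimited by semicolons, and each column/answer pair is separated by =. I am trying to do the following:
Current structure: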
Row1: “Column1 = blah1;Column2 = blah2;Column3 = blah3;Column4 = blah4”
Row2: “Column1 = blah1;Column2 = blah2;Column3 = blah3;Column4 = blah4”
Convert to ->
Column1|Column2|Column3|Column4
blah1|blah2|blah3|blah4
blah1|blah2|blah3|blah4
I believe the tidyr package in R is the way to go, but I have not been able to figure it out.
I have tried tidyr, but I still get an error:
# CREATE TEST DATA
mydata <- as.data.frame(c("Column1 = blah1; Column2 = blah2; Column3 = blah3; Column4 = blah4","Column1 = blah1; Column2 = blah2; Column3 = blah3; Column4 = blah4","Column1 = blah1; Column2 = blah2; Column3 = blah3; Column4 = blah4"))
names(mydata) <- "TEST"
# Create dummy vector
x <- vector(mode="numeric", length=0)
# Separate by ;
x <- separate(mydata, TEST, x, sep = ";" )
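Any help is greatly appreciated.
Answer 0 (score: 2)
I will use dplyr pipes to show how to do this step by step, printing the output after each step so you can see how the data structure evolves.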
mydata <- as.data.frame(c("Column1 = blah1; Column2 = blah2; Column3 = blah3; Column4 = blah4","Column1 = blah1; Column2 = blah2; Column3 = blah3; Column4 = blah4","Column1 = blah1; Column2 = blah2; Column3 = blah3; Column4 = blah4"))
names(mydata) <- "TEST"
This is what it looks like:
> mydata
TEST
1 Column1 = blah1; Column2 = blah2; Column3 = blah3; Column4 = blah4
2 Column1 = blah1; Column2 = blah2; Column3 = blah3; Column4 = blah4
3 Column1 = blah1; Column2 = blah2; Column3 = blah3; Column4 = blah4
Here are the steps to transform it:
library(dplyr)
library(tidyr)
1) Separate by variable
mydata %>%
separate(TEST, into=paste0("Column", 1:4), sep=";")
Output:
Column1 Column2 Column3 Column4
1 Column1 = blah1 Column2 = blah2 Column3 = blah3 Column4 = blah4
2 Column1 = blah1 Column2 = blah2 Column3 = blah3 Column4 = blah4
3 Column1 = blah1 Column2 = blah2 Column3 = blah3 Column4 = blah4
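One detail the printed output hides: because the test strings use "; " (semicolon plus space), splitting on ";" alone leaves a leading space in every piece after the first. The regex in step 4 copes with that, but if you want clean pieces right away, a small sketch (assuming the same test data, rebuilt here so the block runs on its own) is to let the separator regex swallow the whitespace. The names with_space and trimmed are only for illustration.

```r
library(tidyr)

# Same test data as above, rebuilt so this block is self-contained
mydata <- data.frame(
  TEST = rep("Column1 = blah1; Column2 = blah2; Column3 = blah3; Column4 = blah4", 3),
  stringsAsFactors = FALSE
)

# Splitting on ";" only: pieces after the first keep their leading space
with_space <- separate(mydata, TEST, into = paste0("Column", 1:4), sep = ";")

# Splitting on ";" followed by optional whitespace: pieces come out trimmed
trimmed <- separate(mydata, TEST, into = paste0("Column", 1:4), sep = ";\\s*")

with_space$Column2[1]  # " Column2 = blah2"
trimmed$Column2[1]     # "Column2 = blah2"
```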
2) Add a row identifier
mydata %>%
separate(TEST, into=paste0("Column", 1:4), sep=";") %>%
mutate(row=row.names(mydata))
Output:
Column1 Column2 Column3 Column4 row
1 Column1 = blah1 Column2 = blah2 Column3 = blah3 Column4 = blah4 1
2 Column1 = blah1 Column2 = blah2 Column3 = blah3 Column4 = blah4 2
3 Column1 = blah1 Column2 = blah2 Column3 = blah3 Column4 = blah4 3
3) Reshape to long format
mydata %>%
separate(TEST, into=paste0("Column", 1:4), sep=";") %>%
mutate(row=row.names(mydata)) %>%
gather("key", "value", -row)
Output:
row key value
1 1 Column1 Column1 = blah1
2 2 Column1 Column1 = blah1
3 3 Column1 Column1 = blah1
4 1 Column2 Column2 = blah2
5 2 Column2 Column2 = blah2
6 3 Column2 Column2 = blah2
7 1 Column3 Column3 = blah3
8 2 Column3 Column3 = blah3
9 3 Column3 Column3 = blah3
10 1 Column4 Column4 = blah4
11 2 Column4 Column4 = blah4
12 3 Column4 Column4 = blah4
4) Then extract the values
mydata %>%
separate(TEST, into=paste0("Column", 1:4), sep=";") %>%
mutate(row=row.names(mydata)) %>%
gather("key", "value", -row) %>%
extract(value, into="value", regex=".* = (.*)$")
Output:
row key value
1 1 Column1 blah1
2 2 Column1 blah1
3 3 Column1 blah1
4 1 Column2 blah2
5 2 Column2 blah2
6 3 Column2 blah2
7 1 Column3 blah3
8 2 Column3 blah3
9 3 Column3 blah3
10 1 Column4 blah4
11 2 Column4 blah4
12 3 Column4 blah4
5) If needed, spread it back out to wide format
mydata %>%
separate(TEST, into=paste0("Column", 1:4), sep=";") %>%
mutate(row=row.names(mydata)) %>%
gather("key", "value", -row) %>%
extract(value, into="value", regex=".* = (.*)$") %>%
spread(key, value)
Output:
row Column1 Column2 Column3 Column4
1 1 blah1 blah2 blah3 blah4
2 2 blah1 blah2 blah3 blah4
3 3 blah1 blah2 blah3 blah4
6) If needed, drop the row identifier
mydata %>%
separate(TEST, into=paste0("Column", 1:4), sep=";") %>%
mutate(row=row.names(mydata)) %>%
gather("key", "value", -row) %>%
extract(value, into="value", regex=".* = (.*)$") %>%
spread(key, value) %>%
select(-row)
Output:
Column1 Column2 Column3 Column4
1 blah1 blah2 blah3 blah4
2 blah1 blah2 blah3 blah4
3 blah1 blah2 blah3 blah4
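As a possible variation on the pipeline above: gather() and spread() have since been superseded in tidyr by pivot_longer() and pivot_wider(), and the column names do not have to be hard-coded with paste0(). The sketch below is not part of the original answer; it assumes tidyr 1.0.0 or later and that every row is a list of "name = value" pairs. It splits each row into one pair per line, splits each pair into key and value, and then pivots back to wide format. The name result is only for illustration.

```r
library(dplyr)
library(tidyr)

# Same test data as above, rebuilt so this block is self-contained
mydata <- data.frame(
  TEST = rep("Column1 = blah1; Column2 = blah2; Column3 = blah3; Column4 = blah4", 3),
  stringsAsFactors = FALSE
)

result <- mydata %>%
  # keep track of which input row each pair came from
  mutate(row = row_number()) %>%
  # one "name = value" pair per line
  separate_rows(TEST, sep = ";\\s*") %>%
  # split each pair into its name and its value
  separate(TEST, into = c("key", "value"), sep = "\\s*=\\s*") %>%
  # back to wide format, with column names taken from the data
  pivot_wider(names_from = key, values_from = value) %>%
  select(-row)

result
```

Because the keys come from the strings themselves, this should also cope with a different number of columns without editing the code, as long as the "name = value" layout holds.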
Answer 1 (score: 1)
Here is a base R attempt
#Example data provided
data <- data.frame(
string=c(
"Column1 = blah1; Column2 = blah2; Column3 = blah3; Column4 = blah4",
"Column1 = blah1; Column2 = blah2; Column3 = blah3; Column4 = blah4"))
#Modulo function for odd and even numbers
odd <- function(x) x%%2 != 0
even <- function(x) x%%2 == 0
#split string based on condition and remove all extra whitespace
s <- gsub("[[:space:]]", "", unlist(strsplit(as.character(data$string), '= |;')))
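#bind the values (even positions of s) into a df, one row per input string, no factors
#assumes every string has the same number of name = value pairs
data <- data.frame(matrix(s[even(seq_along(s))], nrow=nrow(data), byrow=TRUE),
stringsAsFactors=FALSE)
#rename the columns, extracting the names from the odd positions of s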
colnames(data) <- unique(s[odd(1:length(s))])
data
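For comparison, here is another base R sketch that parses each row on its own instead of flattening everything into one vector, so different values per row are kept apart. It is not from the original answer: parse_row is a hypothetical helper, the second test row with blah5 to blah8 is made up to show that rows stay distinct, and the approach assumes every row has the same names in the same order. The last line prints the pipe-delimited layout asked for in the question.

```r
# Made-up test rows; the second one differs so per-row values are visible
rows <- c(
  "Column1 = blah1; Column2 = blah2; Column3 = blah3; Column4 = blah4",
  "Column1 = blah5; Column2 = blah6; Column3 = blah7; Column4 = blah8"
)

# Hypothetical helper: turn one "name = value; name = value" string
# into a named character vector
parse_row <- function(x) {
  # split into "name = value" pairs, then split each pair at "="
  pairs <- strsplit(strsplit(x, ";", fixed = TRUE)[[1]], "=", fixed = TRUE)
  values <- trimws(vapply(pairs, `[`, character(1), 2))
  names(values) <- trimws(vapply(pairs, `[`, character(1), 1))
  values
}

# One named vector per row, stacked into a data frame
result <- as.data.frame(do.call(rbind, lapply(rows, parse_row)),
                        stringsAsFactors = FALSE)

# Print as pipe-delimited text
write.table(result, sep = "|", quote = FALSE, row.names = FALSE)
```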