My CSV has a column named "features". Its fields contain data in this format:
{""Air conditioning"",""Elevator"",""Smoke detector""}
{""Air conditioning"",""Railing Lights"",""Smoke detector""}
{""Air conditioning"",""Washer"",""Dryer"",""Smoke detector""}
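There are 20,000 records, and these strings appear inside the "features" field in no particular order.
How can I split them into separate columns so that every "Air conditioning" value falls in column 1, every "Elevator" in column 2, and so on?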
a                 b          c               d
air conditioning  elevators  smokedetectors
air conditioning  elevators  smokedetectors  washer
air conditioning  elevators  smokedetectors  washer
Answer 0 (score: 0)
A combination of separate from tidyr and mutate_at from dplyr (to apply gsub):
dfr <- data.frame(features = c('{""Air conditioning"",""Elevator"",""Smoke detector""}',
'{""Air conditioning"",""Railing Lights"",""Smoke detector""}',
'{""Air conditioning"",""Washer"",""Dryer"",""Smoke detector""}'))
library(tidyr)
library(dplyr)
# Remove {,}, and quotes (")
fix_txt <- function(x)gsub("[{]\"|\"|[}]", "", x)
separate(dfr, features, c("a","b","c"), sep=",", extra="merge") %>%
mutate_at(vars(a:c), fix_txt)
This gives a data frame with the three columns a, b and c.
Note that extra fields are merged into the last column (as in the third record, where c holds "Dryer,Smoke detector"); see ?separate for more options.
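The split above is positional, so the same feature can end up in different columns from one record to the next. If the goal is one column per feature regardless of order, the following is a minimal sketch of one way to do it, not the answerer's method. It assumes tidyr 1.0.0 or later (for pivot_wider); the helper column names id and value are made up for illustration.

```r
library(tidyr)
library(dplyr)

dfr <- data.frame(
  features = c('{""Air conditioning"",""Elevator"",""Smoke detector""}',
               '{""Air conditioning"",""Railing Lights"",""Smoke detector""}',
               '{""Air conditioning"",""Washer"",""Dryer"",""Smoke detector""}'),
  stringsAsFactors = FALSE
)

wide <- dfr %>%
  mutate(id = row_number(),                            # hypothetical helper: remembers the source record
         features = gsub("[{}\"]", "", features)) %>%  # strip braces and quotes
  separate_rows(features, sep = ",") %>%               # one row per (record, feature) pair
  mutate(value = features) %>%                         # hypothetical helper: the cell content
  pivot_wider(id_cols = id, names_from = features, values_from = value)

print(wide)
```

Each distinct feature becomes its own column, and records that lack a feature get NA in that column.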
Answer 1 (score: 0)
As mentioned earlier, you can look at the "splitstackshape" package, specifically the cSplit_e function. With it, you can try:
library(splitstackshape)
library(data.table) # provides as.data.table() and :=
cSplit_e(as.data.table(dfr)[, features := (gsub("[{}\"]", "", features))],
"features", ",", mode = "value", type = "character", drop = TRUE)
##    features_Air conditioning features_Dryer features_Elevator features_Railing Lights features_Smoke detector features_Washer
## 1:          Air conditioning             NA          Elevator                      NA          Smoke detector              NA
## 2:          Air conditioning             NA                NA          Railing Lights          Smoke detector              NA
## 3:          Air conditioning          Dryer                NA                      NA          Smoke detector          Washer
Here "dfr" is defined as in @Remko's answer:
dfr <- data.frame(features = c('{""Air conditioning"",""Elevator"",""Smoke detector""}',
'{""Air conditioning"",""Railing Lights"",""Smoke detector""}',
'{""Air conditioning"",""Washer"",""Dryer"",""Smoke detector""}'))
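If installing a package is not an option, a 0/1 presence table per feature can be built with base R alone. This is only a sketch under that assumption; the names parts, all_features, indicator and result are invented for illustration.

```r
dfr <- data.frame(
  features = c('{""Air conditioning"",""Elevator"",""Smoke detector""}',
               '{""Air conditioning"",""Railing Lights"",""Smoke detector""}',
               '{""Air conditioning"",""Washer"",""Dryer"",""Smoke detector""}'),
  stringsAsFactors = FALSE
)

# Strip braces and quotes, then split each record on commas
parts <- strsplit(gsub("[{}\"]", "", dfr$features), ",", fixed = TRUE)

# Every distinct feature seen in the data, in alphabetical order
all_features <- sort(unique(unlist(parts)))

# One row per record, one column per feature: 1 = present, 0 = absent
indicator <- t(sapply(parts, function(p) as.integer(all_features %in% p)))
colnames(indicator) <- all_features

# check.names = FALSE keeps the spaces in column names such as "Air conditioning"
result <- data.frame(indicator, check.names = FALSE)
print(result)
```

Because the columns are derived from the data itself, the order of the strings inside each field does not matter.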