My data contains a column named "states" that holds multiple comma-separated values, as shown below:
test <- structure(list(states = c("WA", "SC", "IN", "IN", "WI", "NY",
"CA, CO, CT, DE, FL, GA, IA, ID, IL, IN, LA, MD, MI, MT, NJ, NV, OH, PA, SC, TX, UT, VA, WA",
"CA, CO, DE, GA, IL, LA, MA, MD, MI, MO, NJ, NV, NY, PA, VA, TX, WA",
"LA, MS", "DC, MD, VA", "AL, GA, NC", "MN WI", "MN WI", "KS, OK, TX",
"KS, MO, OK, TX", "IN, MI, NY, OH, PA", "CO, NE", "CO", "CO, NE",
"AZ, CA, CO, NV, TX, WA", "AZ, CA, NV, TX, UT,WA", "AZ, CA, NV, TX, UT, WA",
"CA, CT, IL, WA", "AL, AZ, CA, IL, MI, MO, MT, NJ, NM, OH, OK, PA, TX, VA, WI",
"AL, NC, TX, VA", "IL, MO, NJ, OH", "AZ, CA, CO, MN", "CO, IA, KY, TX",
"CO, IA, KY, MI, NC, NE, OH, PA, TX", "AR, GA, NC, NM, OK", "AL & WV",
"KY, MN, ND, OH,OR,PA", "KS", "AL, AR, AZ, CA, CT, DE, FL, GA, HI, IA, IL, IN, KS, KY, LA, MA, MD, MI, MN, MO, MS, NC, NE, NJ, NM, NY, OH, OK, OR, PA, RI, SC, TN, TX, UT, VA, WI",
"AR, CO, GA, IL, LA, MI, MN, MS, MT, NC, ND, NE, OH, PA, RI, SC, TX, WI",
"AL, AR, AZ, CA, CT, DE, FL, GA, HI, IA, IL, IN, KS, KY, LA, MA, MD, MI, MN, MO, MS, NC, NE, NJ, NM, NY, OH, OK, OR, PA, RI, SC, TN, TX, UT, VA, WI",
"AL, AR, AZ, CA, CT, DE, FL, GA, HI, IA, IL, IN, KS, KY, LA, MA, MD, MI, MN, MO, MS, NC, NE, NJ, NM, NY, OH, OK, OR, PA, RI, SC, TN, TX, UT, VA, WI",
"AL, AZ, FL, KS, MI, MN, MO, NC, OK, WI", "GA, SC", "CA, CO, FL, IL, KY, NJ, OH, TX, VA",
"AL, AZ, CA, FL, GA, NJ, NM, NV, OH, PA, TX, VA", "ALL 50 STATES",
"ALL 50 STATES", "ALL 50 STATES", "AL, AZ, FL, GA, MI, NJ, NY, OH, OR, PA, TX, UT"
)), .Names = "states", row.names = c(NA, -45L), class = c("tbl_df",
"tbl", "data.frame"))
test
I would like to convert this into a format with one column per state, containing 1 if the state is present in a row and 0 otherwise.
Thanks.
Answer 0 (score: 2)
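This may be what you are looking for. Since you did not provide the expected output, this is my interpretation of your description. The idea is to add an index with rowid_to_column, replace "ALL 50 STATES" with "ALL", split the states on punctuation and whitespace with separate_rows, and then spread the data frame.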
library(tidyverse)
test2 <- test %>%
  # Create index
  rowid_to_column() %>%
  # Replace ALL 50 STATES with ALL
  mutate(states = replace(states, states %in% "ALL 50 STATES", "ALL")) %>%
  # Separate states with punct and space
  separate_rows(states, sep = "[[:punct:][:space:]]+") %>%
  group_by(rowid) %>%
  mutate(Group_ID = row_number(), Present = 1L) %>%
  spread(states, Present, fill = 0L) %>%
  select(-Group_ID)
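As far as I can tell, Group_ID is still in the data as an identifier when spread() runs, so the pipeline above may return one row per state entry rather than one row per original record. If one row per record is the goal, a minimal sketch of a variant using pivot_wider() (assuming tidyr 1.0.0 or later) could look like the following. The demo tibble is a hypothetical stand-in for the question's data, and distinct() is my addition to guard against a state being listed twice in the same row.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Hypothetical stand-in for `test`, with the same kinds of messy separators
demo <- tibble(states = c("WA", "LA, MS", "MN WI", "AL & WV", "ALL 50 STATES"))

wide <- demo %>%
  # Create index so each original row keeps its identity
  rowid_to_column() %>%
  # Replace ALL 50 STATES with ALL
  mutate(states = replace(states, states %in% "ALL 50 STATES", "ALL")) %>%
  # Split on any run of punctuation and/or whitespace
  separate_rows(states, sep = "[[:punct:][:space:]]+") %>%
  # Keep one entry per (row, state) pair
  distinct(rowid, states) %>%
  mutate(Present = 1L) %>%
  # One column per state; absent states are filled with 0
  pivot_wider(names_from = states, values_from = Present,
              values_fill = list(Present = 0L))

wide
```

Here wide has one row per row of demo, with rowid as the only identifier column.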
Answer 1 (score: 1)
First, I load the libraries.
# Load libraries
library(dplyr)
library(tidyr)     # provides replace_na(), used in the last step
library(magrittr)
library(datasets)
Next, I replace "ALL 50 STATES" in your dataset with the abbreviations of all 50 states. (state.abb comes from the datasets package.)
# Change "ALL 50 STATES" to state abbreviations
test %<>%
  mutate(states = ifelse(states == "ALL 50 STATES",
                         paste0(state.abb, collapse = ","),
                         states))
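To see what this replacement does in isolation, here is a small sketch on a hypothetical two-element vector (demo is not part of the original data). It expands "ALL 50 STATES" the same way and then counts how many state codes each element yields once it is split on punctuation and whitespace.

```r
library(datasets)

# Hypothetical input: one ordinary entry and one "ALL 50 STATES" entry
demo <- c("GA, SC", "ALL 50 STATES")

# Same replacement as above, applied to a plain character vector
expanded <- ifelse(demo == "ALL 50 STATES",
                   paste0(state.abb, collapse = ","),
                   demo)

# Number of state codes per element after splitting
n_states <- lengths(strsplit(expanded, "[[:punct:][:space:]]+"))
n_states
```

The second element should yield 50 codes, since state.abb holds the 50 state abbreviations (it does not include DC).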
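Finally, I loop over each element, parse the states with strsplit, count each state with table, bind the results into a single data frame with bind_rows, and then replace NA with zero using replace_na and mutate_all.
# Count assuming state only can appear once per row
do.call(bind_rows, lapply(test$states, function(x)table(strsplit(x, "[[:punct:][:space:]]+")))) %>%
  mutate_all(replace_na, replace = 0)
[N.B. Your dataset is somewhat messy: most states are separated by commas, but some are separated only by a space or an ampersand. I used [[:punct:][:space:]]+ to account for all of these possibilities.]
Here is a sample of the first 10 rows and first 10 states: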