Question

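I'm working with Kaggle's 2017 Data Science Survey data and trying to look at the frequency of majors. People entered double majors in an "X and Y" format (for example, "engineering physics and medicine"). Here is a glimpse of the data: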

> dput(head(major_free, 20))
c("biochemistry", "architecture", "economics", "engineering physics and medicine", 
"chemistry", "software engineering", "image processing research area", 
"applied mathematics", "biochemistry", "mechatronic engineering", 
"sound technology", "major-graphic design; minor- asian studies", 
"english literature and langauge", "bioinformatics", "biotechnology", 
"electronics and communication engineering", "chemistry", "electronic with image processing and ai", 
"geology", "software engineer")

> head(major_free)
[1] "biochemistry"                    
[2] "architecture"                    
[3] "economics"                       
[4] "engineering physics and medicine"
[5] "chemistry"                       
[6] "software engineering"

I want to split each double major into two separate majors (within a data frame). I tried:

strsplit(major_free, "and")

This gives me a long list, and I don't know how to turn it into a data frame that I can use to plot the frequency of majors.

Edit 2017/11/26:

I'd like to do the same thing, but splitting before and after "&", ";", and so on.

> major_free <- unlist(strsplit(major_free, "&"))
Error in strsplit(major_free, "&") : non-character argument
> class("&")
[1] "character"

Oddly, R doesn't seem to read it as a character in strsplit.

Answer 1

How about this?

li <- c("a", "a and b", "b", "b and c")
df <- stringr::str_split_fixed(li, " and ", 2)

Depending on your data, you could add something like df[complete.cases(df), ]. If this doesn't help, please add a reproducible example.

Answer 2

Or this (the only difference from @Christoph's answer is that it uses the base strsplit function):

 li <- c("a", "a and b", "b", "b and c")
 data.frame(majors = unlist(lapply(li, strsplit, " and " )))
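Since the question's goal is a frequency count, here is a minimal base R sketch of one way to get from the split vector to counts. The names `majors`, `freq`, and `freq_df` are illustrative, not from the original post, and the input is the same toy vector used above.

```r
# Toy input, same shape as the example above
li <- c("a", "a and b", "b", "b and c")

# Split on the literal string " and " and flatten into one character vector
majors <- unlist(strsplit(li, " and ", fixed = TRUE))

# Count each major and sort from most to least frequent
freq <- sort(table(majors), decreasing = TRUE)

# Convert the counts to a data frame for plotting or further work
freq_df <- as.data.frame(freq, stringsAsFactors = FALSE)
names(freq_df) <- c("major", "n")
```

From here, something like barplot(freq) should give a quick look at the distribution.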

Answer 3

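The code below takes your basic strsplit approach and the sample data you provided, and gives you a data.frame with a single column in which each double major is split into two observations.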

data.frame(major = unlist(strsplit(major_free, " and ")))

A word of caution, though: based on your sample data alone, you will need to do more parsing, as row 13 shows:

data.frame(major = unlist(strsplit(major_free, " and ")))[13,]
[1] major-graphic design; minor- asian studies

Finally, if you don't want factors, set stringsAsFactors=FALSE:

data.frame(major = unlist(strsplit(major_free, " and ")), stringsAsFactors=FALSE)
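For the edit in the question (splitting on "&", ";" and similar), one possible approach is a single regular expression with alternation. This is only a sketch: the `pattern` below is an assumption about which separators matter, and the "math & statistics" entry is a made-up example. The "non-character argument" error usually means the first argument to strsplit is not a character vector (for instance a factor or a data frame column), which is a guess here since the question does not show the object's class; as.character() guards against the factor case.

```r
# Two entries from the question plus hypothetical ones to exercise "&" and ";"
major_free <- c("engineering physics and medicine",
                "major-graphic design; minor- asian studies",
                "math & statistics",   # hypothetical entry
                "chemistry")

# strsplit() needs a character vector; a factor input raises
# "non-character argument", so coerce first
x <- as.character(major_free)

# Split on the word "and" surrounded by whitespace, or on "&" or ";"
pattern <- "\\s+and\\s+|\\s*[&;]\\s*"
parts <- trimws(unlist(strsplit(x, pattern)))

# Drop any empty pieces and build the one-column data frame
majors <- data.frame(major = parts[nzchar(parts)], stringsAsFactors = FALSE)
```

Note that this also splits single majors whose names contain "and", such as "electronics and communication engineering" in the sample data, so some manual cleanup may still be needed.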

Splitting a string before and after a word in R

3 answers: