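I have two columns. One, "title", contains data such as "What is physics?". The other, "content", contains data such as "Physics is the study of ...". I want the words the two columns have in common, like ['is', 'physics']. This has to be done for every row of the data. How can I achieve this in R?
Regards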
Answer 0 (score: 1)
I think you need something like the following:
df <- data.frame(col1=c('what is physics?', 'set cover is NP hard', 'abstract algebra'),
col2=c('Physics is the study of...', 'Example of an NP complete problem is 3-SAT', 'linear algebra'),
stringsAsFactors = FALSE)
#                   col1                                       col2
# 1     what is physics?                 Physics is the study of...
# 2 set cover is NP hard Example of an NP complete problem is 3-SAT
# 3     abstract algebra                             linear algebra
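# For each row: replace punctuation and digits with spaces, split into words,
# lower-case them, and keep the words that occur in both columns
apply(df, 1, function(x) intersect(tolower(unlist(strsplit(gsub('[^a-zA-Z\\s]+', ' ', x[1]), split=' '))),
                                   tolower(unlist(strsplit(gsub('[^a-zA-Z\\s]+', ' ', x[2]), split=' ')))))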
#[[1]]
#[1] "is" "physics"
#[[2]]
#[1] "is" "np"
#[[3]]
#[1] "algebra"
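If the one-liner is hard to read, the same idea can be split into two small helpers. This is only a sketch, assuming a "word" is a run of ASCII letters; `tokenize` and `common_words` are names made up here, not part of base R. It splits directly on runs of non-letters rather than substituting spaces first, and `mapply(..., SIMPLIFY = FALSE)` is used so the result is always a list, even when every row happens to share the same number of words.

```r
# Hypothetical helper: lower-case a string and split it into words,
# treating every run of non-letter characters as a separator.
tokenize <- function(s) {
  words <- strsplit(tolower(s), "[^a-z]+")[[1]]
  words[nzchar(words)]  # drop the empty string left by a leading separator
}

# Hypothetical helper: words that occur in both strings,
# in the order they first appear in `a`.
common_words <- function(a, b) intersect(tokenize(a), tokenize(b))

df <- data.frame(col1 = c("what is physics?", "set cover is NP hard", "abstract algebra"),
                 col2 = c("Physics is the study of...", "Example of an NP complete problem is 3-SAT", "linear algebra"),
                 stringsAsFactors = FALSE)

# One list element per row; USE.NAMES = FALSE keeps the list unnamed
result <- mapply(common_words, df$col1, df$col2, SIMPLIFY = FALSE, USE.NAMES = FALSE)
print(result)
```

For the example data this should give the same three elements as the output above. If you want the result next to the original data, something like `df$common <- result` stores it as a list column.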