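I have a very large data frame with two columns named sentence1 and sentence2.
I am trying to create a new column containing the words that differ between the two sentences, for example: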
sentence1=c("This is sentence one", "This is sentence two", "This is sentence three")
sentence2=c("This is the sentence four", "This is the sentence five", "This is the sentence six")
df = as.data.frame(cbind(sentence1,sentence2))
My data frame has the following structure:
ID  sentence1               sentence2
1   This is sentence one    This is the sentence four
2   This is sentence two    This is the sentence five
3   This is sentence three  This is the sentence six
My expected result is:
ID  sentence1    sentence2    Expected_Result
1   This is ...  This is ...  one the four
2   This is ...  This is ...  two the five
3   This is ...  This is ...  three the six
In R, I tried splitting the sentences and then getting the elements that differ between the resulting lists, like this:
df$split_Sentence1<-strsplit(df$sentence1, split=" ")
df$split_Sentence2<-strsplit(df$sentence2, split=" ")
df$Dif<-setdiff(df$split_Sentence1, df$split_Sentence2)
However, when I apply setdiff, this approach ...
In Python, I tried using NLTK, first getting the tokens and then extracting the difference between the two lists, like this:
from nltk.tokenize import word_tokenize
df['tokensS1'] = df.sentence1.apply(lambda x: word_tokenize(x))
df['tokensS2'] = df.sentence2.apply(lambda x: word_tokenize(x))
At this point I cannot find a function that gives me the result I need.
I hope you can help me. Thanks.
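Answer 0: (score: 3)
Here is an R solution.
I created an exclusiveWords function that finds the words that appear in only one of the two sets and returns a 'sentence' made up of those words. I wrapped it in Vectorize() so that it can process all rows of the data.frame at once.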
df = as.data.frame(cbind(sentence1,sentence2), stringsAsFactors = F)
exclusiveWords <- function(x, y){
x <- strsplit(x, " ")[[1]]
y <- strsplit(y, " ")[[1]]
u <- union(x, y)
u <- union(setdiff(u, x), setdiff(u, y))
return(paste0(u, collapse = " "))
}
exclusiveWords <- Vectorize(exclusiveWords)
df$result <- exclusiveWords(df$sentence1, df$sentence2)
df
# sentence1 sentence2 result
# 1 This is sentence one This is the sentence four the four one
# 2 This is sentence two This is the sentence five the five two
# 3 This is sentence three This is the sentence six the six three
Answer 1: (score: 3)
Essentially the same as @SymbolixAU's answer, written as an apply function.
df$Dif <- apply(df, 1, function(r) {
paste(setdiff(union (unlist(r[['split_Sentence1']]), unlist(r[['split_Sentence2']])),
intersect(unlist(r[['split_Sentence1']]), unlist(r[['split_Sentence2']]))),
collapse = " ")
})
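For readers comparing the two R answers: this one takes the union minus the intersection, while the answer by @SymbolixAU takes the union of the two one-sided differences. Both describe the symmetric difference. The following is only a small sketch in Python (not part of either R answer) that checks the two formulations agree on the first example row; the variable names are made up for illustration.

```python
# Words of the first example row, split on whitespace
a = set("This is sentence one".split())
b = set("This is the sentence four".split())

union_minus_intersection = (a | b) - (a & b)  # formulation used in this answer
union_of_differences = (a - b) | (b - a)      # formulation used in the other R answer
symmetric_difference = a ^ b                  # Python's built-in set operator

# All three describe the same set of words
print(union_minus_intersection == union_of_differences == symmetric_difference)
print(sorted(symmetric_difference))
```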
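Answer 2: (score: 1)
In Python, you can build a function that treats the words of each sentence as a set and computes the set-theoretic exclusive 'or' (the set of words that are in one sentence but not in the other):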
df.apply(lambda x:
set(word_tokenize(x['sentence1'])) \
^ set(word_tokenize(x['sentence2'])), axis=1)
The result is a pandas Series of sets.
#0 {one, the, four}
#1 {the, two, five}
#2 {the, three, six}
#dtype: object
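Sets are unordered, so the output above does not match the word order of the Expected_Result column in the question. If that order matters, one possible variation is sketched below. It assumes plain whitespace splitting instead of word_tokenize, and exclusive_words is a hypothetical helper name, not something from NLTK or pandas. It lists the words found only in the first sentence, followed by the words found only in the second.

```python
def exclusive_words(s1, s2):
    """Return the words that occur in exactly one of the two sentences, as a string."""
    w1, w2 = s1.split(), s2.split()
    set1, set2 = set(w1), set(w2)
    only1 = [w for w in w1 if w not in set2]  # words only in the first sentence
    only2 = [w for w in w2 if w not in set1]  # words only in the second sentence
    # dict.fromkeys drops repeated words while keeping first-seen order
    return " ".join(dict.fromkeys(only1 + only2))


sentence1 = ["This is sentence one", "This is sentence two", "This is sentence three"]
sentence2 = ["This is the sentence four", "This is the sentence five", "This is the sentence six"]

result = [exclusive_words(a, b) for a, b in zip(sentence1, sentence2)]
print(result)
# ['one the four', 'two the five', 'three the six']
```

With a pandas data frame, the same list comprehension over zip(df.sentence1, df.sentence2) could be assigned to a new column.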