Question

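I have a very large data frame with two columns named sentence1 and sentence2. I am trying to create a new column containing the words that differ between the two sentences, for example: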

sentence1=c("This is sentence one", "This is sentence two", "This is sentence three")
sentence2=c("This is the sentence four", "This is the sentence five", "This is the sentence six")
df = as.data.frame(cbind(sentence1,sentence2))

My data frame has the following structure:

ID    sentence1                    sentence2
 1     This is sentence one         This is the sentence four
 2     This is sentence two         This is the sentence five
 3     This is sentence three       This is the sentence six

My expected result is:

ID    sentence1        sentence2     Expected_Result
 1     This is ...      This is ...   one the four 
 2     This is ...      This is ...   two the five
 3     This is ...      This is ...   three the six

In R, I tried splitting the sentences and then getting the elements that differ between the resulting lists, like this:

df$split_Sentence1<-strsplit(df$sentence1, split=" ")
df$split_Sentence2<-strsplit(df$sentence2, split=" ")
df$Dif<-setdiff(df$split_Sentence1, df$split_Sentence2)

But this approach does not work when applying setdiff ...

In Python, I tried using NLTK, first getting the tokens and then extracting the difference between the two lists, like this:

from nltk.tokenize import word_tokenize

df['tokensS1'] = df.sentence1.apply(lambda x:  word_tokenize(x))
df['tokensS2'] = df.sentence2.apply(lambda x:  word_tokenize(x))

At this point I cannot find a function that gives me the result I need.

I hope you can help me. Thanks.

Answer 1

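Here is an R solution.

I created an exclusiveWords function that finds the words unique to either of the two sets and returns a 'sentence' made up of those words. I wrapped it in Vectorize() so that it can process all rows of the data.frame at once.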

df = as.data.frame(cbind(sentence1,sentence2), stringsAsFactors = FALSE)

exclusiveWords <- function(x, y){
    # split each sentence into a character vector of words
    x <- strsplit(x, " ")[[1]]
    y <- strsplit(y, " ")[[1]]
    u <- union(x, y)
    # words found only in y, followed by words found only in x
    u <- union(setdiff(u, x), setdiff(u, y))
    return(paste0(u, collapse = " "))
}

exclusiveWords <- Vectorize(exclusiveWords)

df$result <- exclusiveWords(df$sentence1, df$sentence2)
df
#                sentence1                 sentence2        result
# 1   This is sentence one This is the sentence four  the four one
# 2   This is sentence two This is the sentence five  the five two
# 3 This is sentence three  This is the sentence six the six three

Answer 2

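Basically the same as @SymbolixAU's answer, written as an apply function. It uses the split_Sentence1 and split_Sentence2 columns created in the question.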

df$Dif  <-  apply(df, 1, function(r) {
  paste(setdiff(union    (unlist(r[['split_Sentence1']]), unlist(r[['split_Sentence2']])),
                intersect(unlist(r[['split_Sentence1']]), unlist(r[['split_Sentence2']]))), 
        collapse = " ")
})
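
For what it's worth, the set expressions used in the answers here all describe the same thing, the symmetric difference of the two word sets. The following is only a small Python sketch to check that on the first row of the example data; the variable names are made up for illustration and it uses plain whitespace splitting rather than a tokenizer.

```python
# Word sets for the first example row (split on whitespace)
x = set("This is sentence one".split())
y = set("This is the sentence four".split())

u = x | y                            # union of both word sets

via_setdiffs = (u - x) | (u - y)     # union(setdiff(u, x), setdiff(u, y))
via_intersect = u - (x & y)          # setdiff(union(x, y), intersect(x, y))
via_xor = x ^ y                      # built-in symmetric difference

# All three formulations give the same set of words
assert via_setdiffs == via_intersect == via_xor
print(sorted(via_xor))
```

Because these are sets, word order and repeated words are lost; only the pasting order in the R code determines how the result string reads.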

Answer 3

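In Python, you can build a function that treats the words in each sentence as a set and computes the set-theoretic exclusive 'or' (the set of words that are in one sentence but not in the other):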

df.apply(lambda x:  
            set(word_tokenize(x['sentence1'])) \
          ^ set(word_tokenize(x['sentence2'])), axis=1)

The result is a pandas Series of sets.

#0     {one, the, four}
#1     {the, two, five}
#2    {the, three, six}
#dtype: object
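
If a string in the order shown in the question's expected result is needed ("one the four" rather than an unordered set), something along these lines might work. This is a rough sketch and not part of the answer above: exclusive_words is a hypothetical helper name, and it assumes simple whitespace splitting instead of word_tokenize, so punctuation stays attached to words.

```python
def exclusive_words(s1, s2):
    """Return the words found in only one of the two sentences, as one string.

    Words unique to s1 come first, then words unique to s2, each in the
    order in which they appear in their sentence.
    """
    w1, w2 = s1.split(), s2.split()
    set1, set2 = set(w1), set(w2)
    only1 = [w for w in w1 if w not in set2]
    only2 = [w for w in w2 if w not in set1]
    # dict.fromkeys drops repeated words while keeping first-seen order
    return " ".join(dict.fromkeys(only1 + only2))


pairs = [
    ("This is sentence one", "This is the sentence four"),
    ("This is sentence two", "This is the sentence five"),
    ("This is sentence three", "This is the sentence six"),
]
results = [exclusive_words(a, b) for a, b in pairs]
print(results)
```

On a data frame this could be applied row by row, for example with df.apply(lambda r: exclusive_words(r['sentence1'], r['sentence2']), axis=1), assuming both columns hold plain strings.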

Extracting the words that differ between two sentences
