Question

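I have customer queries and the corresponding answers from customer service in a CSV file. I need to identify the topic of each query before developing a classification model. After cleaning the documents, I created two document-term matrices, one for the queries and one for the answers. I reduced their size by keeping only the terms that occur more than 400 times across all documents (about 40k queries and answers).

I want to create a data frame that merges these two matrices by row, keeping only the words common to the query and answer DTMs, together with their frequencies. How should I do this in R? I'll then use the highest-frequency words to label the queries.

Any help or suggestions on this approach are much appreciated.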

> str(inspect(dtmaf))
<<DocumentTermMatrix (documents: 38697, terms: 237)>>
Non-/sparse entries: 326124/8845065
Sparsity           : 96%
Maximal term length: 13
Weighting          : term frequency (tf)
Sample             :
   Terms
Docs    booking card change check confirm confirmation email make port wish
12316       3    1      0     0       0            0     0    0    1    1
137         4    1      2     0       1            0     0    0    0    0
17618       4    1      0     0       0            0     0    2    0    2
18082       2    1      3     1       1            0     0    0    1    0
19141       3    0      2     0       1            0     0    0    1    0
21862       2    0      0     0       0            0     0    1    0    0
2756        1    0      2     0       0            0     0    1    0    1
27578       2    1      5     0       0            0     0    0    0    1
30312       4    1      2     0       0            0     0    2    0    2
9019        1    1      1     0       0            0     0    0    0    0
num [1:10, 1:10] 3 4 4 2 3 2 1 2 4 1 ...
- attr(*, "dimnames")=List of 2
 ..$ Docs : chr [1:10] "12316" "137" "17618" "18082" ...
 ..$ Terms: chr [1:10] "booking" "card" "change" "check" ...

> str(inspect(dtmc))
<<DocumentTermMatrix (documents: 38697, terms: 189)>>
Non-/sparse entries: 204107/7109626
Sparsity           : 97%
Maximal term length: 13
Weighting          : term frequency (tf)
Sample             :
       Terms
Docs    booking car change confirmation like number possible reservation return ticket
  14091       0   0      0            0    2      0        0           2      0      0
  18220       6   0      0            2    0      0        0           0      0      0
  20103       1   0      1            0    0      1        0           0      0      0
  20184       0   3      0            0    0      1        0           4      1      0
  21005       3   5      0            1    2      0        1           0      0      0
  24877       0   1      1            0    0      0        0           2      0      1
  26135       0   0      0            0    0      0        0           1      0      0
  28200       5   2      1            0    0      0        0           1      0      0
  2979       12   7      2            0    1      0        0           0      0      0
  680         0   0      1            2    0      1        0           0      0      0
 num [1:10, 1:10] 0 6 1 0 3 0 0 5 12 0 ...
 - attr(*, "dimnames")=List of 2
  ..$ Docs : chr [1:10] "14091" "18220" "20103" "20184" ...
  ..$ Terms: chr [1:10] "booking" "car" "change" "confirmation" ...

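The expected output is a matrix with (237 + 189) terms and 38697 rows. Terms that match across the two DTMs would get a single column each, holding the sum of their frequencies, and non-matching terms would be copied over as they are.

Here is a reproducible example with 10 documents: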

> dput(datamsg)
structure(list(cmessage = c("No answer ?", "Hello  the third  number is . I bought this boarding card immediately after the operator has told me from the previous logbook the number  can not be found in the system. Therefore I request to return money. It was not my fault !", 
"Hi  I forget probably choose items on the   How can I do this now.  ", 
"Hi  I forget probably choose items  How can i do this now.  ", 
"Hello  I tell if I have booked . If not  is it possible and what would it cost? ", 
"First I wanted to transfer fromThen I wanted to know if you can spontaneously postpone the return ", 
"Hello. Does the have an exact address? With this address I do not find it on the navigation. Have an exact address where I can get the ticets. Where I get the Tikets then. Is the automatic chekin. Or do I then mot the tickets to the Chekin. Thank you.  But rather ask more questions. ", 
"Dear  booked everything again. Also the journey through In my previous message  I stated that it is a complete cancellation and I have booked the return trip. I do not intend to pay twice for travel. ", 
"Thank you. When will the new  registration show ?...as it still shows the . Thanks", 
"So my phone number is .Please tell me how this works."), afreply = c("Hello   afraid there is no space on the September. I have also checked but  are all fully booked. Would you like us to check any other dates for you? ", 
"Hello  As far as we can see the booking No was a valid reservation. We have however contacted  and can confirm that administration fee  was refunded back to your card. ", 
"Good afternoon  You are currently booked as high plane. You have requested an amendment to change the height   which will be more expensive. Could you please confirm the actual height of . We have cancelled you amendment request   please submit a new one with an accurate height ofreply to this message. ", 
"Hello  thanks for your message. I have checked and can see you have amended your height to on your booking. If you require any other assistance with your booking  please contact us.", 
"Hello  you booked any  In order to make a change to your booking  kindly send us a amendment request via", 
"Dear Mr. what dimensions  you want to take with you? here is only the possibility to change your departure for a change of booking fee and a possible ticket price difference. The ticket price difference can be requested  if you call us an alternative travel date.", 
"Dear Sir or Madam  we will send you the address ", "Hello  your crossing with was already refunded. As my colleague told you your  with  was still valid. In case you have booked a second ticket with   please send us the new booking reference number  but we cannot guarantee that you will be entitle to a refund. ", 
"if you can authorise us to take the payment from the card you used to make the we can then make the change.", 
"Good morning  we could not reach you by telephone. If you do not have we can send you an invoice via PayPal. The change can not be made until paid. . Do you want to pay the change to 1. "
)), .Names = c("cmessage", "afreply"), class = "data.frame", row.names = c(NA, 
-10L))

corpus1<-Corpus(VectorSource(datamsg$cmessage))
corpus2<-Corpus(VectorSource(datamsg$afreply))
dtmc<-DocumentTermMatrix(corpus1, control = list(weighting = weightTf))
dtmaf<-DocumentTermMatrix(corpus2, control = list(weighting = weightTf))

Answer 1

Your code:

#dput(datamsg)
datamsg <-
        structure(
                list(
                        cmessage = c(
                                "No answer ?",
                                "Hello  the third  number is . I bought this boarding card immediately after the operator has told me from the previous logbook the number  can not be found in the system. Therefore I request to return money. It was not my fault !",
                                "Hi  I forget probably choose items on the   How can I do this now.  ",
                                "Hi  I forget probably choose items  How can i do this now.  ",
                                "Hello  I tell if I have booked . If not  is it possible and what would it cost? ",
                                "First I wanted to transfer fromThen I wanted to know if you can spontaneously postpone the return ",
                                "Hello. Does the have an exact address? With this address I do not find it on the navigation. Have an exact address where I can get the ticets. Where I get the Tikets then. Is the automatic chekin. Or do I then mot the tickets to the Chekin. Thank you.  But rather ask more questions. ",
                                "Dear  booked everything again. Also the journey through In my previous message  I stated that it is a complete cancellation and I have booked the return trip. I do not intend to pay twice for travel. ",
                                "Thank you. When will the new  registration show ?...as it still shows the . Thanks",
                                "So my phone number is .Please tell me how this works."
                        ),
                        afreply = c(
                                "Hello   afraid there is no space on the September. I have also checked but  are all fully booked. Would you like us to check any other dates for you? ",

                                "Hello  As far as we can see the booking No was a valid reservation. We have however contacted  and can confirm that administration fee  was refunded back to your card. ",
                                "Good afternoon  You are currently booked as high plane. You have requested an amendment to change the height   which will be more expensive. Could you please confirm the actual height of . We have cancelled you amendment request   please submit a new one with an accurate height ofreply to this message. ",
                                "Hello  thanks for your message. I have checked and can see you have amended your height to on your booking. If you require any other assistance with your booking  please contact us.",
                                "Hello  you booked any  In order to make a change to your booking  kindly send us a amendment request via",
                                "Dear Mr. what dimensions  you want to take with you? here is only the possibility to change your departure for a change of booking fee and a possible ticket price difference. The ticket price difference can be requested  if you call us an alternative travel date.",
                                "Dear Sir or Madam  we will send you the address ",
                                "Hello  your crossing with was already refunded. As my colleague told you your  with  was still valid. In case you have booked a second ticket with   please send us the new booking reference number  but we cannot guarantee that you will be entitle to a refund. ",
                                "if you can authorise us to take the payment from the card you used to make the we can then make the change.",
                                "Good morning  we could not reach you by telephone. If you do not have we can send you an invoice via PayPal. The change can not be made until paid. . Do you want to pay the change to 1. "
                        )
                ),
                .Names = c("cmessage", "afreply"),
                class = "data.frame",
                row.names = c(NA,-10L)
        )

corpus1<-Corpus(VectorSource(datamsg$cmessage)) # 10 docs
corpus2<-Corpus(VectorSource(datamsg$afreply)) # 10 docs


dtmc<-DocumentTermMatrix(corpus1, control = list(weighting = weightTf))
dtmaf<-DocumentTermMatrix(corpus2, control = list(weighting = weightTf))

My code continues from there:

library(tm)
library(dplyr)
library(tibble)  # provides rownames_to_column() and column_to_rownames()
# rename anonymous document ids:
rownames(dtmc) <- dtmc %>% rownames() %>% as.numeric() %>% sprintf("doc%05d", .)
rownames(dtmaf) <- dtmaf %>% rownames() %>% as.numeric() %>% sprintf("doc%05d", .)

# transpose to term-document matrices (terms in rows, documents in columns)
tdmc <- dtmc  %>% t() 
tdmaf<- dtmaf %>% t()

# introduce new first column "word"
tdmc_df  <- tdmc  %>% as.matrix() %>%  as.data.frame() %>% rownames_to_column( var = "word")
tdmaf_df <- tdmaf %>% as.matrix() %>%  as.data.frame() %>% rownames_to_column( var = "word")

# find common words
tdm_df <- tdmc_df %>% inner_join(tdmaf_df, by=c("word"))  
tdm_df <- tdm_df  %>% arrange(word)
dtm_df <- tdm_df  %>% column_to_rownames("word") %>% t()


# count occurrences of matching words
colSums(dtm_df)

# find nonmatching words (terms in the customer messages that never occur in the replies)
dtm_df_nonmatching <- tdmc_df %>% anti_join(tdmaf_df, by=c("word"))  %>% arrange(word) %>% column_to_rownames("word")

# count occurrences of nonmatching words
rowSums(dtm_df_nonmatching)

Common words, with their counts:

 colSums(dtm_df)
 address     also      and   booked      but      can     card     dear      for     from     have    hello  message 
       4        2        5        7        3       13        3        3        4        2       12        8        3 
    more      new      not   number      pay   please possible  request    still   thanks     that      the     then 
       2        3        8        4        2        5        2        3        2        2        3       32        3 
    this     told   travel      was     what     will     with    would      you 
       6        2        2        5        2        4        7        2       25
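
To go from these separate counts to the single matrix described in the question (one summed column per matching term, non-matching terms copied over), something like the following base R sketch might work. The helper name `combine_counts` and the toy matrices are made up for illustration; with the real data, `as.matrix(dtmc)` and `as.matrix(dtmaf)` would take their place, assuming both contain the same documents in the same row order.

```r
# Hypothetical helper: merge two document-by-term count matrices.
# Matching terms are summed; non-matching terms are copied unchanged.
combine_counts <- function(x, y) {
  # Assumption: both matrices describe the same documents, in the same order.
  stopifnot(identical(rownames(x), rownames(y)))
  terms <- union(colnames(x), colnames(y))
  out <- matrix(0, nrow = nrow(x), ncol = length(terms),
                dimnames = list(rownames(x), terms))
  out[, colnames(x)] <- out[, colnames(x), drop = FALSE] + x
  out[, colnames(y)] <- out[, colnames(y), drop = FALSE] + y
  out
}

# Toy stand-ins for as.matrix(dtmc) and as.matrix(dtmaf)
q <- matrix(c(1, 0, 2, 0, 3, 1), nrow = 2,
            dimnames = list(c("1", "2"), c("booking", "car", "change")))
a <- matrix(c(4, 1, 0, 2), nrow = 2,
            dimnames = list(c("1", "2"), c("booking", "card")))

combined <- combine_counts(q, a)
print(combined)  # "booking" holds the summed counts; the other terms are copied
```

This builds a dense matrix, which should be manageable at roughly 38697 rows by a few hundred terms, but a sparse representation may be preferable for larger vocabularies.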

Answer 2

With the quanteda package, there is a simpler way to do this.

library("quanteda")
packageVersion("quanteda")
# [1] ‘0.99.9’

First, we create the two document-feature matrices and find the terms they have in common:

dfm_c <- dfm(datamsg$cmessage, remove_punct = TRUE)
dfm_af <- dfm(datamsg$afreply, remove_punct = TRUE)
common_feature_names <- intersect(featnames(dfm_c), featnames(dfm_af))

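We can then combine them with cbind(), which (correctly) issues a warning that you now have duplicated features. The second line selects only the common features, and the third line combines features with the same name in the dfm by summing them, which is what you want.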

combined_dfm <- cbind(dfm_c, dfm_af) %>%
    dfm_select(pattern = common_feature_names) %>%
    dfm_compress()
head(combined_dfm)
# Document-feature matrix of: 6 documents, 6 features (41.7% sparse).
# 6 x 6 sparse Matrix of class "dfmSparse"
#        features
# docs    no hello the number is i
#   text1  2     1   1      0  1 1
#   text2  1     2   6      2  1 2
#   text3  0     0   3      0  0 2
#   text4  0     1   0      0  0 3
#   text5  0     2   0      0  1 2
#   text6  0     0   3      0  1 2
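
For readers who want to see what the select-then-compress steps amount to, here is a rough base R analogue on two tiny, made-up count matrices. It is only an illustration of the idea (bind columns, keep the shared names, sum columns with the same name), not how quanteda implements it; the objects `q`, `a`, `kept` and `summed` are hypothetical.

```r
# Toy document-by-term matrices with one shared term ("hello")
q <- matrix(c(1, 0, 2, 0), nrow = 2,
            dimnames = list(c("text1", "text2"), c("hello", "number")))
a <- matrix(c(1, 1, 0, 3), nrow = 2,
            dimnames = list(c("text1", "text2"), c("hello", "card")))

both   <- cbind(q, a)                            # "hello" now appears twice
common <- intersect(colnames(q), colnames(a))    # terms present in both
kept   <- both[, colnames(both) %in% common, drop = FALSE]

# Sum columns that share a name: rowsum() groups rows, so transpose around it
summed <- t(rowsum(t(kept), group = colnames(kept), reorder = FALSE))
print(summed)
```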

If you really want it back in tm, you can convert it with:

convert(combined_dfm, to = "tm")
# <<DocumentTermMatrix (documents: 10, terms: 49)>>
# Non-/sparse entries: 189/301
# Sparsity           : 61%
# Maximal term length: 8
# Weighting          : term frequency (tf)

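Note: You did not say explicitly that you might need to merge dfms with different documents, so I have assumed (from the example) that the documents are the same. If they differ, that is easy to solve too, but it is not specified in the question.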
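
As for the labelling step mentioned in the question (tagging each query with its highest-frequency word), one possible sketch in base R is shown below. It assumes the combined counts are available as an ordinary matrix with documents in rows, for example via `as.matrix()`; the `counts` matrix and its values are invented for illustration.

```r
# Toy combined counts: documents in rows, terms in columns
counts <- matrix(c(3, 0, 1,
                   1, 2, 0,
                   0, 0, 0),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(c("doc1", "doc2", "doc3"),
                                 c("booking", "card", "change")))

# Index of the largest count in each row; ties go to the first such column
top_term <- colnames(counts)[max.col(counts, ties.method = "first")]
names(top_term) <- rownames(counts)

# Documents containing none of the terms get no label
top_term[rowSums(counts) == 0] <- NA
print(top_term)
```

A single top word is a fairly coarse label, so treating it as a starting point for manual review is probably safer than using it directly as the class.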
