Question

I have this document-term matrix from the R package {tm}, which I have coerced with as.matrix. An MWE is here:

> inspect(dtm[1:ncorpus, intersect(colnames(dtm), thai_list)])
<<DocumentTermMatrix (documents: 15, terms: 4)>>
Non-/sparse entries: 17/43
Sparsity           : 72%
Maximal term length: 12
Weighting          : term frequency (tf)

Terms
Docs toyota_suv gmotors_suv ford_suv nissan_suv
1      0       1       0            0
2      0       1       0            0
3      0       1       0            0
4      0       2       0            0
5      0       4       0            0
6      1       1       0            0
7      1       1       0            0
8      0       1       0            0
9      0       1       0            0
10     0       1       0            0

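I need to subset this as.matrix(dtm) so that I get only the documents (rows) that mention toyota_suv and no other vehicle. I got a subset for a single term (toyota_suv) using dmat <- as.matrix(dtm[1:ncorpus, intersect(colnames(dtm), "toyota_suv")]), which works fine. How do I set up a query for documents where toyota_suv is non-zero but the values in all non-toyota_suv columns are zero? I could specify the columns as == 0, but this matrix is generated dynamically. In some markets there may be four cars, in others there may be ten, so I cannot specify the colnames in advance. How can I (dynamically) require all non-toyota_suv columns to be zero, as in all_others == 0? Any help would be greatly appreciated.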

Answer 1

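You can do this by getting the index position of toyota_suv, subsetting dtm to the rows where that column is non-zero, and using negative indexing with the same index variable on all the other columns to ensure they are all zero.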

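Here I have slightly modified your dtm so that the two cases where toyota_suv is non-zero meet the criteria you are looking for (since none in your example did):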

dtm <- read.table(textConnection("
toyota_suv gmotors_suv ford_suv nissan_suv
      0       1       0            0
      0       1       0            0
      0       1       0            0
      0       2       0            0
      0       4       0            0
      1       0       0            0
      1       0       0            0
      0       1       0            0
      0       1       0            0
      0       1       0            0"), header = TRUE)

Then it works:

# get the index of the toyota_suv column
index_toyota_suv <- which(colnames(dtm) == "toyota_suv")

# select only cases where toyota_suv is non-zero and others are zero
dtm[dtm[, "toyota_suv"] > 0 & !rowSums(dtm[, -index_toyota_suv]), ]
##   toyota_suv gmotors_suv ford_suv nissan_suv
## 6          1           0        0          0
## 7          1           0        0          0

Note: this is not really a text analysis question at all, but a question about how to subset matrix objects.
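The same filter can also be sketched on a plain numeric matrix, which is closer to what as.matrix(dtm) returns in the question. The matrix below is made up for illustration, and the names target, others, keep and toyota_only are hypothetical; the other columns are found with setdiff() at run time, so the code does not depend on how many vehicles a given market has.

```r
# Hypothetical example matrix; the column names mirror the question's output.
dmat <- matrix(
  c(0, 1, 0, 0,
    1, 0, 0, 0,
    1, 1, 0, 0,
    2, 0, 0, 0),
  ncol = 4, byrow = TRUE,
  dimnames = list(as.character(1:4),
                  c("toyota_suv", "gmotors_suv", "ford_suv", "nissan_suv"))
)

target <- "toyota_suv"
# every column except the target, computed at run time (no hard-coded names)
others <- setdiff(colnames(dmat), target)

# keep rows where the target term occurs and all other term counts sum to zero;
# drop = FALSE keeps a matrix even when only one other column remains
keep <- dmat[, target] > 0 & rowSums(dmat[, others, drop = FALSE]) == 0
toyota_only <- dmat[keep, , drop = FALSE]
```

Because term frequencies are non-negative, a row sum of zero over the other columns means each of those columns is zero for that document.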

Answer 2

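It would be very helpful if you provided the exact code you are running along with a sample dataset to work with, so that we can replicate your work and provide a working example.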

That said, if I understand your question correctly, you are looking for a way to set all non-Toyota columns to zero. You could try:

df[colnames(df) != "toyota"] <- 0
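As a small sketch of what that assignment does, assuming df is a data frame with a column literally named "toyota" (the data and the other column names here are invented for illustration):

```r
# Hypothetical data frame; "toyota" and the other column names are assumed.
df <- data.frame(toyota  = c(0, 1, 1),
                 gmotors = c(1, 0, 2),
                 ford    = c(0, 0, 3))

# logical index over columns: TRUE for every column not named "toyota",
# so each of those columns is overwritten with zeros
df[colnames(df) != "toyota"] <- 0
```

This overwrites the other columns in place and keeps every row, so it is a different operation from selecting only the rows whose other columns are already zero.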

Subset matrix, addressing colnames
