Question

I have two datasets:

A 10 x 1 matrix of country names:

countries <- structure(
  c("usa", "canada", "france", "england", "brazil",
    "spain", "germany", "italy", "belgium", "switzerland"),
  .Dim = c(10L, 1L))

A 20 x 2 matrix containing the tri_grams and their ids:

tri_grams <- structure(
  c("1", "2", "3", "4", "5", "6", "7", "8", "9", "10",
    "11", "12", "13", "14", "15", "16", "17", "18", "19", "20",
    "mo", "an", "ce", "ko", "we", "ge", "ma", "fi", "br", "ca",
    "gi", "po", "ro", "ch", "ru", "tz", "il", "sp", "ai", "jo"),
  .Dim = c(20L, 2L),
  .Dimnames = list(NULL, c("id", "triGram")))

I want to loop over the countries and, for each row, get the tri_grams that occur in that country name. For example, "brazil" contains "br" and "il". The information I want is: (index of the country (double), id of the tri_gram (char)). So for brazil, with the sample data above, I want to get (5, "9") and (5, "17").

Here is my code with a simple loop. It works and gives the result I expect:

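# Preallocate for the worst case (every tri_gram matches every country).
# Note: storing the character id coerces the whole matrix to character.
res <- matrix(ncol = 2, nrow = nrow(countries) * nrow(tri_grams))
colnames(res) <- c("indexCountry", "idTriGram")
k <- 0

for (i in seq_len(nrow(countries))) {
  for (j in seq_len(nrow(tri_grams))) {
    # grepl() treats the tri_gram as a regular expression
    if (grepl(tri_grams[j, 2], countries[i, 1])) {
      k <- k + 1
      res[k, 1] <- i
      res[k, 2] <- tri_grams[j, 1]
    }
  }
}
# Keep only the filled rows; drop = FALSE keeps a matrix even when k is 1
res <- res[seq_len(k), , drop = FALSE]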

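I would like to get the same result using apply. I actually have a huge dataset; this is only a sample of my real data. When I run the simple loop on my real dataset, it takes a very long time (more than 10 hours). I tried to rewrite it with apply but did not succeed.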

Answer 1

I don't know how fast this actually is, but it is at least a concise way to get the same result.

x <- which(outer(tri_grams[, "triGram"], countries, Vectorize(grepl))[, , 1],
           arr.ind = TRUE)
cbind(country = x[, 2], trigram = x[, 1])

     country trigram
 [1,]       2       2
 [2,]       2      10
 [3,]       3       2
 [4,]       3       3
 [5,]       4       2
 [6,]       5       9
 [7,]       5      17
 [8,]       6      18
 [9,]       6      19
[10,]       7       2
[11,]       7       6
[12,]       7       7
[13,]       9      11
[14,]      10       2
[15,]      10      16
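
Note that the trigram column above is the row position in tri_grams; it matches the id here only because the sample ids happen to be 1 to 20. As a possible alternative, here is a minimal sketch that calls grepl() once per tri_gram (it is already vectorized over the country names) and then looks up the real id. I have not timed it on a large dataset, so treat any speed-up as untested. The use of fixed = TRUE is my assumption that the tri_grams are literal strings rather than regular expressions, and the names hits and idx are just for illustration.

```r
# Sample data from the question.
countries <- structure(
  c("usa", "canada", "france", "england", "brazil",
    "spain", "germany", "italy", "belgium", "switzerland"),
  .Dim = c(10L, 1L))
tri_grams <- structure(
  c(as.character(1:20),
    "mo", "an", "ce", "ko", "we", "ge", "ma", "fi", "br", "ca",
    "gi", "po", "ro", "ch", "ru", "tz", "il", "sp", "ai", "jo"),
  .Dim = c(20L, 2L),
  .Dimnames = list(NULL, c("id", "triGram")))

# One grepl() call per tri_gram, each covering all countries at once.
# Assumption: fixed = TRUE, i.e. tri_grams are matched as literal substrings.
# hits is a logical matrix with one row per country, one column per tri_gram.
hits <- vapply(tri_grams[, "triGram"],
               function(tg) grepl(tg, countries[, 1], fixed = TRUE),
               logical(nrow(countries)))

# Positions of the TRUE cells as (row = country, col = tri_gram) pairs,
# reordered by country and then tri_gram to match the loop's output order.
idx <- which(hits, arr.ind = TRUE)
idx <- idx[order(idx[, "row"], idx[, "col"]), , drop = FALSE]

# Map the tri_gram row position back to its id column.
res <- data.frame(indexCountry = as.numeric(idx[, "row"]),
                  idTriGram = tri_grams[idx[, "col"], "id"],
                  stringsAsFactors = FALSE)
```

A data frame is used for res so that the country index can stay numeric while the id stays character; a matrix would coerce both columns to character.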

Need help converting my simple loop code to faster code using apply (R)
