Searching for a specific form of interaction between the levels of two factors in R

Date: 2013-05-01 15:37:39

Tags: r dataframe interaction r-factor

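I am looking for a way to search for a specific form of interaction between the levels of the factors that make up the rows of a data frame.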

I have a data frame such as the one below, in which each column is an individual and each row is an observation point:

     A   B   C   D   E   G   H   I  
1   NA  "1" "1" "1" "1" NA  "1" "1"
2   "2" "1" "2" "1" "1" NA  "1" "1"
3   "1" "2" "2" "1" "1" "1" "1" "2"
4   "1" "2" "2" "2" "3" "3" "4" "2"
5   "1" "1" "2" "2" "1" "2" "1" "2"

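What I would like to detect is the presence (or absence) of combinations of factor levels such that, alongside x:x' and x:y', the combinations y:x' and y:y' also exist. For example, rows 2 and 3 contain such a combination, as I can see using interaction():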

> df <- structure(c(NA, "2", "1", "1", "1", "1", "1", "2", "2", "1", 
"1", "2", "2", "2", "2", "1", "1", "1", "2", "2", "1", "1", "1", 
"3", "1", NA, NA, "1", "3", "2", "1", "1", "1", "4", "1", "1", 
"1", "2", "2", "2"), .Dim = c(5L, 8L), .Dimnames = list(c("1", 
"2", "3", "4", "5"), c("A", "B", "C", "D", "E", "G", "H", "I")))
> interaction(df["2",],df["3",])
[1] 2.1  1.2  2.2  1.1  1.1  <NA> 1.1  1.2 
Levels: 1.1 2.1 1.2 2.2

and also with:

> as.factor(df["2",]):as.factor(df["3",])
[1] 2:1  1:2  2:2  1:1  1:1  <NA> 1:1  1:2 
Levels: 1:1 1:2 2:1 2:2

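Now, however, I would like the detection to be automatic, so that I can collect in a data frame the labels of every pair of rows in which such a configuration (x:y, x:y', x':y, x':y') is detected. That data frame would serve as the edge list of a network I want to draw (for example, I would want the row "2","3" to be added to the edge list).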

I have already found an elaborate way of doing this with Perl and regular expressions, but I wonder whether there is a way to do it in R without using regexps.

Edit [04/05/2013]

To avoid any ambiguity, here are more details about the configuration I am looking for:

let {x,y,...} be observations of the first row
let {x',y',...} be observations of the second row
given interactions ({x,x'} and {x,y'}), do interactions ({y,x'} and {y,y'}) also exist?

So, to give some examples:

1:1, 1:2, 2:1, 2:2 (rows 2 and 3)

1:1, **2:1**, **2:2**, **3:1**, **3:2**, 4:1 (rows 4 and 5)

would match, but

1:1,1:2,1:3,1:4, 2:2 (rows 3 and 4)

1:1,1:2 (rows 1 and 2)
would not, for example.
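One way to read this definition without regular expressions is to cross-tabulate the two rows and count, for each pair of first-row levels, how many second-row levels they have in common. The following is only a sketch of that idea, not code from the original post; the vectors `row2`, `row3`, `row4` are copied from the example data, and the names `present23`, `shared23`, and so on are made up for illustration.

```r
# Rows 2, 3 and 4 of the example data, typed out as plain character vectors
row2 <- c("2", "1", "2", "1", "1", NA, "1", "1")
row3 <- c("1", "2", "2", "1", "1", "1", "1", "2")
row4 <- c("1", "2", "2", "2", "3", "3", "4", "2")

# 0/1 matrix of the combinations that actually occur;
# table() drops pairs in which either value is NA
present23 <- (unclass(table(row2, row3)) > 0) * 1
present34 <- (unclass(table(row3, row4)) > 0) * 1

# shared[x, y] = number of second-row levels observed with both x and y
shared23 <- present23 %*% t(present23)
shared34 <- present34 %*% t(present34)

shared23["1", "2"]  # 2: levels 1 and 2 of row 2 both occur with 1 and 2 of row 3
shared34["1", "2"]  # 1: levels 1 and 2 of row 3 only have level 2 of row 4 in common
```

Under this reading, a pair of rows matches as soon as some off-diagonal entry of the shared-count matrix is at least 2, which agrees with the examples listed above for rows 2 and 3 (match) and rows 3 and 4 (no match).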

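What I have at the moment is code that does what I want (it mimics my earlier Perl script) but takes a very long time, even though I added a while loop to avoid unnecessary comparisons, and it relies on multiple loops and regular expressions. I am hoping for a less needlessly complex way of doing this comparison. This is how I do it now: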

df <- structure(c(NA, "2", "1", "1", "1", "1", "1", "2", "2", "1", 
"1", "2", "2", "2", "2", "1", "1", "1", "2", "2", "1", "1", "1", 
"3", "1", NA, NA, "1", "3", "2", "1", "1", "1", "4", "1", "1", 
"1", "2", "2", "2"), .Dim = c(5L, 8L), .Dimnames = list(c("1", 
"2", "3", "4", "5"), c("A", "B", "C", "D", "E", "G", "H", "I")))

"myfunction" = function(x){
    TableVariantes = as.matrix(x) ;
    #Creating the edgelist for the network
    edgelist = c(character(0),character(0)); 
    TotalVL = nrow(TableVariantes);

    for(i in 1:(TotalVL-1)){
        VLA = i;
        if(!(i+1) > TotalVL){
            for(j in (i+1):TotalVL){
                VLB = j ;
                problematic.configuration = FALSE;
                #False until proven otherwise
                interactions = interaction(as.factor(TableVariantes[VLA,]):as.factor(TableVariantes[VLB,]),drop=TRUE);
                if(nlevels(as.factor(interactions)) > 3){ 
                    #More than three configurations, let's go
                    #Testing every level of the first variant location
                    for(k in levels(as.factor(TableVariantes[VLA,]))){
                        # We create the regexp we will need afterwards. Impossible to use variables inside a regex in R.
                        searchforK = paste(k,":(.+)",sep="") 
                        if (length(grep(searchforK,levels(interactions), ignore.case = TRUE, perl = TRUE)) > 1){
                           #More than one configuration for this level of the first row
                           #capturing corresponding observations of the second row
                           second.numbers = regexec(searchforK,levels(interactions), ignore.case = TRUE)
                           second.numbers = do.call(rbind,lapply(regmatches(levels(interactions),second.numbers),`[`))
                           #Interactions with first number other than the one we are testing                           
                           invert.matches = grep(searchforK,levels(interactions), ignore.case = TRUE, perl = TRUE, value=TRUE, invert=TRUE)
                           #listing these alternative first numbers
                           alternative.first.numbers = regexec("(.+?):.+",levels(as.factor(invert.matches)), ignore.case = TRUE)
                           alternative.first.numbers = do.call(rbind,lapply(regmatches(levels(as.factor(invert.matches)),alternative.first.numbers),`[`))
                           #testing each alternative first number
                           for(l in levels(as.factor(alternative.first.numbers[,2]))){
                               #variable problems to count the problematic configurations
                               problems = 0 ;
                               #with each alternative second number
                               for(m in levels(as.factor(second.numbers[,2]))){
                                   searchforproblem = paste(l,":",m,sep="");
                                   if(length(grep(searchforproblem,invert.matches,ignore.case = TRUE, perl = TRUE)) > 0){
                                       #if it matches
                                       problems = problems + 1;
                                   }
                                   if(problems > 1){
                                       #If two possibilities at least
                                       problematic.configuration = TRUE;
                                   }
                               }
                           }
                        }
                    }
                }

                if(problematic.configuration){
                    #adding a new edge to the network of conflicts !
                    edgelist = rbind(edgelist,c(rownames(TableVariantes)[VLA],rownames(TableVariantes)[VLB]));
                }
            }
        }
    }
    return(edgelist);
}
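For comparison, here is a regexp-free sketch of the same edge-list construction. It is not the asker's method: it assumes the configuration can be tested by cross-tabulating the two rows with table() and checking whether two levels of the first row share at least two levels of the second. The helper name `shares_two_levels` and the variables `idx`, `keep`, and `edgelist` are hypothetical.

```r
# Example matrix copied from the question
df <- structure(c(NA, "2", "1", "1", "1", "1", "1", "2", "2", "1",
"1", "2", "2", "2", "2", "1", "1", "1", "2", "2", "1", "1", "1",
"3", "1", NA, NA, "1", "3", "2", "1", "1", "1", "4", "1", "1",
"1", "2", "2", "2"), .Dim = c(5L, 8L), .Dimnames = list(c("1",
"2", "3", "4", "5"), c("A", "B", "C", "D", "E", "G", "H", "I")))

# Hypothetical helper: TRUE when x:x', x:y', y:x' and y:y' all occur
shares_two_levels <- function(a, b) {
  # 0/1 matrix of observed combinations (pairs containing NA are dropped)
  present <- (unclass(table(a, b)) > 0) * 1
  # shared[x, y] = number of levels of b seen with both x and y
  shared <- present %*% t(present)
  any(shared[upper.tri(shared)] >= 2)
}

# All unordered pairs of rows, one pair per line
idx <- t(combn(nrow(df), 2))
keep <- apply(idx, 1, function(p) shares_two_levels(df[p[1], ], df[p[2], ]))

# Two-column edge list of row labels
edgelist <- matrix(rownames(df)[idx[keep, , drop = FALSE]], ncol = 2,
                   dimnames = list(NULL, c("from", "to")))
edgelist
```

On the example data this keeps the pairs that the question lists as matching (rows 2 and 3, rows 4 and 5) and rejects the pairs it lists as non-matching (rows 1 and 2, rows 3 and 4). Because the test works on the cross-table rather than on pasted level names, it does not depend on how the levels are spelled.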

1 answer:

Answer 0: (score: 1)

You can use the dput() function to provide sample data with your question.

df <- structure(list(A = c("1", "2", "2", "1", "1", "1", NA, "2", "1", 
    "2"), B = c(NA, "2", "2", "2", "2", "1", "2", "2", "1", NA), 
    C = c("1", "2", "1", "1", NA, "1", NA, "2", "2", NA), D = c(NA, 
    NA, "2", "1", NA, "1", NA, "1", "1", NA), E = c(NA, NA, NA, 
    "2", "1", NA, "1", "2", NA, "1"), H = c(NA, NA, "1", "2", 
    NA, "1", "2", "2", NA, "1"), I = c(NA, NA, NA, NA, NA, NA, 
    "1", "1", NA, "2"), J = c("2", "1", "2", "1", "1", "2", NA, 
    "2", NA, "2"), K = c("1", "1", NA, "1", "2", "1", NA, "1", 
    "1", "1"), O = c("2", "2", "1", "2", "1", "1", NA, "2", "1", 
    NA)), .Names = c("A", "B", "C", "D", "E", "H", "I", "J", 
    "K", "O"), row.names = c(NA, -10L), class = "data.frame")

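I am assuming that you are interested in finding which pairs of observations (rows) have four unique interaction levels across the individuals (columns). Here is one way to do that using for loops.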

# convert your data frame to a matrix
m <- as.matrix(df)

# create another matrix to store the results
N <- dim(m)[1]
levelsmat <- matrix(NA, nrow=(N*N - N)/2, ncol=3, 
    dimnames=list(NULL, c("i", "j", "nlevels")))

# go through all possible pairs of observations
# and record the number of unique interactions
count <- 0
for (i in 1:(N - 1)) {
    for (j in (i + 1):N) {
        count <- count + 1
        int <- interaction(m[i, ], m[j, ], drop = TRUE)
        levelsmat[count, ] <- c(i, j, nlevels(int))
    }
}

# paired observations that had 4 unique interactions
# (drop = FALSE keeps the result a matrix even if only one pair qualifies)
levelsmat[levelsmat[, "nlevels"] == 4, , drop = FALSE]
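The same counting can also be written without an explicit double loop by enumerating the row pairs with combn(). The sketch below applies it to the matrix from the question rather than to the answer's sample data; the names `idx` and `nlev` are made up for illustration.

```r
# Example matrix copied from the question
df <- structure(c(NA, "2", "1", "1", "1", "1", "1", "2", "2", "1",
"1", "2", "2", "2", "2", "1", "1", "1", "2", "2", "1", "1", "1",
"3", "1", NA, NA, "1", "3", "2", "1", "1", "1", "4", "1", "1",
"1", "2", "2", "2"), .Dim = c(5L, 8L), .Dimnames = list(c("1",
"2", "3", "4", "5"), c("A", "B", "C", "D", "E", "G", "H", "I")))

# All unordered pairs of rows, one pair per line
idx <- t(combn(nrow(df), 2))

# Number of distinct observed interaction levels for each pair
nlev <- apply(idx, 1, function(p) {
  nlevels(interaction(df[p[1], ], df[p[2], ], drop = TRUE))
})

levelsmat <- cbind(i = idx[, 1], j = idx[, 2], nlevels = nlev)

# Pairs with exactly 4 distinct interaction levels
levelsmat[levelsmat[, "nlevels"] == 4, , drop = FALSE]
```

Note that filtering on exactly four levels is this answer's assumption. In the question's own examples, rows 4 and 5 are listed as a match with six distinct combinations, and rows 3 and 4 as a non-match with five, so the level count alone may not separate the two cases.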