Question

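I'm trying to use parSapply to optimize my R code. I have xmlfile and X as global variables.

When I don't call clusterExport(cl, "X") and clusterExport(cl, "xmlfile"), I get "object 'xmlfile' not found".

When I do use both clusterExport calls, I get the error "object of type 'externalptr' is not subsettable".

Run the regular, non-parallel way, it works fine.

Can anyone see the problem?

I have this R code: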

require("XML")
library(parallel)


setwd("C:/PcapParser")
# A helper function that enables the dynamic addition of new rows and unseen variables to a data.frame
# field is an R XML leaf-node (capturing a field of a protocol)
# X is the current data.frame to which the feature in field should be added
# rowNum is the row (packet) to which the feature should be added. [must be that rowNum <= dim(X)[1]+1]
addFeature <- function(field, X, rowNum)
{
  # extract xml name and value
  featureName = xmlAttrs(field)['name']

  if (featureName == "")
    featureName = xmlAttrs(field)['show']

  value = xmlAttrs(field)['value']
  if (is.na(value) | value=="")
    value = xmlAttrs(field)['show']

  # attempt to add feature (add rows/cols if necessary)
  if (!(featureName %in% colnames(X))) #we are adding a new feature
  { 
    #Special cases 
    #Bad column names: anything that has the prefix...
    badCols = list("<","Content-encoded entity body"," ","\\?")
    for(prefix in badCols)
      if(grepl(paste("^",prefix,sep=""),featureName))
        return(X) #don't include this new feature

    X[[featureName]]=array(dim=dim(X)[1]) #add this new feature column with NAs
  } 

  if (rowNum > dim(X)[1]) #we are trying to add a new row
  {X = rbind(X,array(dim=dim(X)[2]))} #add row of NA

  X[[featureName]][rowNum] = value 
  return(X)
}

firstLoop<-function(x)
{


  packet = xmlfile[[x]]

  # Iterate over all protocols in this packet
  for (prot in 1:xmlSize(packet))
  {
    protocol = packet[[prot]]
    numFields = xmlSize(protocol)

    # Iterate over all fields in this protocol (recursion is not used since the passed dataset is large)
    if(numFields>0)
      for (f in 1:numFields)
      {
        field = protocol[[f]]

        if (xmlSize(field) == 0) # leaf
          X<<-addFeature(field,X,x)
        else #not leaf xml element (assumption: there are at most three more steps down)
        {
          # Iterate over all sub-fields in this field
          for (ff in 1:xmlSize(field))
          { #extract sub-field data for this packet
            subField = field[[ff]]

            if (xmlSize(subField) == 0) # leaf
              X<<-addFeature(subField,X,x)
            else #not leaf xml element (assumption: there are at most two more steps down)
            {
              # Iterate over all subsub-fields in this field
              for (fff in 1:xmlSize(subField))
              { #extract sub-field data for this packet
                subsubField = subField[[fff]]

                if (xmlSize(subsubField) == 0) # leaf
                  X<<-addFeature(subsubField,X,x)
                else #not leaf xml element (assumption: there is at most one more step down)
                {
                  # Iterate over all subsubsub-fields in this field
                  for (ffff in 1:xmlSize(subsubField))
                  { #extract sub-field data for this packet
                    subsubsubField = subsubField[[ffff]]
                    X<<-addFeature(subsubsubField,X,x) #must be leaf
                  }
                }
              }
            }
          }
        }
      }
  }
}
# Given the path to a pcap file, this function returns a dataframe 'X' 
# with m rows that contain data fields extractable from each of the m packets in XMLcap.
# Wireshark must be installed for this to work
raw_feature_extractor <- function(pcapPath){
  ## Step 1: convert pcap into PDML XML file with wireshark
  #to run this line, wireshark must be installed in the location referenced in the pdmlconv.bat file
  print("Converting pcap file with Wireshark.")
  system(paste("pdmlconv",pcapPath,"tmp.xml"))

  ## Step 2: load XML file into R
  print("Parsing XML.")
  xmlfile<<-xmlRoot(xmlParse("tmp.xml"))

  ## Step 3: Extract all feature into data.frame
  print("Extracting raw features.")
  X <<- data.frame(num=NA) #first feature is packet number


  # Iterate over all packets 
  # Calculate the number of cores
  no_cores <- detectCores() - 1

  # Initiate cluster
  cl <- makeCluster(3)

  parSapply (cl,seq(from=1,to=xmlSize(xmlfile),by=1),firstLoop)




  print("Done.")
 return(X)
}

What am I doing wrong with parSapply? (Maybe it has to do with the global variables.)

Thanks

Answer 1

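I see a few obvious problems with this code. Global variables and functions are not accessible in the parallel environment unless you explicitly export them or call them there. You need to define the addFeature and firstLoop functions inside raw_feature_extractor. When calling functions from pre-existing packages, you should either load the package as part of firstLoop (bad coding!) or call them explicitly using package::function notation (good coding!). I suggest looking at the R documentation on StackOverflow to help you create properly parallelized functions.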
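
As a rough illustration of those points, here is a minimal sketch that uses only the parallel package. The names records, prefix, label_fields and process_record are hypothetical stand-ins, not the code from the question: records plays the role of the parsed packets, and process_record plays the role of firstLoop. It assumes the data can be represented as plain R vectors.

```r
library(parallel)

# Hypothetical stand-ins for the question's data and helper (assumptions, not the original code).
records <- list(c(a = "1", b = "2"), c(a = "3", c = "4"))
prefix  <- "field_"

label_fields <- function(rec) {
  # Package functions are called with package::function notation,
  # so the worker does not depend on a library() call.
  stats::setNames(rec, paste0(prefix, names(rec)))
}

process_record <- function(i) {
  # Return the result instead of assigning with <<-; a global assignment
  # made on a worker does not reach the main session.
  label_fields(records[[i]])
}

cl <- makeCluster(2)
# Copy the variables and helper functions the workers rely on.
# envir = environment() exports from wherever this code is running,
# which also covers the case where it sits inside another function.
clusterExport(cl, c("records", "prefix", "label_fields"), envir = environment())
results <- parLapply(cl, seq_along(records), process_record)
stopCluster(cl)
```

The per-record results come back as a list in the main session, where they can be combined into a single data.frame. Note that the sketch ships plain R vectors to the workers. Objects backed by external pointers, which the 'externalptr' error in the question suggests xmlfile is, generally do not survive being copied to another process, so such data would likely need to be converted to plain R values first or parsed on each worker.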
