提高R脚本效率

时间:2013-04-26 11:51:19

标签: r loops time performance

我正在尝试匹配两个非常大的数据(nsar& crsp)集。我的代码工作得很好,但需要很多时间。我的程序按以下方式工作:

  1. 通过自动收报机尝试匹配(从而控制NAV(只是一个数字)和日期 是一样的)
  2. 通过确切的基金名称(控制资产净值和日期)尝试匹配
  3. 通过最接近的匹配尝试匹配:首先搜索相同的NAV&日期 - >列出并仅考虑那些匹配度量最接近的公司 - >获取剩余的条目并找到最接近的匹配(但匹配距离受到限制)。
  4. 有关如何提高代码效率的任何建议:

    #Go through each nsar entry and try to match with crsp
    trackchanges = sapply(seq_along(nsar$fund),function(x){
    
        #Define vars
        ticker = nsar$ticker[x]
        r_date = format(nsar$r_date[x], "%m%Y")
        nav1 = nsar$NAV_share[x]
        nav2 = nsar$NAV_sshare[x]
        searchbyname = 0
    
        if(nav1 == 0) nav1 = -99
        if(nav2 == 0) nav2 = -99
    
        ########## If ticker is available --> Merge via ticker and NAV
        if(is.na(ticker) == F)
        {
    
            #Look for same NAV, date and ticker
            found = which(crsp$nasdaq == ticker & crsp$caldt2 == r_date & (round(crsp$mnav,1) == round(nav1,1) | round(crsp$mnav,1) == round(nav2,1)))
    
    
            #If nothing found
            if(length(found) == 0)
            {
    
                #Mark that you should search by names
                searchbyname = 1
    
            } else { #ticker found 
    
                        #Record crsp_fundno and that match is found
                nsar$match[x] = 1 
                nsar$crsp_fundno[x] = crsp$crsp_fundno[found[1]] 
                assign("nsar",nsar,envir=.GlobalEnv)
    
                #Return: 1 --> Merged by ticker
                return(1)
            }
    
        } 
    
        ###########
    
        ########### No Ticker available or found --> Exact name matching
        if(is.na(ticker) == T | searchbyname == 1)
        {
    
            #Define vars
            name = tolower(nsar$fund[x])
            company = tolower(nsar$company[x])
    
            #Exact name, date and same NAV
            found = which(crsp$fund_name2 == name & crsp$caldt2 == r_date & (round(crsp$mnav,1) == round(nav1,1) | round(crsp$mnav,1) == round(nav2,1)))
    
    
    
            #If nothing found
            if(length(found) == 0)
            {
    
                #####Continue searching by closest match
    
                    #First search for nav and date to get list of funds
                    allfunds = which(crsp$caldt2 == r_date & (round(crsp$mnav,1) == round(nav1,1) | round(crsp$mnav,1) == round(nav2,1)))
                    allfunds_companies = crsp$company[allfunds]
    
                    #Check if anything found
                    if(length(allfunds) == 0)
                    {
                        #Return: 0 --> nothing found
                        return(0)
                    }
    
                    #Get best match by lev and substring measure for company
                    levmatch = levenstheinMatch(company, allfunds_companies)
                    submatch = substringMatch(company, allfunds_companies)
    
                    allfunds = levmatch[levmatch %in% submatch]
                    allfunds_names = crsp$fund_name2[allfunds]
    
                    #Check if now anything found
                    if(length(allfunds) == 0)
                    {
                        #Mark match (5=Company not found)
                        nsar$match[x] = 5 
    
                        #Save globally
                        assign("nsar",nsar,envir=.GlobalEnv)
    
                        #Return: 5 --> Company not found
                        return(5)
                    }
    
    
                    #Get best match by all measures
                    levmatch = levenstheinMatch(name, allfunds_names)
                    submatch = substringMatch(name, allfunds_names)
    
    
                    #Only accept if identical
                    allfunds = levmatch[levmatch %in% submatch]
                    allfunds_names = crsp$fund_name2[allfunds]
    
    
                    if(length(allfunds) > 0)
                    {
                        #Mark match (3=closest name matching)
                        nsar$match[x] = 3 
    
                        #Add crsp_fundno to nsar data
                        nsar$crsp_fundno[x] = crsp$crsp_fundno[allfunds[1]] 
    
                        #Save globally
                        assign("nsar",nsar,envir=.GlobalEnv)
    
                        #Return 3=closest name matching
                        return(3)
    
                    } else {
                        #return 0 -> no match
                        return(0)
                    }
    
                #####
    
            } else { #If exact name,date,nav found
    
                #Mark match (2=exact name matching)
                nsar$match[x] = 2 
    
                #Add crsp_fundno to nsar data
                nsar$crsp_fundno[x] = crsp$crsp_fundno[found[1]] 
    
                #Return 2=exact name matching
                return(2)
            }
        }   
    
    
    
    
    
    })#End sapply
    

    非常感谢您的帮助! Laurenz

1 个答案:

答案 0 :(得分:2)

脚本太复杂,无法提供完整的答案,但基本问题在第一行

#Go through each nsar entry...

以迭代方式列出问题。 R最适合矢量。

提升您开始计算的sapply中的可矢量化组件。例如,格式化r_date

nsar$r_date_f <- format(nsar$r_date, "%m%Y")

此建议也适用于深埋在代码中的行,例如计算圆形crsp $ mnav应该只在整个列上执行一次

crsp$mnav_r <- round(crsp$mnav, 1)

在适当情况下使用R惯用语,如果“-99”表示缺失值,则使用NA

nav1 <- nsar$NAV_share
nav1[nav1 == -99] <- NA
nasr$nav1 <- nav1

您可能使用的其他软件包中的代码更有可能正确处理NA。

使用完善的R函数进行更复杂的查询。这很棘手,但是如果我正确地阅读您的代码,那么关于“相同NAV,日期和自动收报机”的查询可以使用merge来进行连接,假设这些列是由代码中较早的向量化操作创建的,如

nasr1 <- nasr[!is.na(nasr$ticker), , drop=FALSE]
df0 <- merge(nasr1, crsp, 
             by.x = c("ticker", rdate_r", "nav1_r"),
             by.y = c("nasdaq", "caldt2", "mnav_r"))

这不包括“|”条件,因此需要额外的工作。 plyr,data.table和sqldf软件包(以及其他软件包)的开发部分是为了简化这些类型的操作,因此可能值得研究,因为您可以更加适应矢量化计算。

这很难说,但我认为这三个步骤解决了代码中的主要挑战。