R - grepl over 7 million observations - how to improve efficiency?

Asked: 2016-05-17 14:00:15

Tags: r regex twitter bigdata

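I have hit a dead end with some R code I wrote, and I thought you might know how to make the whole thing feasible, in the sense of making it more efficient.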

So, here is what I am trying to do:

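I have a dataset of tweets with roughly 7 million observations. For now I am not interested in the tweet text or any other metadata, only in the "location" field, so I extracted that data into a new data.frame containing the location variable (a string) and a new, currently empty "isRelevant" variable (a logical). In addition, I have a vector containing text formatted as follows: "Placename(1)|Placename(2)[...]|Placename(i)". What I want to do is grepl each row of the location variable to see whether it matches anything in the placenames vector, and return TRUE in the isRelevant variable if it does and FALSE if it does not.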

To do this, I wrote some R code that basically boils down to this line:

locations.df$isRelevant <- sapply(locations.df$locations, function(s) grepl(grep_places, s, ignore.case = TRUE))

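where grep_places is a character string of possible matches separated by "|", which lets R know that it may match any of the alternatives. I am running this on a remote high-capacity machine that offers more than 2 TB of RAM, using RStudio (R 3.2.0), and I run it with 'pbsapply', which gives me a progress bar. It turns out that this takes absurdly long. So far it has completed about 45% (I started a week ago), and it reports that it still needs more than 270 hours to finish. That is clearly not workable, since I will have to run similar code on much larger datasets in the future. Do you have any idea how I could get this done in a more acceptable time frame, perhaps a day or so (keeping the very powerful machine in mind)?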
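For illustration, a pattern of this form can be assembled from a character vector of place names. This is only a minimal sketch: the vector name `places` and its three entries are assumptions made for the example, not part of the original post.

```r
# Hypothetical vector of place names (a small subset, for illustration only)
places <- c("Acworth NH", "Albany NH", "Alexandria NH")

# Collapse the names into a single alternation pattern separated by "|"
grep_places <- paste(places, collapse = "|")
print(grep_places)

# One string is then tested the same way the line above tests each row
hit <- grepl(grep_places, "albany nh usa", ignore.case = TRUE)
print(hit)
```

If any place name contained regular-expression metacharacters (for example "." or "("), it would need escaping before being pasted into the pattern.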

EDIT

Here is some semi-simulated data to show roughly what I am working with:

print(grep_places)
> grep_places
"Acworth NH|Albany NH|Alexandria NH|Allenstown NH|Alstead NH|Alton NH|Amherst NH|Andover NH|Antrim NH|Ashland NH|Atkinson NH|Auburn NH|Barnstead NH|Barrington NH|Bartlett NH|Bath NH|Bedford NH|Belmont NH|Bennington NH|Benton NH|Berlin NH|Bethlehem NH|Boscawen NH|Bow NH|Bradford NH|Brentwood NH|Bridgewater NH|Bristol NH|Brookfield NH|Brookline NH|Campton NH|Canaan NH|Candia NH|Canterbury NH|Carroll NH|CenterHarbor NH|Charlestown NH|Chatham NH|Chester NH|Chesterfield NH|Chichester NH|Claremont NH|Clarksville NH|Colebrook NH|Columbia NH|Concord NH|Conway NH|Cornish NH|Croydon NH|Dalton NH|Danbury NH|Danville NH|Deerfield NH|Deering NH|Derry NH|Dorchester NH|Dover NH|Dublin NH|Dummer NH|Dunbarton NH|Durham NH|EastKingston NH|Easton NH|Eaton NH|Effingham NH|Ellsworth NH|Enfield NH|Epping NH|Epsom NH|Errol NH|Exeter NH|Farmington NH|Fitzwilliam NH|Francestown NH|Franconia NH|Franklin NH|Freedom NH|Fremont NH|Gilford NH|Gilmanton NH|Gilsum NH|Goffstown NH|Gorham NH|Goshen NH|Grafton NH|Grantham NH|Greenfield NH|Greenland NH|Greenville NH|Groton NH|Hampstead NH|Hampton NH|HamptonFalls NH|Hancock NH|Hanover NH|Harrisville NH|Hart'sLocation NH|Haverhill NH|Hebron NH|Henniker NH|Hill NH|Hillsborough NH|Hinsdale NH|Holderness NH|Hollis NH|Hooksett NH|Hopkinton NH|Hudson NH|Jackson NH|Jaffrey NH|Jefferson NH|Keene NH|Kensington NH|Kingston NH|Laconia NH|Lancaster NH|Landaff NH|Langdon NH|Lebanon NH|Lee NH|Lempster NH|Lincoln NH|Lisbon NH|Litchfield NH|Littleton NH|Londonderry NH|Loudon NH|Lyman NH|Lyme NH|Lyndeborough NH|Madbury NH|Madison NH|Manchester NH|Marlborough NH|Marlow NH|Mason NH|Meredith NH|Merrimack NH|Middleton NH|Milan NH|Milford NH|Milton NH|Monroe NH|MontVernon NH|Moultonborough NH|Nashua NH|Nelson NH|NewBoston NH|NewCastle NH|NewDurham NH|NewHampton NH|NewIpswich NH|NewLondon NH|Newbury NH|Newfields NH|Newington NH|Newmarket NH|Newport NH|Newton NH|NorthHampton NH|Northfield NH|Northumberland NH|Northwood NH|Nottingham NH|Orange NH|Orford NH|Ossipee NH|Pelham 
NH|Pembroke NH|Peterborough NH|Piermont NH|Pittsburg NH|Pittsfield NH|Plainfield NH|Plaistow NH|Plymouth NH|Portsmouth NH|Randolph NH|Raymond NH|Richmond NH|Rindge NH|Rochester NH|Rollinsford NH|Roxbury NH|Rumney NH|Rye NH|Salem NH|Salisbury NH|Sanbornton NH|Sandown NH|Sandwich NH|Seabrook NH|Sharon NH|Shelburne NH"


head(location.df, n=20)
>                      location isRelevant
1                      London         NA
2      Orleans village VT USA         NA
3                   The World         NA
4               D M V Towson          NA
5   Playa del Sol Solidaridad         NA
6  Beautiful Downtown Burbank         NA
7                        <NA>         NA
8                          US         NA
9             Gaithersburg Md         NA
10                       <NA>         NA
11                California          NA
12                       Indy         NA
13                    Florida         NA
14              exsnaveen com         NA
15                 Houston TX         NA
16                   Tweaking         NA
17                Phoenix AZ          NA
18              Malibu Ca USA         NA
19           Hermosa Beach CA         NA
20             California USA         NA

Thanks in advance, everyone; I really appreciate any help with this.

1 Answer:

Answer 0 (score: 3)

grepl is a vectorized function, so there is no need to apply a loop over it. Have you tried this?

#dput(location.df)    
location.df<-structure(list(location = structure(c(12L, 14L, 17L, 5L, 16L, 
          2L, 1L, 19L, 8L, 1L, 3L, 11L, 7L, 6L, 10L, 18L, 15L, 13L, 9L, 
         4L), .Label = c("<NA>", "Beautiful Downtown Burbank", "California", 
          "California USA", "D M V Towson", "exsnaveen com", "Florida", 
          "Gaithersburg Md", "Hermosa Beach CA", "Houston TX", "Indy", 
          "London", "Malibu Ca USA", "Orleans village VT USA", "Phoenix AZ", 
          "Playa del Sol Solidaridad", "The World", "Tweaking", "US"), class = "factor"), 
           isRelevant = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
           NA, NA, NA, NA, NA, NA, NA, NA, NA, NA)), .Names = c("location", 
          "isRelevant"), row.names = c(NA, -20L), class = "data.frame")

#grep_places with places in the test data
grep_places<-"Gaithersburg Md|Phoenix AZ"

location.df$isRelevant[grepl(grep_places, location.df$location, ignore.case = TRUE)]<-TRUE
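One detail of the indexed assignment above is worth checking: it only touches the rows that match, so non-matching rows keep their initial NA rather than becoming FALSE. The sketch below shows this on a toy data frame; `toy.df` and its values are hypothetical and only mirror the structure of the test data.

```r
# Toy data frame (hypothetical values, same two columns as the test data)
toy.df <- data.frame(location = c("London", "Phoenix AZ", NA),
                     isRelevant = NA,
                     stringsAsFactors = FALSE)
grep_places <- "Gaithersburg Md|Phoenix AZ"

# Indexed assignment: only the matching rows are set to TRUE
toy.df$isRelevant[grepl(grep_places, toy.df$location, ignore.case = TRUE)] <- TRUE
after_assign <- toy.df$isRelevant
print(after_assign)   # non-matching rows are still NA at this point

# If FALSE is wanted for those rows, the remaining NAs can be filled in
toy.df$isRelevant[is.na(toy.df$isRelevant)] <- FALSE
print(toy.df$isRelevant)
```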

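Or, following David Arenburg's comment, a slightly faster implementation, which also writes FALSE for non-matching rows instead of leaving them NA: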

location.df$isRelevant <- grepl(grep_places, location.df$location, ignore.case = TRUE)
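To get a feel for the difference, the element-by-element call from the question can be timed against the single vectorized call on simulated data. This is a rough sketch under stated assumptions: `sim_locations`, the sample size of 20,000, and the five sampled values are made up for the example, and timings will vary by machine and pattern length.

```r
# Simulated data: 20,000 location strings drawn from a few hypothetical values
set.seed(1)
sim_locations <- sample(
  c("London", "Gaithersburg Md", "Phoenix AZ", "Houston TX", NA),
  size = 20000, replace = TRUE
)
grep_places <- "Gaithersburg Md|Phoenix AZ"

# Element-by-element approach, as in the question (one grepl call per row)
t_loop <- system.time(
  res_loop <- sapply(sim_locations,
                     function(s) grepl(grep_places, s, ignore.case = TRUE),
                     USE.NAMES = FALSE)
)

# Single vectorized call over the whole vector
t_vec <- system.time(
  res_vec <- grepl(grep_places, sim_locations, ignore.case = TRUE)
)

# Both approaches should produce the same logical vector
same <- identical(res_loop, res_vec)
print(same)

# Elapsed times for comparison
print(t_loop["elapsed"])
print(t_vec["elapsed"])
```

Note that grepl returns FALSE for NA inputs, so the result contains no missing values in either version.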