How do I summarise text strings with ddply?

Time: 2014-02-12 13:59:17

Tags: regex string r plyr

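I want to summarise a text variable in my data. So far I have come up with the code below, but I don't know what code I need to summarise the text variable.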

dat <- ddply(df, .(user), summarise, links = as.numeric(length(user)),
               group = as.factor(substr(user,1,3)),
               classif = # some code for creating a string of the variable "classifications", 
               unique.classif = # some code for creating a numeric variable of the unique occurrences for the variable "classifications"
               )

I probably need some regular expressions, but I have no experience with them. Any suggestions?

For user fxf20, classif should be: "voeding","ziekte","sociale aspecten","erfelijkheid","ziekte","beweging"

In this case, unique.classif should yield the number 5 (there are six classifications, but "ziekte" is used twice).
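To make the expected count concrete, here is a minimal base-R check using only the six labels quoted above. The variable names (classif, n_total, n_unique) are illustrative and not part of the original code.

```r
# The six classifications listed above for this user
classif <- c("voeding", "ziekte", "sociale aspecten",
             "erfelijkheid", "ziekte", "beweging")

n_total  <- length(classif)          # all labels, duplicates included
n_unique <- length(unique(classif))  # distinct labels only

print(c(total = n_total, unique = n_unique))
```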

df <- structure(list(user = structure(c(1L, 1L, 1L, 2L, 2L, 2L, 2L, 
2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 5L, 
5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 
6L), .Label = c("frt01", "frt02", "frt03", "frt04", "frt05", 
"frt06"), class = "factor"), url = structure(c(3L, 9L, 5L, 1L, 
6L, 8L, 7L, 14L, 9L, 2L, 3L, 6L, 5L, 4L, 13L, 5L, 7L, 5L, 14L, 
12L, 2L, 9L, 2L, 4L, 3L, 9L, 11L, 6L, 8L, 5L, 7L, 14L, 3L, 10L, 
9L, 6L, 5L, 7L, 14L, 4L), .Label = c("http://dl.dropbox.com/u/57047/google2.html", 
"http://mens-en-gezondheid.infonu.nl/ziekten/18079-risicos-van-overgewicht-en-de-gevolgen-van-obesitas.html", 
"http://nl.wikipedia.org/wiki/Obesitas", "http://www.erfelijkheid.nl/node/325", 
"http://www.hely.net/oorzaken.html", "http://www.nisb.nl/kennisplein-sport-bewegen/dossiers/bewegen-en-overgewicht/oorzaken-obesitas.html", 
"http://www.novarum.nl/eetproblemen/obesitas/signalen-en-gevolgen", 
"http://www.obesitas.azdamiaan.be/nl/index.aspx?n=280", "http://www.obesitaskliniek.nl/over-obesitas/", 
"http://www.obesitasvereniging.nl/", "http://www.sagbmaagband.nl/minder-gewicht/morbideobesitas.html", 
"http://www.tweestedenziekenhuis.nl/script/Template_SubsubMenu.asp?PageID=1144&SSMID=1247", 
"http://www.vgz.nl/zorg-en-gezondheid/ziektes-en-aandoeningen/obesitas", 
"http://www.zuivelengezondheid.nl/?pageID=332"), class = "factor"), 
    date = structure(c(4L, 5L, 9L, 2L, 3L, 6L, 7L, 9L, 10L, 11L, 
    14L, 15L, 15L, 16L, 16L, 4L, 5L, 6L, 7L, 7L, 10L, 13L, 5L, 
    7L, 11L, 12L, 13L, 16L, 17L, 1L, 1L, 1L, 5L, 6L, 6L, 7L, 
    8L, 9L, 9L, 10L), .Label = c("20-04-2012 10:10:00", "20-04-2012 9:09:37", 
    "20-04-2012 9:09:42", "20-04-2012 9:09:43", "20-04-2012 9:09:44", 
    "20-04-2012 9:09:45", "20-04-2012 9:09:46", "20-04-2012 9:09:47", 
    "20-04-2012 9:09:48", "20-04-2012 9:09:49", "20-04-2012 9:09:50", 
    "20-04-2012 9:09:52", "20-04-2012 9:09:54", "20-04-2012 9:09:55", 
    "20-04-2012 9:09:56", "20-04-2012 9:09:57", "20-04-2012 9:09:59"
    ), class = "factor"), classifications = structure(c(23L, 
    22L, 17L, 1L, 19L, 10L, 10L, 10L, 10L, 10L, 9L, 9L, 9L, 9L, 
    9L, 2L, 4L, 3L, 4L, 4L, 4L, 5L, 7L, 8L, 13L, 12L, 11L, 8L, 
    8L, 8L, 7L, 7L, 15L, 14L, 21L, 17L, 18L, 20L, 6L, 16L), .Label = c(";", 
    "1;2;", "2watishetprecies;1hoekrijgtiemandhet;", "3gevolgen;", 
    "3gevolgen;2watishetprecies;", "gevolg;sterfte;ziekte;", 
    "gevolgen;", "hoekrijgjeobesitas;", "obesitas;", "obesitas;opdracht;", 
    "obestitas;", "obestitas;gevolgen;", "obestitas;hoekrijgjeobesitas;gevolgen;", 
    "ontstaan;risicos;overgewicht;", "oorzaak;oorzaken;gevolg;", 
    "oorzaak;ziekte;type;", "oorzaken;", "oorzaken;bmi;overgewicht;", 
    "opdracht;obesitas;", "signalen;gevolg;", "wanneer;bmi;", 
    "watisobesitas;gevolgen;", "watisobesitas;oorzaken;"), class = "factor")), .Names = c("user", 
"url", "date", "classifications"), class = "data.frame", row.names = c(NA, 
-40L))

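I tested @SvenHohenstein's solution, but it does not work entirely as expected. For some users I get the correct output, but for others I don't. For one user the output is not correct: 12;,3gevolgen,2watishetprecies1hoekrijgtiemandhet;,3gevolgen2watishetprecies;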

As you can see, there are still ";" characters in the classif string. Also, 12 should be 1,2. At the moment I can't see what the problem is.

1 answer:

Answer 0: (score: 2)

If you want to exclude ";" from classif and unique.classif, you can use:

ddply(df, .(user), summarise, links = length(user),
      group = as.factor(substr(user[1], 1, 3)),
      classif = paste(unlist(strsplit(as.character(classifications), ";"))
                      [unlist(strsplit(as.character(classifications), ";")) != ""], 
                      collapse = ","),
      unique.classif = length(unique(unlist(strsplit(as.character(classifications), ";"))
                                     [unlist(strsplit(as.character(classifications), ";")) != ""]))
)
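The call above repeats the same strsplit expression four times. As a possible simplification, the splitting and the removal of empty pieces could be moved into a small helper. The sketch below is not part of the original answer: the function name split_labels and the example vector are hypothetical, and it uses only base R so that it runs without plyr.

```r
# Hypothetical helper (not from the original answer): split ";"-separated
# labels into one flat character vector and drop empty pieces.
split_labels <- function(x, sep = ";") {
  parts <- unlist(strsplit(as.character(x), sep, fixed = TRUE))
  parts[nzchar(parts)]
}

# Small stand-in for one user's `classifications` column (assumed values,
# modelled on the factor levels in the question's data)
example <- factor(c("1;2;", "3gevolgen;", ";", "3gevolgen;2watishetprecies;"))

labels         <- split_labels(example)
classif        <- paste(labels, collapse = ",")  # comma-separated string
unique_classif <- length(unique(labels))         # number of distinct labels

print(classif)
print(unique_classif)

# With plyr loaded and split_labels defined at top level, the same helper
# could presumably be used inside the summarise call, for example:
# ddply(df, .(user), summarise,
#       classif = paste(split_labels(classifications), collapse = ","),
#       unique.classif = length(unique(split_labels(classifications))))
```

Using fixed = TRUE treats the separator as a literal string rather than a regular expression, which is enough here because the separator is a plain ";".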