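I want to summarise a text variable in my data. So far I have come up with the code below, but I don't know what code I need to summarise the text variable.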
dat <- ddply(df, .(user), summarise, links = as.numeric(length(user)),
group = as.factor(substr(user,1,3)),
classif = # some code for creating a string of the variable "classifications",
unique.classif = # some code for creating a numeric variable of the unique occurrences for the variable "classifications"
)
I probably need some regular expressions, but I have no experience with them. Any suggestions?
For user fxf20, classif should be: "voeding","ziekte","sociale aspecten","erfelijkheid","ziekte","beweging", and unique.classif should yield the number 5 (there are six classifications, but "ziekte" is used twice).
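To make the target concrete, the expected values for this one user can be written out in R. This is only an illustration of the desired result, not a solution for the whole data frame; the vector name `fxf20_classif` is hypothetical and simply holds the six values listed above.

```r
# Hypothetical vector holding the six classifications listed above for user fxf20
fxf20_classif <- c("voeding", "ziekte", "sociale aspecten",
                   "erfelijkheid", "ziekte", "beweging")

# classif: all classifications pasted into a single string
classif <- paste(fxf20_classif, collapse = ",")

# unique.classif: number of distinct classifications ("ziekte" counts once)
unique.classif <- length(unique(fxf20_classif))
unique.classif
# [1] 5
```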
df <- structure(list(user = structure(c(1L, 1L, 1L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 5L,
5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 6L, 6L, 6L, 6L, 6L, 6L, 6L,
6L), .Label = c("frt01", "frt02", "frt03", "frt04", "frt05",
"frt06"), class = "factor"), url = structure(c(3L, 9L, 5L, 1L,
6L, 8L, 7L, 14L, 9L, 2L, 3L, 6L, 5L, 4L, 13L, 5L, 7L, 5L, 14L,
12L, 2L, 9L, 2L, 4L, 3L, 9L, 11L, 6L, 8L, 5L, 7L, 14L, 3L, 10L,
9L, 6L, 5L, 7L, 14L, 4L), .Label = c("http://dl.dropbox.com/u/57047/google2.html",
"http://mens-en-gezondheid.infonu.nl/ziekten/18079-risicos-van-overgewicht-en-de-gevolgen-van-obesitas.html",
"http://nl.wikipedia.org/wiki/Obesitas", "http://www.erfelijkheid.nl/node/325",
"http://www.hely.net/oorzaken.html", "http://www.nisb.nl/kennisplein-sport-bewegen/dossiers/bewegen-en-overgewicht/oorzaken-obesitas.html",
"http://www.novarum.nl/eetproblemen/obesitas/signalen-en-gevolgen",
"http://www.obesitas.azdamiaan.be/nl/index.aspx?n=280", "http://www.obesitaskliniek.nl/over-obesitas/",
"http://www.obesitasvereniging.nl/", "http://www.sagbmaagband.nl/minder-gewicht/morbideobesitas.html",
"http://www.tweestedenziekenhuis.nl/script/Template_SubsubMenu.asp?PageID=1144&SSMID=1247",
"http://www.vgz.nl/zorg-en-gezondheid/ziektes-en-aandoeningen/obesitas",
"http://www.zuivelengezondheid.nl/?pageID=332"), class = "factor"),
date = structure(c(4L, 5L, 9L, 2L, 3L, 6L, 7L, 9L, 10L, 11L,
14L, 15L, 15L, 16L, 16L, 4L, 5L, 6L, 7L, 7L, 10L, 13L, 5L,
7L, 11L, 12L, 13L, 16L, 17L, 1L, 1L, 1L, 5L, 6L, 6L, 7L,
8L, 9L, 9L, 10L), .Label = c("20-04-2012 10:10:00", "20-04-2012 9:09:37",
"20-04-2012 9:09:42", "20-04-2012 9:09:43", "20-04-2012 9:09:44",
"20-04-2012 9:09:45", "20-04-2012 9:09:46", "20-04-2012 9:09:47",
"20-04-2012 9:09:48", "20-04-2012 9:09:49", "20-04-2012 9:09:50",
"20-04-2012 9:09:52", "20-04-2012 9:09:54", "20-04-2012 9:09:55",
"20-04-2012 9:09:56", "20-04-2012 9:09:57", "20-04-2012 9:09:59"
), class = "factor"), classifications = structure(c(23L,
22L, 17L, 1L, 19L, 10L, 10L, 10L, 10L, 10L, 9L, 9L, 9L, 9L,
9L, 2L, 4L, 3L, 4L, 4L, 4L, 5L, 7L, 8L, 13L, 12L, 11L, 8L,
8L, 8L, 7L, 7L, 15L, 14L, 21L, 17L, 18L, 20L, 6L, 16L), .Label = c(";",
"1;2;", "2watishetprecies;1hoekrijgtiemandhet;", "3gevolgen;",
"3gevolgen;2watishetprecies;", "gevolg;sterfte;ziekte;",
"gevolgen;", "hoekrijgjeobesitas;", "obesitas;", "obesitas;opdracht;",
"obestitas;", "obestitas;gevolgen;", "obestitas;hoekrijgjeobesitas;gevolgen;",
"ontstaan;risicos;overgewicht;", "oorzaak;oorzaken;gevolg;",
"oorzaak;ziekte;type;", "oorzaken;", "oorzaken;bmi;overgewicht;",
"opdracht;obesitas;", "signalen;gevolg;", "wanneer;bmi;",
"watisobesitas;gevolgen;", "watisobesitas;oorzaken;"), class = "factor")), .Names = c("user",
"url", "date", "classifications"), class = "data.frame", row.names = c(NA,
-40L))
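I tested @SvenHohenstein's solution, but it does not work entirely as expected. For some users I get the correct output, but for others I don't. For example, this output for one user is not correct: 12;,3gevolgen,2watishetprecies1hoekrijgtiemandhet;,3gevolgen2watishetprecies;
As you can see, there are still ; characters in the classif string. In addition, 12 should be 1 and 2. At the moment I can't see what the problem is.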
Answer 0 (score: 2)
If you want to exclude ; from classif and unique.classif, you can use:
library(plyr)  # provides ddply()

ddply(df, .(user), summarise, links = length(user),
      group = as.factor(substr(user[1], 1, 3)),
      # split every entry on ";", drop empty strings, and paste the rest together
      classif = paste(unlist(strsplit(as.character(classifications), ";"))
                      [unlist(strsplit(as.character(classifications), ";")) != ""],
                      collapse = ","),
      # count the distinct non-empty classifications
      unique.classif = length(unique(unlist(strsplit(as.character(classifications), ";"))
                                     [unlist(strsplit(as.character(classifications), ";")) != ""]))
)
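The code above repeats the same strsplit() call four times. As a rough sketch of the same idea, the splitting can be moved into a small helper so each user's entries are split only once. Note the assumptions: `split_classif` is a hypothetical helper name, `sample_df` is a small made-up sample in the same shape as df (not the real data), and the grouping uses base R split() instead of ddply(), so the output layout differs slightly from the ddply() result.

```r
# Hypothetical helper: split a vector of ";"-separated strings into single tokens
split_classif <- function(x) {
  tokens <- unlist(strsplit(as.character(x), ";", fixed = TRUE))
  tokens[tokens != ""]  # drop empty tokens, e.g. from a bare ";"
}

# Small made-up sample in the same shape as df (assumption, not the real data)
sample_df <- data.frame(
  user = c("frt04", "frt04", "frt04", "frt01"),
  classifications = c("1;2;", "3gevolgen;", "3gevolgen;2watishetprecies;", ";"),
  stringsAsFactors = TRUE
)

# Split once per user, then derive both summaries from the same token vector
tokens_by_user <- lapply(split(sample_df$classifications, sample_df$user),
                         split_classif)

dat <- data.frame(
  user = names(tokens_by_user),
  classif = vapply(tokens_by_user, paste, character(1), collapse = ","),
  unique.classif = vapply(tokens_by_user,
                          function(tok) length(unique(tok)), integer(1)),
  row.names = NULL
)
dat
```

In this sample, "1;2;" is split into the separate tokens 1 and 2, and the user whose only entry is ";" ends up with an empty classif string and a count of 0.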