I have a file with two columns: one holds the content type of HTTP objects, such as text/html, application/rar, etc., and the other holds the size in bytes.
Content Type Size
video/x-flv 100
image/jpeg 150
text/html 160
application/octet-stream 200
application/x-shockwave-flash ...
text/plain
application/x-javascript
text/xml
text/css
text/html; charset=utf-8
application/x-javascript; charset=utf-8 ...
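As you can see, the same content type comes in many variants, for example application/x-javascript and application/x-javascript; charset=utf-8. So I want to create another column that classifies them more generally: those two would both simply become web/javascript, and so on.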
Content Type Size Category
video/x-flv 100 web/video
image/jpeg 150 web/image
text/html 160 web/html
application/octet-stream 200 web/binary
application/x-shockwave-flash ... web/flash
text/plain web/plaintext
application/x-javascript web/javascript
video/x-msvideo web/video
text/xml web/xml
text/css web/css
text/html; charset=utf-8 web/html
video/quicktime web/video
application/x-javascript; charset=utf-8 web/javascript
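How can I do this in R? I assume I need some kind of regular expression?
Answer 0 (score: 3)
There are several ways to simplify your variable. Here I will use the stringr package for the string manipulation: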
R> library(stringr)
First, copy the content type variable into a new character variable:
R> d <- data.frame(type=c("video/x-flv", "image/jpeg","video/x-msvideo", "application/x-javascript; charset=utf-8", "application/x-javascript"))
R> d$type2 <- as.character(d$type)
This simply gives you:
type type2
1 video/x-flv video/x-flv
2 image/jpeg image/jpeg
3 video/x-msvideo video/x-msvideo
4 application/x-javascript; charset=utf-8 application/x-javascript; charset=utf-8
5 application/x-javascript application/x-javascript
Then you can work on the new variable. You can manually replace one value with another:
R> d$type2[d$type2 == "video/x-flv"] <- "video"
R> d
type type2
1 video/x-flv video
2 image/jpeg image/jpeg
3 video/x-msvideo video/x-msvideo
4 application/x-javascript; charset=utf-8 application/x-javascript; charset=utf-8
5 application/x-javascript application/x-javascript
You can use regular expression matching to replace every value that matches, for example every value containing "video":
R> d$type2[str_detect(d$type2, ".*video.*")] <- "video"
R> d
type type2
1 video/x-flv video
2 image/jpeg image/jpeg
3 video/x-msvideo video
4 application/x-javascript; charset=utf-8 application/x-javascript; charset=utf-8
5 application/x-javascript application/x-javascript
Or you can use regexp replacement to clean up some values. For example, to remove everything from the ";" onwards in your content types:
R> d$type2 <- str_replace(d$type2, ";.*$", "")
R> d
type type2
1 video/x-flv video
2 image/jpeg image/jpeg
3 video/x-msvideo video
4 application/x-javascript; charset=utf-8 application/x-javascript
5 application/x-javascript application/x-javascript
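Pay attention to the order of your instructions, because the result depends heavily on it.
Answer 1 (score: 1)
If it has to be done manually, you can assign the factor levels to the appropriate categories. In this example, I group the first 13 letters of the alphabet as "1" and the second half of the alphabet as "2".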
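To make the point about ordering concrete, here is a minimal base R sketch. The vector `types` and the result names `a` and `b` are made up for illustration, and `sub()` stands in for `str_replace()`; it applies the same two steps (an exact-match replacement and stripping the "; charset=..." suffix) in both orders.

```r
# Illustrative input: the same content type with and without a charset suffix.
types <- c("application/x-javascript; charset=utf-8",
           "application/x-javascript")

# Order A: exact-match replacement first, then strip everything from ";" on.
a <- types
a[a == "application/x-javascript"] <- "web/javascript"
a <- sub(";.*$", "", a)

# Order B: strip the suffix first, then do the exact-match replacement.
b <- sub(";.*$", "", types)
b[b == "application/x-javascript"] <- "web/javascript"

print(a)  # the charset variant was stripped too late to be recoded
print(b)  # both variants end up in the same category
```

In order A the charset variant does not equal the exact string when the replacement runs, so it is only stripped afterwards and never recoded; in order B both rows end up as "web/javascript".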
> x <- as.factor(sample(letters, 100, replace = TRUE))
> x
[1] d n p n k l a x c n v p l o u e z m y x t r q b l n y s s m d u l l a d k
[38] t a p x s g w i p l b s o t b s h h v c b j o p h f j m v d r m x o d l e
[75] l f y l u e w f e e o s w s m v a z q l a t f z x s
Levels: a b c d e f g h i j k l m n o p q r s t u v w x y z
> levels(x)
[1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s"
[20] "t" "u" "v" "w" "x" "y" "z"
> levels(x) <- c(rep(1, 13), rep(2, 13))
> x
[1] 1 2 2 2 1 1 1 2 1 2 2 2 1 2 2 1 2 1 2 2 2 2 2 1 1 2 2 2 2 1 1 2 1 1 1 1 1
[38] 2 1 2 2 2 1 2 1 2 1 1 2 2 2 1 2 1 1 2 1 1 1 2 2 1 1 1 1 2 1 2 1 2 2 1 1 1
[75] 1 1 2 1 2 1 2 1 1 1 2 2 2 2 1 2 1 2 2 1 1 2 1 2 2 2
Levels: 1 2
> levels(x)
[1] "1" "2"
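If your example contained (only) these factor levels, e.g.:
"video/x-flv" "image/jpeg" "video/x-msvideo" "application/x-javascript; charset=utf-8"
...you would recode your levels like this (the new labels are matched to levels(obj) by position, so they must be given in the order that levels(obj) reports):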
levels(obj) <- c("web/video", "web/image", "web/video", "web/javascript")
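Because positional recoding depends on the level order (which `factor()` sorts by default rather than keeping order of appearance), a less fragile variant is to pass a named list to `levels<-`, which maps old levels to new ones by name. The following is a sketch under that assumption; `obj` here is a small made-up factor, not the asker's data.

```r
# Made-up factor holding the four example content types.
obj <- factor(c("video/x-flv", "image/jpeg", "video/x-msvideo",
                "application/x-javascript; charset=utf-8"))

# Named-list form: each name is a new level, each value lists the old
# levels that should be merged into it. Old levels not mentioned become NA.
levels(obj) <- list(
  "web/video"      = c("video/x-flv", "video/x-msvideo"),
  "web/image"      = "image/jpeg",
  "web/javascript" = "application/x-javascript; charset=utf-8"
)

print(as.character(obj))
print(levels(obj))
```

This form does not care how the original levels happen to be ordered, at the cost of having to spell out every old level you want to keep.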
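Answer 2 (score: 1)
Suppose DF is our data frame. Define a regular expression re that matches the strings of interest, then use strapply from the gsubfn package to extract them, prefixing each one with "web/". In the strapply statement we convert DF[[1]] to character in case it is a factor rather than a character vector. The NULL entries are the ones that did not match, so we assume those are "web/binary". Finally, expand every "plain" to "plaintext":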
> library(gsubfn)
> re <- "(video|image|html|flash|plain|javascript|xml|css).*"
> short <- strapply(as.character(DF[[1]]), re, ~ paste("web", x, sep = "/"))
> DF$short <- sapply(short, function(x) if (is.null(x)) "web/binary" else x)
> DF$short <- sub("plain", "plaintext", DF$short)
> DF
Content short
1 video/x-flv web/video
2 image/jpeg web/image
3 text/html web/html
4 application/octet-stream web/binary
5 application/x-shockwave-flash web/flash
6 text/plain web/plaintext
7 application/x-javascript web/javascript
8 video/x-msvideo web/video
9 text/xml web/xml
10 text/css web/css
11 text/html; charset=utf-8 web/html
12 video/quicktime web/video
13 application/x-javascript; charset=utf-8 web/javascript
There is more information on the gsubfn package at http://gsubfn.googlecode.com.
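If installing gsubfn is not an option, roughly the same mapping can be sketched in base R. This is an illustrative alternative rather than the approach above: the function name `categorise` and the `rules` lookup are hypothetical, the keywords are taken from the regular expression `re`, and the first keyword that matches wins.

```r
# Hypothetical helper: map raw content types to broad "web/..." categories.
categorise <- function(type) {
  # Work on character data and drop parameters such as "; charset=utf-8".
  base <- tolower(trimws(sub(";.*$", "", as.character(type))))

  # Keyword -> category; keywords are tried in this order.
  rules <- c(video = "web/video", image = "web/image", html = "web/html",
             flash = "web/flash", plain = "web/plaintext",
             javascript = "web/javascript", xml = "web/xml", css = "web/css")

  out <- rep(NA_character_, length(base))
  for (pat in names(rules)) {
    # Only fill entries that no earlier keyword has claimed.
    hit <- is.na(out) & grepl(pat, base, fixed = TRUE)
    out[hit] <- rules[[pat]]
  }

  # Anything still unmatched is assumed to be binary, as in the answer above.
  out[is.na(out)] <- "web/binary"
  out
}

DF <- data.frame(Content = c("video/x-flv", "text/html; charset=utf-8",
                             "application/octet-stream", "text/plain",
                             "application/x-javascript; charset=utf-8"))
DF$short <- categorise(DF$Content)
print(DF)
```

Keeping the keywords in a single named vector makes the precedence explicit: a type containing two keywords is assigned to whichever appears first in `rules`.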