Extracting variables from a char vector

Time: 2015-05-01 16:23:15

Tags: r statistics glm

I want to build a logistic model from a data frame.

#'data.frame':   6532 obs. of  12 variables:
#$ NewsDesk      : chr  "Business" "Culture" "Business" "Business" ...
#$ SectionName   : chr  "Crosswords/Games" "Arts" "Business Day" "Business Day" ...
#$ SubsectionName: chr  "" "" "Dealbook" "Dealbook" ...
#$ Headline      : chr  "More School Daze" "New 96-Page Murakami Work Coming in December" "Public Pension Funds Stay Mum on Corporate Expats" "Boot Camp for Bankers" ...
#$ Snippet       : chr  "A puzzle from Ethan Cooper that reminds me that a bill is due." "The Strange Library will arrive just three and a half months after Mr. Murakamis latest novel, Colorless Tsukuru Tazaki and His"| __truncated__ "Public pension funds have major stakes in American companies moving overseas to cut their tax bills. But they are saying little"| __truncated__ "As they struggle to find new business to bolster sluggish earnings, banks consider the nations 25 million veterans and service "| __truncated__ ...
#$ Abstract      : chr  "A puzzle from Ethan Cooper that reminds me that a bill is due." "The Strange Library will arrive just three and a half months after Mr. Murakamis latest novel, Colorless Tsukuru Tazaki and His"| __truncated__ "Public pension funds have major stakes in American companies moving overseas to cut their tax bills. But they are saying little"| __truncated__ "As they struggle to find new business to bolster sluggish earnings, banks consider the nations 25 million veterans and service "| __truncated__ ...
#$ WordCount     : int  508 285 1211 1405 181 245 258 893 1077 188 ...
#$ PubDate       : POSIXlt, format: "2014-09-01 22:00:09" "2014-09-01 21:14:07" ...
#$ Popular       : int  1 0 0 1 1 1 0 1 1 0 ...

NewsDesk has 11 categories.

#         Business  Culture  Foreign Magazine    Metro National     OpEd  Science   Sports
#    1846     1548      676      375       31      198        4      521      194        2
#  Styles   Travel   TStyle
#     297      116      724

However, I only need OpEd, Business, Science, Culture and TStyle to build the model, based on their importance. I don't know how to extract these factor levels from NewsDesk. Any ideas?

1 answer:

Answer 0: (score: 0)

I would do it like this.

# Simulate example data (the column is named NewDesk here)
set.seed(1237)
NewDesk <- sample(c("OpEd", "Business", "Science", "Culture", "TStyle", "Foreign",
                    "Magazine", "Metro", "Sports", "Styles", "Travel"), 100, replace = TRUE)
df <- data.frame(Popular = sample(0:1, 100, replace = TRUE), NewDesk = NewDesk)

# Categories to keep
filter <- c("OpEd", "Business", "Science", "Culture", "TStyle")

# Keep only the rows whose NewDesk value is in the filter
head(df[df$NewDesk %in% filter, ])

#   Popular  NewDesk
#1        0  Culture
#3        0     OpEd
#4        0 Business
#5        1  Science
#8        1   TStyle
#11       1 Business
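
To get from the filtered rows to the logistic model itself, something along these lines might work. It is only a sketch on the same kind of simulated data, not the real data frame from the question. The names `sub` and `fit` are made up for illustration, and the formula `Popular ~ NewDesk` is an assumption about what the model should look like.

```r
# Simulated stand-in data; the real data frame from the question is not available here.
set.seed(1237)
NewDesk <- sample(c("OpEd", "Business", "Science", "Culture", "TStyle", "Foreign",
                    "Magazine", "Metro", "Sports", "Styles", "Travel"), 100, replace = TRUE)
df <- data.frame(Popular = sample(0:1, 100, replace = TRUE), NewDesk = NewDesk)
filter <- c("OpEd", "Business", "Science", "Culture", "TStyle")

# Keep only the rows whose desk is one of the wanted categories.
sub <- df[df$NewDesk %in% filter, ]

# Rebuild the column as a factor with only the wanted levels,
# so the model does not carry unused levels around.
sub$NewDesk <- factor(sub$NewDesk, levels = filter)

# Logistic regression of Popular on the remaining categories (assumed formula).
fit <- glm(Popular ~ NewDesk, data = sub, family = binomial)
summary(fit)
```

The first entry in `filter` ("OpEd" here) becomes the reference level, so the other coefficients are read relative to it. On the real data the same steps would apply to the `NewsDesk` column instead of `NewDesk`.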