Question

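I have been stuck on this problem for several hours, and I know I am missing something obvious.

Here is my problem:

I have a data frame in an .xlsx file, which can be downloaded here.

I loaded this data frame into R using RStudio on a Mac and called it demoData. It has 5 variables (AgeRange, Women, Men, Total and Year).

I cannot subset this data frame using a condition on AgeRange. The variable is formatted as xx-xx (00-04 means people aged 00 to 04). When I try, I get a message saying that no rows satisfy the condition. The class of the variable "AgeRange" is factor.

Here is my code: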

demoData[demoData$AgeRange=="00-04",]

Thanks for your help.

Edit, following Arun's comment. Here is the output of head(demoData):

     Age Feminin Masculin. Ensemble Annee
1 00-04     720       745     1465  2004 
2 05-09     745       767     1512  2004 
3 10-14     813       830     1643  2004 
4 15-19     824       820     1644  2004 
5 20-24     839       823     1662  2004 
6 25-29     752       699     1450  2004 

# str(demoData)
'data.frame':   272 obs. of  5 variables:
 $ Age      : Factor w/ 16 levels "00-04 ","05-09 ",..: 1 2 3 4 5 6 7 8 9 10 ...
 $ Feminin  : Factor w/ 216 levels "138 ","139 ",..: 112 124 164 165 174 130 106 86 78 66 ...
 $ Masculin.: Factor w/ 201 levels "120 ","122 ",..: 132 141 174 169 170 124 111 89 90 75 ...
 $ Ensemble : Factor w/ 242 levels "1041 ","1044 ",..: 53 66 115 116 119 50 38 14 9 238 ...
 $ Annee    : Factor w/ 17 levels "2004 ","2005",..: 1 1 1 1 1 1 1 1 1 1 ...

Answer 1

I read in the xlsx file using the xlsx package:

library(xlsx)
df<-read.xlsx("C:/Users/swatson1/Downloads/Evolution_Population_2004_2020.xlsx",1)

It looks like this:

> df
        Age Feminin MasculinÂ. Ensemble  Annee
1   00-04Â    720Â       745Â    1465Â  2004Â 
2   05-09Â    745Â       767Â    1512Â  2004Â

You can overwrite each column with the following to strip the extra characters:

df$Age<-substr(df$Age,1,5)

Alternatively, use gsub, since it works on any column regardless of the length of the entries:

df$Age<-gsub("Â\\s","",df$Age)

Then your code works:

df[df$Age=="00-04",]
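The same cleanup can be applied to every column at once, which also lets the count columns be converted back to numbers (str(demoData) shows them stored as factors). The following is only a sketch: the small data frame is a hypothetical stand-in for the imported file, and it assumes the stray character is a non-breaking space (U+00A0).

```r
# Hypothetical stand-in for the imported data: every value ends in U+00A0.
nbsp <- "\u00a0"
df <- data.frame(
  Age      = paste0(c("00-04", "05-09"), nbsp),
  Feminin  = paste0(c("720", "745"), nbsp),
  Ensemble = paste0(c("1465", "1512"), nbsp),
  stringsAsFactors = TRUE
)

# Remove the non-breaking space from every column, then trim ordinary
# whitespace. gsub() returns character vectors even for factor input.
df[] <- lapply(df, function(col) trimws(gsub(nbsp, "", col, fixed = TRUE)))

# Convert the count columns back to numeric.
num_cols <- c("Feminin", "Ensemble")
df[num_cols] <- lapply(df[num_cols], as.numeric)

# The original subsetting condition now matches.
df[df$Age == "00-04", ]
```

Using fixed = TRUE avoids any dependence on how the regular expression engine treats the character in the current locale.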

Answer 2

# copied from the Excel file; the trailing non-breaking space is written
# as \u00a0 so that it survives copy-and-paste
str1 <- "00-04\u00a0"
utf8ToInt(str1)
#[1]  48  48  45  48  52 160

There appears to be a non-breaking space at the end of the string. Clean up the file.
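To check whether a column in your own data has the same problem, you can inspect its factor levels directly. A minimal sketch, using a hypothetical factor age in place of demoData$Age:

```r
# Hypothetical factor mimicking the Age column from the Excel file.
age <- factor(c("00-04\u00a0", "05-09\u00a0"))

# Each level is six characters long, not five, so == "00-04" never matches.
nchar(levels(age))
any(age == "00-04")

# Code points of the first level; the last one is 160 (U+00A0).
utf8ToInt(levels(age)[1])

# Flag the levels that contain a non-breaking space.
grepl("\u00a0", levels(age), fixed = TRUE)
```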

You should be able to remove the non-breaking spaces with:

df$Age <- gsub(intToUtf8(160),"",df$Age)
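Note that gsub returns a character vector, so the column stops being a factor. If you would rather keep the factor class, one option is to clean the levels in place. A sketch with a hypothetical factor age standing in for df$Age:

```r
# Hypothetical factor mimicking the Age column.
age <- factor(c("00-04\u00a0", "05-09\u00a0", "00-04\u00a0"))

# Rewrite the levels only; the underlying integer codes are untouched.
levels(age) <- gsub(intToUtf8(160), "", levels(age), fixed = TRUE)

# The comparison now matches, and age is still a factor.
age == "00-04"
class(age)
```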

R: subsetting a data frame on a factor variable formatted as a range (xx-xx)
