How to remove certain characters from every row of a column

Time: 2017-03-24 20:11:31

Tags: r

I have this data table:

Year    GDP
1998–99 <U+20B9>1,668,739
1999–00 <U+20B9>1,858,205
2000–01 <U+20B9>2,000,743
2001–02 <U+20B9>2,175,260
2002–03 <U+20B9>2,343,864
2003–04 <U+20B9>2,625,819
2004–05 <U+20B9>2,971,464
2005–06 <U+20B9>3,390,503
2006–07 <U+20B9>3,953,276
2007–08 <U+20B9>4,582,086
2008–09 <U+20B9>5,303,567
2009–10 <U+20B9>6,108,903
2010–11 <U+20B9>7,248,860
2011–12 <U+20B9>8,391,691
2012–13 <U+20B9>9,388,876

What I want to do is remove "<U+20B9>" from every row. How can I do that?

I tried using grepl and grep, but they did not work for me:

df[!grepl("<U+20B9>", df$GDP),]

df[ grep("REVERSE", df$Name, invert = TRUE) , ]

Neither of these worked for me...

What I want is something like this:

Year    GDP
1998–99 1,668,739
1999–00 1,858,205
2000–01 2,000,743
2001–02 2,175,260
2002–03 2,343,864
2003–04 2,625,819
2004–05 2,971,464
2005–06 3,390,503
2006–07 3,953,276
2007–08 4,582,086
2008–09 5,303,567
2009–10 6,108,903
2010–11 7,248,860
2011–12 8,391,691
2012–13 9,388,876

I also tried the solution from the following question, but it did not work for me either: How to identify/delete non-UTF-8 characters in R

x <- "<U+20B9>"
Encoding(x) <- "UTF-8"
iconv(x, "UTF-8", "UTF-8",sub='')

This returns "<U+20B9>" unchanged...

2 answers:

Answer 0: (score: 1)

Try data.table, using some sample data:

library(data.table)  # provides setDT() and :=

data <- setDT(data.frame(
 Year=c('1998–99', 
     '1999–00', 
     '2000–01', 
     '2001–02', 
     '2002–03', 
     '2003–04', 
     '2004–05', 
     '2005–06', 
     '2006–07', 
     '2007–08'),
 GDP=c('<U+20B9>1,668,739',
    '<U+20B9>1,858,205',
    '<U+20B9>2,000,743',
    '<U+20B9>2,175,260',
    '<U+20B9>2,343,864',
    '<U+20B9>2,625,819',
    '<U+20B9>2,971,464',
    '<U+20B9>3,390,503',
    '<U+20B9>3,953,276',
    '<U+20B9>4,582,086')))

# Strip the leading "<U+XXXX>" marker and any surrounding whitespace, updating GDP by reference
data[, GDP := sub("^\\s*<U\\+\\w+>\\s*", '', GDP)]

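This regular expression pattern can be read as:

  1. The U\\+ part matches a literal sequence like U+

  2. \\w+ simply matches letters or digits, one or more of them

  3. This part is enclosed in < >, and the \\s* on either side just removes any whitespace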
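
As a minimal sketch of the pattern described above, the same substitution can be checked on a plain character vector in base R, without data.table. The vector name gdp and its values are illustrative assumptions, and the data is assumed to contain the literal text <U+20B9> rather than the rupee sign itself.

```r
# Illustrative sample values (assumed to hold the literal text "<U+20B9>")
gdp <- c("<U+20B9>1,668,739", " <U+20B9> 1,858,205", "2,000,743")

# ^\\s*       optional leading whitespace
# <U\\+\\w+>  a literal "<U+", one or more word characters, then ">"
# \\s*        optional whitespace after the marker
cleaned <- sub("^\\s*<U\\+\\w+>\\s*", "", gdp)

print(cleaned)
```

Values that do not start with the marker, such as the third element, pass through unchanged, so the call should be safe to apply to a column that is only partly affected.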

Answer 1: (score: 0)

A more minimal version of the answer above:

df$GDP <- substring(df$GDP, 2)
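
Note that substring(x, 2) drops exactly one leading character, so it fits only when the column stores the rupee sign as a single character. If the column instead holds the literal text <U+20B9>, a fixed-string substitution is one possible alternative. The sketch below assumes that second case; the data frame df and its values are illustrative, and the GDP_num column is a hypothetical addition for numeric work.

```r
# Illustrative data frame (assumed to hold the literal text "<U+20B9>")
df <- data.frame(
  Year = c("1998-99", "1999-00"),
  GDP  = c("<U+20B9>1,668,739", "<U+20B9>1,858,205"),
  stringsAsFactors = FALSE
)

# fixed = TRUE treats the pattern as plain text, so "+" needs no escaping
df$GDP <- sub("<U+20B9>", "", df$GDP, fixed = TRUE)

# Optionally drop the thousands separators and convert to numeric
df$GDP_num <- as.numeric(gsub(",", "", df$GDP, fixed = TRUE))

print(df)
```

Unlike grepl() or grep(), which only select rows that match, sub() and gsub() return modified strings, which is why they suit this task.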