Regular expression for capturing numbers

Time: 2015-03-12 00:11:06

Tags: regex r

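I have the following text (converted from an HTML document), and I am trying to capture these elements:

a) the letter symbol (more than one and fewer than 10 letters, uppercase only)

b) the number outside the parentheses (it may have a decimal part, and commas if it has more than three digits, but it is never negative)

c) the number inside the first pair of parentheses (it may have a decimal part, and commas if it has more than three digits, but it is never negative)

d) the number inside the second pair of parentheses (it may have a decimal part, commas if it has more than three digits, and it may be negative)

Example: I want ADL, 280, 3524 (without the comma) and 2 as four columns of a data frame.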
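The four elements a) to d) can be written down as regex fragments. The following is only a minimal sketch in base R, not a definitive pattern: the record `rec` is a made-up single entry in the same layout as the data, and `num` is an assumed definition of "a number with optional commas and decimals" that does not check comma placement.

```r
# Hypothetical single record, laid out like the entries in the data.
rec <- "ALICL 1,262 ( 6,554 ) (  -9 )"

# Assumed number pattern: digits, optional commas, optional decimal part.
num <- "[0-9][0-9,]*(?:\\.[0-9]+)?"

pattern <- paste0(
  "\\b([A-Z]{2,9})",                   # a) 2 to 9 uppercase letters
  "\\s+(", num, ")",                   # b) number outside the parentheses
  "\\s*\\(\\s*(", num, ")\\s*\\)",     # c) number in the first parentheses
  "\\s*\\(\\s*(-?", num, ")\\s*\\)"    # d) number in the second, may be negative
)

# regexec() returns the full match followed by the four captured groups.
parts <- regmatches(rec, regexec(pattern, rec, perl = TRUE))[[1]]
parts[-1]  # drop the full match, keep the four groups
```

The commas are still present in the captured strings, so they would have to be removed (for example with `gsub(",", "", ...)`) before calling `as.numeric`.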

df1 <- "ADL 280 ( 3,524 ) (  2 )          BDB 485 ( 1,618 ) (  -4 )          CPC 354 ( 5,899 ) (  3 )          EIC 405 ( 791 ) (  -11 )          ALDBL 333 ( 250 ) (  18 )          ALICL 1,262 ( 6,554 ) (  -9 )          ALICLP 410 ( 400 ) (  32 )          HPEX 142 ( 7,732 ) (  -1 )"

Here is my solution:

library(stringr)
# firm names: runs of two or more capital letters
firms <- str_extract_all(df1, "[A-Z]{2,}")[[1]]
# split on the firm names; drop the empty piece before the first name
split_firm <- strsplit(df1, "[A-Z]+")[[1]][-1]
# split each piece at the parentheses
split_bracket <- strsplit(split_firm, "\\(|\\)")
# keep pieces 1, 2 and 4 of each record (piece 3 is the gap between the brackets)
values_matrix <- do.call(rbind, lapply(split_bracket, `[`, c(1, 2, 4)))
# combine the values with the firm names
final_df <- data.frame(firms, values_matrix, stringsAsFactors = FALSE)
names(final_df) <- c("Firms", "Inward", "Outward", "Difference")
# columns 2:4: remove the commas, then convert to numeric
final_df[, 2:4] <- lapply(final_df[, 2:4], function(x) as.numeric(gsub(",", "", x)))

Expected output: 
      Firms Inward Outward Difference
    1    ADL    280    3524          2
    2    BDB    485    1618         -4
    3    CPC    354    5899          3
    4    EIC    405     791        -11
    5  ALDBL    333     250         18
    6  ALICL   1262    6554         -9
    7 ALICLP    410     400         32
    8   HPEX    142    7732         -1

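I would like to know whether the code above can be shortened with a regular expression, for example by capturing the number outside the parentheses and the numbers inside the first and second pairs of parentheses without splitting the string.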

2 Answers:

Answer 0: (score: 4)

It looks like you can strip out the parentheses and commas and then split on whitespace:

df1 <- "ADL 280 ( 3,524 ) (  2 )          BDB 485 ( 1,618 ) (  -4 )          CPC 354 ( 5,899 ) (  3 )          EIC 405 ( 791 ) (  -11 )          ALDBL 333 ( 250 ) (  18 )          ALICL 1,262 ( 6,554 ) (  -9 )          ALICLP 410 ( 400 ) (  32 )          HPEX 142 ( 7,732 ) (  -1 )"

x <- gsub('\\(|\\)|,', '', df1)
## or more simply, as thelatemail mentions in the comments:
x <- gsub('[(),]', '', df1)
## split on whitespace and fill a 4-column matrix row by row
x <- as.data.frame(matrix(strsplit(x, '\\s+')[[1]], ncol = 4, byrow = TRUE),
                   stringsAsFactors = FALSE)
x

#       V1   V2   V3  V4
# 1    ADL  280 3524   2
# 2    BDB  485 1618  -4
# 3    CPC  354 5899   3
# 4    EIC  405  791 -11
# 5  ALDBL  333  250  18
# 6  ALICL 1262 6554  -9
# 7 ALICLP  410  400  32
# 8   HPEX  142 7732  -1

Then rename the columns and convert to numeric:

x <- setNames(x, c('Firms', 'Inward', 'Outward', 'Difference'))
x[, 2:4] <- lapply(x[, 2:4], as.numeric)
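As a variation on the same strip-and-split idea, the cleaned text can also be handed to `read.table`, which names the columns and converts the numbers in one step. This is only a sketch under assumptions: `sample_txt` is a shortened, made-up copy of the data, and it assumes records are always separated by at least six whitespace characters, which holds for the sample but may not hold for other input.

```r
# Shortened sample in the same layout as df1 (illustrative only).
sample_txt <- "ADL 280 ( 3,524 ) (  2 )          BDB 485 ( 1,618 ) (  -4 )          ALICL 1,262 ( 6,554 ) (  -9 )"

# Put each record on its own line, assuming gaps of 6+ spaces separate records.
records <- gsub("\\s{6,}", "\n", sample_txt)

# Strip parentheses and commas, as in the answer above.
cleaned <- gsub("[(),]", "", records)

# read.table splits on whitespace and converts the numeric columns itself.
res <- read.table(text = cleaned,
                  col.names = c("Firms", "Inward", "Outward", "Difference"),
                  stringsAsFactors = FALSE)
res
```

Because `read.table` does the type conversion, no separate `as.numeric` step is needed here.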

Answer 1: (score: 2)

Here is a dplyr approach. Someone may have a more dplyr-y way to do this:

df1<-"ADL 280 ( 3,524 ) (  2 )          BDB 485 ( 1,618 ) (  -4 )          CPC 354 ( 5,899 ) (  3 )          EIC 405 ( 791 ) (  -11 )          ALDBL 333 ( 250 ) (  18 )          ALICL 1,262 ( 6,554 ) (  -9 )          ALICLP 410 ( 400 ) (  32 )          HPEX 142 ( 7,732 ) (  -1 )"

if (!require("pacman")) install.packages("pacman")
pacman::p_load(dplyr, tidyr)

df1 %>%
    strsplit("\\s{6,}") %>%
    unlist %>%
    data_frame(x=.) %>%
    extract(x, c("Firms", "Inward", "Outward", "Difference"), 
        "([A-Z]+)\\s+([0-9,]+)[ (]+([0-9,]+)[ )(]+([0-9-]+)") %>%
    mutate(
        Inward = extract_numeric(Inward),
        Outward = extract_numeric(Outward),
        Difference = extract_numeric(Difference)
    )

## Source: local data frame [8 x 4]
## 
##    Firms Inward Outward Difference
## 1    ADL    280    3524          2
## 2    BDB    485    1618         -4
## 3    CPC    354    5899          3
## 4    EIC    405     791        -11
## 5  ALDBL    333     250         18
## 6  ALICL   1262    6554         -9
## 7 ALICLP    410     400         32
## 8   HPEX    142    7732         -1

Here is a visual and verbal explanation of the regular expression, as provided by qdapRegex::explain from the qdapRegex package, which I maintain:

[Image: regular expression visualization]

NODE                     EXPLANATION
--------------------------------------------------------------------------------
  (                        group and capture to \\1:
--------------------------------------------------------------------------------
    [A-Z]+                   any character of: 'A' to 'Z' (1 or more
                             times (matching the most amount
                             possible))
--------------------------------------------------------------------------------
  )                        end of \\1
--------------------------------------------------------------------------------
  \\s+                      whitespace (\n, \r, \t, \f, and " ") (1 or
                           more times (matching the most amount
                           possible))
--------------------------------------------------------------------------------
  (                        group and capture to \\2:
--------------------------------------------------------------------------------
    [0-9,]+                  any character of: '0' to '9', ',' (1 or
                             more times (matching the most amount
                             possible))
--------------------------------------------------------------------------------
  )                        end of \\2
--------------------------------------------------------------------------------
  [ (]+                    any character of: ' ', '(' (1 or more
                           times (matching the most amount possible))
--------------------------------------------------------------------------------
  (                        group and capture to \\3:
--------------------------------------------------------------------------------
    [0-9,]+                  any character of: '0' to '9', ',' (1 or
                             more times (matching the most amount
                             possible))
--------------------------------------------------------------------------------
  )                        end of \\3
--------------------------------------------------------------------------------
  [ )(]+                   any character of: ' ', ')', '(' (1 or more
                           times (matching the most amount possible))
--------------------------------------------------------------------------------
  (                        group and capture to \\4:
--------------------------------------------------------------------------------
    [0-9-]+                  any character of: '0' to '9', '-' (1 or
                             more times (matching the most amount
                             possible))
--------------------------------------------------------------------------------
  )                        end of \\4
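The four capture groups explained above can also be tried out in base R, without dplyr or tidyr. The following is a rough sketch rather than a replacement for the answer's code: `sample_txt` is a shortened, made-up copy of the data and `to_num` is a hypothetical helper introduced here for illustration.

```r
# Shortened sample in the same layout as df1 (illustrative only).
sample_txt <- "ADL 280 ( 3,524 ) (  2 )          EIC 405 ( 791 ) (  -11 )          ALICL 1,262 ( 6,554 ) (  -9 )"

# One string per record, split on runs of six or more whitespace characters.
records <- strsplit(sample_txt, "\\s{6,}")[[1]]

# Same pattern as in the extract() call above.
pattern <- "([A-Z]+)\\s+([0-9,]+)[ (]+([0-9,]+)[ )(]+([0-9-]+)"

# Each match is the full match followed by the four groups; stack them by row.
m <- do.call(rbind, regmatches(records, regexec(pattern, records)))

# Hypothetical helper: remove thousands separators, then convert.
to_num <- function(x) as.numeric(gsub(",", "", x))

res <- data.frame(Firms      = m[, 2],
                  Inward     = to_num(m[, 3]),
                  Outward    = to_num(m[, 4]),
                  Difference = to_num(m[, 5]),
                  stringsAsFactors = FALSE)
res
```

Records that do not match the pattern yield an empty match and are silently dropped by `rbind`, so it may be worth comparing `nrow(res)` with `length(records)`.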