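I have the following text (converted from an HTML document), and I am trying to capture these elements:
a) The letter symbol (more than one and fewer than 10 letters, uppercase only)
b) The number outside the brackets (may have a decimal part and commas (if it has more than three digits), but is never negative)
c) The number inside the first pair of brackets (may have a decimal part and commas (if it has more than three digits), but is never negative)
d) The number inside the second pair of brackets (may have a decimal part and commas (if it has more than three digits), and may be negative)
Example: I want ADL, 280, 3524 (without the comma) and 2 as four columns in a data frame.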
df1 <- "ADL 280 ( 3,524 ) ( 2 ) BDB 485 ( 1,618 ) ( -4 ) CPC 354 ( 5,899 ) ( 3 ) EIC 405 ( 791 ) ( -11 ) ALDBL 333 ( 250 ) ( 18 ) ALICL 1,262 ( 6,554 ) ( -9 ) ALICLP 410 ( 400 ) ( 32 ) HPEX 142 ( 7,732 ) ( -1 )"
Here is my solution:
library(stringr)
# firm names: runs of two or more uppercase letters
firms <- str_extract_all(df1, "[A-Z]{2,}")
# split on the firm names, leaving the numeric part of each record
split_firm <- strsplit(df1, "[A-Z]+[1]*")
# split each numeric part on the brackets
split_bracket <- strsplit(split_firm[[1]], "\\(|\\)")
# rbind all values
rbind_values <- do.call(rbind, split_bracket)
# we need only columns 1, 2 and 4
values_matrix <- rbind_values[, c(1, 2, 4)]
# combine the values with the firm names
final_df <- data.frame(firms[[1]], values_matrix, stringsAsFactors = FALSE)
names(final_df) <- c("Firms", "Inward", "Outward", "Difference")
# convert all columns to character, then columns 2:4 to numeric after removing commas
final_df[] <- lapply(final_df, as.character)
final_df[, 2:4] <- lapply(final_df[, 2:4], function(x) as.numeric(gsub(",", "", x)))
Expected output:
Firms Inward Outward Difference
1 ADL 280 3524 2
2 BDB 485 1618 -4
3 CPC 354 5899 3
4 EIC 405 791 -11
5 ALDBL 333 250 18
6 ALICL 1262 6554 -9
7 ALICLP 410 400 32
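I would like to know whether the code above can be shortened with a regular expression, for example by capturing the number outside the brackets and the numbers inside the first and second brackets without any string splitting.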
Answer 0: (score: 4)
It looks like you can just strip out the parentheses and commas and then split on whitespace:
df1 <- "ADL 280 ( 3,524 ) ( 2 ) BDB 485 ( 1,618 ) ( -4 ) CPC 354 ( 5,899 ) ( 3 ) EIC 405 ( 791 ) ( -11 ) ALDBL 333 ( 250 ) ( 18 ) ALICL 1,262 ( 6,554 ) ( -9 ) ALICLP 410 ( 400 ) ( 32 ) HPEX 142 ( 7,732 ) ( -1 )"
x <- gsub('\\(|\\)|,', '', df1)
## or more simply as thelatemail mentions in comments:
x <- gsub('[(),]', '', df1)
x <- as.data.frame(matrix(strsplit(x, '\\s+')[[1]], ncol = 4, byrow = TRUE),
                   stringsAsFactors = FALSE)
x
# V1 V2 V3 V4
# 1 ADL 280 3524 2
# 2 BDB 485 1618 -4
# 3 CPC 354 5899 3
# 4 EIC 405 791 -11
# 5 ALDBL 333 250 18
# 6 ALICL 1262 6554 -9
# 7 ALICLP 410 400 32
# 8 HPEX 142 7732 -1
Then change the names and convert to numeric:
x <- setNames(x, c('Firms', 'Inward', 'Outward', 'Difference'))
x[, 2:4] <- lapply(x[, 2:4], as.numeric)
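The steps above can be collected into one small helper. This is only a sketch: the function name parse_firms is hypothetical (it does not appear in the answer), and it assumes every record consists of exactly four whitespace-separated fields once the parentheses and commas are removed. The trimws() call guards against leading whitespace, which would otherwise produce an empty first token.

```r
# Hypothetical helper (the name is illustrative, not part of the answer).
# Assumes each record is: NAME number ( number ) ( number )
parse_firms <- function(txt) {
  # drop parentheses and commas, trim the ends, then split on runs of whitespace
  tokens <- strsplit(trimws(gsub("[(),]", "", txt)), "\\s+")[[1]]
  # the matrix reshape below is only valid if tokens come in groups of four
  stopifnot(length(tokens) %% 4 == 0)
  out <- as.data.frame(matrix(tokens, ncol = 4, byrow = TRUE),
                       stringsAsFactors = FALSE)
  names(out) <- c("Firms", "Inward", "Outward", "Difference")
  out[2:4] <- lapply(out[2:4], as.numeric)
  out
}

parse_firms("ADL 280 ( 3,524 ) ( 2 ) ALICL 1,262 ( 6,554 ) ( -9 )")
```

The stopifnot() check makes the function fail loudly on a malformed record instead of silently shifting values into the wrong columns.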
Answer 1: (score: 2)
Here is a dplyr approach. Someone may have a more dplyr-y way of doing this:
df1<-"ADL 280 ( 3,524 ) ( 2 ) BDB 485 ( 1,618 ) ( -4 ) CPC 354 ( 5,899 ) ( 3 ) EIC 405 ( 791 ) ( -11 ) ALDBL 333 ( 250 ) ( 18 ) ALICL 1,262 ( 6,554 ) ( -9 ) ALICLP 410 ( 400 ) ( 32 ) HPEX 142 ( 7,732 ) ( -1 )"
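if (!require("pacman")) install.packages("pacman")
pacman::p_load(dplyr, tidyr)

df1 %>%
    # one string per firm: split on whitespace that precedes a capital letter
    # (works whether records are separated by one space or by many)
    strsplit("\\s+(?=[A-Z])", perl = TRUE) %>%
    unlist %>%
    # note: data_frame() and extract_numeric() are deprecated in current releases;
    # tibble() and readr::parse_number() are their replacements
    data_frame(x = .) %>%
    extract(x, c("Firms", "Inward", "Outward", "Difference"),
        "([A-Z]+)\\s+([0-9,]+)[ (]+([0-9,]+)[ )(]+([0-9-]+)") %>%
    mutate(
        Inward = extract_numeric(Inward),
        Outward = extract_numeric(Outward),
        Difference = extract_numeric(Difference)
    )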
## Source: local data frame [8 x 4]
##
## Firms Inward Outward Difference
## 1 ADL 280 3524 2
## 2 BDB 485 1618 -4
## 3 CPC 354 5899 3
## 4 EIC 405 791 -11
## 5 ALDBL 333 250 18
## 6 ALICL 1262 6554 -9
## 7 ALICLP 410 400 32
## 8 HPEX 142 7732 -1
Here is an explanation of the regular expression, both visually and verbally, provided by qdapRegex::explain from the qdapRegex package that I maintain:
NODE                     EXPLANATION
--------------------------------------------------------------------------------
  (                        group and capture to \\1:
--------------------------------------------------------------------------------
    [A-Z]+                   any character of: 'A' to 'Z' (1 or more
                             times (matching the most amount
                             possible))
--------------------------------------------------------------------------------
  )                        end of \\1
--------------------------------------------------------------------------------
  \\s+                     whitespace (\n, \r, \t, \f, and " ") (1 or
                           more times (matching the most amount
                           possible))
--------------------------------------------------------------------------------
  (                        group and capture to \\2:
--------------------------------------------------------------------------------
    [0-9,]+                  any character of: '0' to '9', ',' (1 or
                             more times (matching the most amount
                             possible))
--------------------------------------------------------------------------------
  )                        end of \\2
--------------------------------------------------------------------------------
  [ (]+                    any character of: ' ', '(' (1 or more
                           times (matching the most amount possible))
--------------------------------------------------------------------------------
  (                        group and capture to \\3:
--------------------------------------------------------------------------------
    [0-9,]+                  any character of: '0' to '9', ',' (1 or
                             more times (matching the most amount
                             possible))
--------------------------------------------------------------------------------
  )                        end of \\3
--------------------------------------------------------------------------------
  [ )(]+                   any character of: ' ', ')', '(' (1 or more
                           times (matching the most amount possible))
--------------------------------------------------------------------------------
  (                        group and capture to \\4:
--------------------------------------------------------------------------------
    [0-9-]+                  any character of: '0' to '9', '-' (1 or
                             more times (matching the most amount
                             possible))
--------------------------------------------------------------------------------
  )                        end of \\4
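The pattern explained above is deliberately permissive ([0-9,]+ and [0-9-]+). For comparison, here is a sketch of a stricter variant that follows the constraints listed in the question (2 to 9 uppercase letters, optional commas and decimals, a minus sign allowed only in the second pair of brackets) and pulls all four fields out of the whole string without splitting it first. It uses base R only; the names num_re, pattern and num are illustrative and do not come from either answer, and the pattern has been checked only against the sample string.

```r
df1 <- "ADL 280 ( 3,524 ) ( 2 ) BDB 485 ( 1,618 ) ( -4 ) CPC 354 ( 5,899 ) ( 3 ) EIC 405 ( 791 ) ( -11 ) ALDBL 333 ( 250 ) ( 18 ) ALICL 1,262 ( 6,554 ) ( -9 ) ALICLP 410 ( 400 ) ( 32 ) HPEX 142 ( 7,732 ) ( -1 )"

# a non-negative number: digits with optional commas and an optional decimal part
num_re <- "\\d[\\d,]*(?:\\.\\d+)?"

# NAME number ( number ) ( possibly negative number )
pattern <- paste0(
  "\\b([A-Z]{2,9})\\s+(", num_re, ")",
  "\\s*\\(\\s*(", num_re, ")\\s*\\)",
  "\\s*\\(\\s*(-?", num_re, ")\\s*\\)"
)

# step 1: find every complete record in the string
full <- regmatches(df1, gregexpr(pattern, df1, perl = TRUE))[[1]]
# step 2: re-match each record to pull out its four capture groups
parts <- regmatches(full, regexec(pattern, full, perl = TRUE))
m <- do.call(rbind, parts)  # column 1 is the full match, columns 2:5 the groups

# strip thousands separators before converting to numeric
num <- function(x) as.numeric(gsub(",", "", x, fixed = TRUE))

result <- data.frame(
  Firms      = m[, 2],
  Inward     = num(m[, 3]),
  Outward    = num(m[, 4]),
  Difference = num(m[, 5]),
  stringsAsFactors = FALSE
)
result
```

Records that do not fit the pattern are skipped rather than raising an error, so comparing nrow(result) with the expected number of firms is a cheap way to spot malformed input.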