Question

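Hi, I have a dataset that contains text, integers, and decimal numbers. Each text entry is a paragraph with all of these mixed together. I am trying to strip the integers and decimal numbers out of the text content; there are about 30k entries.

Input data format:

This. Is a good 13 part. of 135.67 code
how to strip 66.8 in the content 6879
get the numbers 3475.5 from. The data. 879 in this 369426

Output: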

13 135.67
66.8 6879
3475.5 879 369426

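I tried replacing all the letters one by one, but 26 + 26 replacements make the code lengthy, and replacing "." also removes the "." from the numbers. Thanks, Praveen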

Answer 1

You can try

library(stringr)
lapply(str_extract_all(a, "[0-9.]+"), function(x) as.numeric(x)[!is.na(as.numeric(x))])
[[1]]
[1]  13.00 135.67

[[2]]
[1]   66.8 6879.0

[[3]]
[1]   3475.5    879.0 369426.0

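The basic idea is borrowed from elsewhere, but here the character class also includes ".". The lapply call converts each match to numeric and drops the NAs.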
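To see why the NA filter is needed, here is a minimal base R sketch of the same idea. It uses regmatches and gregexpr as stand-ins for str_extract_all (an assumption made only to avoid the package dependency) and repeats the sample data so that it runs on its own. The names raw and nums are illustrative.

```r
a <- c("This. Is a good 13 part. of 135.67 code",
       "how to strip 66.8 in the content 6879",
       "get the numbers 3475.5 from. The data. 879 in this 369426")

# "[0-9.]+" also matches sentence-ending periods on their own
raw <- regmatches(a, gregexpr("[0-9.]+", a))
raw[[1]]
# [1] "."      "13"     "."      "135.67"

# Convert once, silence the coercion warnings, and drop the NAs
nums <- lapply(raw, function(x) {
  v <- suppressWarnings(as.numeric(x))
  v[!is.na(v)]
})
```

A token such as "1.2.3" would also be matched as a whole and then dropped as NA, so this approach silently skips malformed numbers.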

Data:

a <- c("This. Is a good 13 part. of 135.67 code",
       "how to strip 66.8 in the content 6879",
       "get the numbers 3475.5 from. The data. 879 in this 369426")

Answer 2

Don't forget that R already has regular expression functions built in:

input <- c('This. Is a good 13 part. of 135.67 code', 'how to strip 66.8 in the content 6879',
           'get the numbers 3475.5 from. The data. 879 in this 369426')

m <- gregexpr('\\b\\d+(?:\\.\\d+)?\\b', input)
(output <- lapply(regmatches(input, m), as.numeric))

This produces

[[1]]
[1]  13.00 135.67

[[2]]
[1]   66.8 6879.0

[[3]]
[1]   3475.5    879.0 369426.0
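The question asks for one space-separated string per entry rather than numeric vectors. A possible way to get there, sketched under the assumption that the matched text should be kept exactly as written (so "13" stays "13" instead of becoming 13.00), is to paste the matches together. Passing perl = TRUE is a choice made here to make the regex flavour explicit; the name flat is illustrative.

```r
input <- c("This. Is a good 13 part. of 135.67 code",
           "how to strip 66.8 in the content 6879",
           "get the numbers 3475.5 from. The data. 879 in this 369426")

# \b\d+        : a run of digits starting at a word boundary
# (?:\.\d+)?   : optionally a decimal point followed by more digits
m <- gregexpr("\\b\\d+(?:\\.\\d+)?\\b", input, perl = TRUE)

# Collapse the matches of each entry into a single string
flat <- vapply(regmatches(input, m), paste, character(1), collapse = " ")
flat
# [1] "13 135.67"         "66.8 6879"         "3475.5 879 369426"
```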

Answer 3

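An option that splits the text into lines with strsplit and then uses gsub to remove every run of [[:alpha:]] characters followed by either a "." or optional whitespace: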

text <- "1. This. Is a good 13 part. of 135.67 code
2. how to strip 66.8 in the content 6879
3. get the numbers 3475.5 from. The data. 879 in this 369426"

lines <- strsplit(text, split = "\n")[[1]]
gsub("[[:alpha:]]+\\.|[[:alpha:]]+\\s*","",lines)
#[1] "1.  13  135.67 "       
#[2] "2. 66.8 6879"          
#[3] "3. 3475.5   879 369426"
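The result above still carries the leading list numbers and some doubled spaces. A small follow-up sketch, assuming every line starts with an "N." list marker as in this example, could tidy it up. The names stripped, no_index, and tidy are illustrative.

```r
lines <- c("1. This. Is a good 13 part. of 135.67 code",
           "2. how to strip 66.8 in the content 6879",
           "3. get the numbers 3475.5 from. The data. 879 in this 369426")

# Same substitution as in the answer
stripped <- gsub("[[:alpha:]]+\\.|[[:alpha:]]+\\s*", "", lines)

# Drop the leading "N." marker (assumed present on every line)
no_index <- sub("^[0-9]+\\.[[:space:]]*", "", stripped)

# Squeeze repeated whitespace and trim both ends
tidy <- trimws(gsub("[[:space:]]+", " ", no_index))
tidy
# [1] "13 135.67"         "66.8 6879"         "3475.5 879 369426"
```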

Answer 4

Another approach using gsub:

string = c('This. Is a good 13 part. of 135.67 code', 
           'how to strip 66.8 in the content 6879',
           'get the numbers 3475.5 from. The data. 879 in this 369426')

trimws(gsub('[\\p{L}\\.\\s](?!\\d)+', '', string, perl = TRUE))
# [1] "13 135.67"         "66.8 6879"         "3475.5 879 369426"
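To make the mechanism easier to follow, here is a simplified variant written for illustration. It drops the quantifier after the lookahead, which is an assumption on my part rather than the answer's exact pattern: each letter, period, or whitespace character is removed unless a digit follows it directly. The helper name strip_non_numeric is hypothetical.

```r
# Remove one letter, "." or whitespace character at a time,
# but only when the next character is not a digit
strip_non_numeric <- function(x) {
  trimws(gsub("[\\p{L}.\\s](?!\\d)", "", x, perl = TRUE))
}

strip_non_numeric("get the numbers 3475.5 from. The data. 879 in this 369426")
# [1] "3475.5 879 369426"

# Caveat: a letter directly followed by a digit survives
strip_non_numeric("v2 has 10 items")
# [1] "v2 10"
```

The second call shows a limitation worth checking against the real data: identifiers such as "v2" are not cleaned, because the lookahead protects any character that sits immediately before a digit.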

Answer 5

A solution without regular expressions or external packages:

sapply(
  strsplit(input, " "),
  function(x) {
    x <- suppressWarnings(as.numeric(x))
    paste(x[!is.na(x)], collapse = " ")
  }
)
[1] "13 135.67"         "66.8 6879"         "3475.5 879 369426"
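One caveat of splitting on spaces is that a number with punctuation attached, such as "13," or "135.67.", fails the numeric conversion and is dropped. A possible workaround, sketched here with made-up input and a hypothetical helper named extract, strips trailing punctuation from each token first. Unlike the answer above, it uses one small regular expression for that step.

```r
# Made-up input where numbers are followed by punctuation
input2 <- c("Total was 13, then 135.67.", "no numbers here")

extract <- function(x) {
  x <- sub("[[:punct:]]+$", "", x)      # "13," -> "13", "135.67." -> "135.67"
  x <- suppressWarnings(as.numeric(x))  # non-numeric tokens become NA
  paste(x[!is.na(x)], collapse = " ")
}

res <- vapply(strsplit(input2, " ", fixed = TRUE), extract, character(1))
res
# [1] "13 135.67" ""
```

Note that as.numeric also accepts forms such as "1e5", so a token like that would be kept as a number.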

Strip numbers from text: R
