Extract the last word from many data frame columns (R)

Time: 2017-01-21 03:14:20

Tags: r

I have a data frame with 3 columns. The data looks like this:

V1                V2               V3
Auto = Chevy      Engine = V6      Trans = Auto
Auto = Chevy      Engine = V8      Trans = Manual
Auto = Chevy      Engine = V10     Trans = Manual

I would like the data frame to look like this:

Auto       Engine  Trans
Chevy      V6      Auto
Chevy      V8      Manual
Chevy      V10     Manual

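In other words, retrieve the last string after " =", take the first value in each column, and make it the column header. Or just retrieve the last word after " =" and put it back into the same column, without adding new columns.

Can this be done in R? Many thanks!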

3 answers:

Answer 0: (score: 5)

Well, if you don't mind using plain old-style (pre-Hadley) R, here is a solution:

> x <- as.data.frame(list(c('Auto = Chevy', 'Auto = Chevy', 'Auto = Chevy'),
+ c('Engine = V6', 'Engine = V8', 'Engine = V10'),
+ c('Trans = Auto', 'Trans = Manual', 'Trans = Manual')),
+ stringsAsFactors=FALSE)
> values <- lapply(x, gsub, pattern='.*= ', replacement='')
> new.names <- lapply(x, gsub, pattern=' =.*', replacement='')
> new.names <- lapply(new.names, unique)
> names(values) <- new.names
> new.frame <- as.data.frame(values, stringsAsFactors = FALSE)
> new.frame
   Auto Engine  Trans
1 Chevy     V6   Auto
2 Chevy     V8 Manual
3 Chevy    V10 Manual

It is not well suited to data frames with many columns, but it works fine for narrow data frames with many rows.
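The steps in this answer can be wrapped into a single function. This is only a minimal sketch: the helper name `split_name_value` is hypothetical (it is not part of the answer), and it assumes every cell has the form "name = value" with exactly one distinct name per column, which the added check enforces.

```r
# Hypothetical helper (not from the answer): turn "name = value" cells
# into values, and use the name found in each column as its header.
split_name_value <- function(x) {
  # Drop everything up to and including "= " to keep the value.
  values <- lapply(x, sub, pattern = ".*= ", replacement = "")
  # Drop everything from " =" onwards to keep the name, then de-duplicate.
  new_names <- lapply(x, function(col) unique(sub(" =.*", "", col)))
  # Each column must yield exactly one name, otherwise renaming is ambiguous.
  stopifnot(all(lengths(new_names) == 1))
  names(values) <- unlist(new_names)
  as.data.frame(values, stringsAsFactors = FALSE)
}

x <- data.frame(
  V1 = c("Auto = Chevy", "Auto = Chevy", "Auto = Chevy"),
  V2 = c("Engine = V6", "Engine = V8", "Engine = V10"),
  V3 = c("Trans = Auto", "Trans = Manual", "Trans = Manual"),
  stringsAsFactors = FALSE
)
res <- split_name_value(x)
print(res)
```

The explicit check makes the function stop early if a column mixes several names, instead of silently producing a misleading header.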

Answer 1: (score: 4)

Alternatively, we can avoid the stringr crutch and use the highly optimized functions in stringi directly (most stringr functions wrap stringi functions):

library(stringi)
library(dplyr)

read.table(text='V1,V2,V3
"Auto = Chevy","Engine = V6","Trans = Auto"
"Auto = Chevy","Engine = V8","Trans = Manual"
"Auto = Chevy","Engine = V10","Trans = Manual"',
sep=",", header=TRUE, stringsAsFactors=FALSE) -> df

mutate_all(df, funs(stri_extract_last_words))
##      V1  V2     V3
## 1 Chevy  V6   Auto
## 2 Chevy  V8 Manual
## 3 Chevy V10 Manual

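A more tidyverse-style version that also meets the "column names" requirement, which may break your R script if the columns aren't what you think they are: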

library(stringi)
library(dplyr)
library(purrr)

read.table(text='V1,V2,V3
"Auto = Chevy","Engine = V6","Trans = Auto"
"Auto = Chevy","Engine = V8","Trans = Manual"
"Auto = Chevy","Engine = V10","Trans = Manual"',
sep=",", header=TRUE, stringsAsFactors=FALSE) -> df

mutate_all(df, funs(stri_extract_last_words)) %>%
  setNames(mutate_all(df, stri_extract_first_words) %>%
             distinct() %>%
             flatten_chr())

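A version with more tidyverse and stringi, which reads both the name and the value from every cell and so depends less on assumptions that could break your R script if the columns aren't what you think they are: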

library(stringi)
library(tidyverse)

read.table(text='V1,V2,V3
"Auto = Chevy","Engine = V6","Trans = Auto"
"Auto = Chevy","Engine = V8","Trans = Manual"
"Auto = Chevy","Engine = V10","Trans = Manual"',
sep=",", header=TRUE, stringsAsFactors=FALSE) -> df

by_row(df, function(x) {
  map(x, stri_match_all_regex, "(.*) = (.*)") %>%
    map(1) %>%
    map(~setNames(.[,3], .[,2])) %>%
    flatten_df()
}) %>%
  select(.out) %>%
  unnest()
## # A tibble: 3 × 3
##    Auto Engine  Trans
##   <chr>  <chr>  <chr>
## 1 Chevy     V6   Auto
## 2 Chevy     V8 Manual
## 3 Chevy    V10 Manual
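For readers without the tidyverse installed, the same per-cell idea can be sketched in base R. This is not the answer's code: the helper name `parse_row` is hypothetical, and the sketch assumes every cell matches "name = value" and that the names appear in the same order in every row, because `rbind` aligns by position.

```r
df <- data.frame(
  V1 = c("Auto = Chevy", "Auto = Chevy", "Auto = Chevy"),
  V2 = c("Engine = V6", "Engine = V8", "Engine = V10"),
  V3 = c("Trans = Auto", "Trans = Manual", "Trans = Manual"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: every cell of a row contributes its own name and value.
parse_row <- function(cells) {
  # Each element of m is c(full match, name, value) for one cell.
  m <- regmatches(cells, regexec("^(.*) = (.*)$", cells))
  values <- vapply(m, `[`, character(1), 3)
  keys <- vapply(m, `[`, character(1), 2)
  setNames(values, keys)
}

rows <- lapply(seq_len(nrow(df)), function(i) {
  parse_row(unlist(df[i, ], use.names = FALSE))
})

# rbind() takes the column names from the first named vector.
out <- as.data.frame(do.call(rbind, rows), stringsAsFactors = FALSE)
print(out)
```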

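Answer 2: (score: 3)

We can do this using only base R options

1) Using scan and sub - after converting the data.frame to a matrix, remove the = followed by whitespace with sub, then use scan to return a vector of words. Using recycling of a logical vector (c(FALSE, TRUE)), we take every second word in 'v1' (the values) and assign the output to 'df2'. We then change the column names to the unique elements of the other alternating words in 'v1', selected with c(TRUE, FALSE) as the recycled logical vector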

df2 <- df1
v1 <- scan(text=sub("=\\s+", "", as.matrix(df1)), what="", sep=" ", quiet=TRUE)
df2[] <- v1[c(FALSE, TRUE)]
colnames(df2) <- unique(v1[c(TRUE, FALSE)])
df2
#   Auto Engine  Trans
#1 Chevy     V6   Auto
#2 Chevy     V8 Manual
#3 Chevy    V10 Manual
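The logical recycling used above is easier to see on a single column. A minimal sketch, assuming `v1` holds the words in the order scan would return them for one column (name, value, name, value, ...):

```r
# Words for one column, alternating name and value.
v1 <- c("Engine", "V6", "Engine", "V8", "Engine", "V10")

# c(FALSE, TRUE) is recycled along v1, so positions 2, 4, 6 are kept.
values <- v1[c(FALSE, TRUE)]

# c(TRUE, FALSE) keeps positions 1, 3, 5; unique() collapses the repeats.
headers <- unique(v1[c(TRUE, FALSE)])

print(values)
print(headers)
```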

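2) Using sub - extract the last word by capturing it as a group and replacing the whole string with the backreference (\\1), while looping over the columns (lapply(df1, ..))

df2[] <- lapply(df1, function(x) sub(".*\\b(\\w+)$", "\\1", x))

3) Using strsplit - split the string by the delimiter ("=\\s+") and take the last element (tail, 1), while looping over the columns as in 2)

df2[] <- lapply(df1, function(x) sapply(strsplit(x, "=\\s+"), tail, 1))

We change the column names in the second and third solutions by extracting the first word with sub from the unlist-ed first row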

colnames(df2) <- sub("\\s+=.*", "", unlist(df1[1,], use.names = FALSE))
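To see what each regular expression does, it may help to run the same calls on one plain character vector. A small sketch, where the vector `x` is just an illustrative stand-in for one column of the data:

```r
x <- c("Engine = V6", "Engine = V8", "Engine = V10")

# Capture the trailing word and keep only the captured group.
last_by_sub <- sub(".*\\b(\\w+)$", "\\1", x)

# Split on "=" plus the whitespace after it, then keep the last piece.
last_by_split <- sapply(strsplit(x, "=\\s+"), tail, 1)

# Drop everything from the whitespace before "=" onwards to get the header.
header <- unique(sub("\\s+=.*", "", x))

print(last_by_sub)
print(last_by_split)
print(header)
```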

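Or other options based on package solutions

1) Using str_extract - extract the word (\\w+) just before the end ($) of the string while looping over the columns with lapply, and assign the list output to a copy of the original dataset ('df2'). Then we change the column names by extracting the first word (with word) from the unlist-ed first row of the original dataset.

library(stringr)
df2[] <- lapply(df1, function(x) str_extract(x, "\\w+$"))
colnames(df2) <- word(unlist(df1[1,]), 1)
df2
#   Auto Engine  Trans
#1 Chevy     V6   Auto
#2 Chevy     V8 Manual
#3 Chevy    V10 Manual

2) Using tidyverse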

library(dplyr)
library(tidyr)
gather(df1) %>% 
      separate(value, into = c("header", "value")) %>%
      group_by(key) %>%
      mutate(i1 = row_number()) %>% 
      ungroup() %>% 
      select(-key) %>% 
      spread(header, value) %>%
      select(-i1)
# A tibble: 3 × 3
#   Auto Engine  Trans
#* <chr>  <chr>  <chr>
#1 Chevy     V6   Auto
#2 Chevy     V8 Manual
#3 Chevy    V10 Manual
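The snippets in this answer use df1 without showing its definition. The following is a minimal sketch that assumes df1 is the data frame from the question, and that mirrors the reshape-to-long, split, reshape-to-wide idea in base R only. It is an analogue for illustration, not the answer's own code, and it relies on the rows of the long table staying in their original order within each header.

```r
# Assumption: df1 is the question's data frame (its definition is not shown above).
df1 <- data.frame(
  V1 = c("Auto = Chevy", "Auto = Chevy", "Auto = Chevy"),
  V2 = c("Engine = V6", "Engine = V8", "Engine = V10"),
  V3 = c("Trans = Auto", "Trans = Manual", "Trans = Manual"),
  stringsAsFactors = FALSE
)

# Long format: one row per cell, stacked column by column.
long <- data.frame(
  i1 = rep(seq_len(nrow(df1)), times = ncol(df1)),
  cell = unlist(df1, use.names = FALSE),
  stringsAsFactors = FALSE
)

# Split every cell into header and value around the "=".
parts <- strsplit(long$cell, "\\s*=\\s*")
long$header <- vapply(parts, `[`, character(1), 1)
long$value <- vapply(parts, `[`, character(1), 2)

# Wide format: one column per header, values kept in row order.
wide <- as.data.frame(split(long$value, long$header), stringsAsFactors = FALSE)
print(wide)
```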