tidyr pivot_longer: handling multiple observations and values per row

Time: 2019-12-10 23:55:09

Tags: r regex tidyr

I need to read an Excel file that has multiple observations and values per row, with complex column names. When loaded, it looks something like this:

library(tidyverse)
library(janitor)

# An input table read from xlsx, with a format similar to this
input_table <- tribble(~"product" , 
                       ~"Price Store 1 ($1000/unit)",
                       ~"Quantity in Store 1 (units)",
                       ~"Price Store 2 ($1000/unit)",
                       ~"Quantity in Store 2 (units)",
                       'product a', 10, 100, 20, 70,
                       'product b', 30, 10, 35, 10)

I would like to tidy it using some form of gather / pivot_longer, so that the output looks like this:

# Desired output
output_table <- tribble(~'product', ~'store', ~'price', ~'quantity',
                        'product a', 1, 10, 100,
                        'product a', 2, 20, 70,
                        'product b', 1, 30, 10,
                        'product b', 2, 35, 10)

Is there a simple way to get there using pivot_longer? Extracting the key number (store, in this case) would probably require some complex regular expression that I don't know how to write.

2 answers:

Answer 0: (score: 2)

Yes, we can:

tidyr::pivot_longer(input_table, 
                   cols = -product, 
                   names_to = c(".value", "Store"),
                   names_pattern =  "(\\w+).*?(\\d)")

#  product   Store Price Quantity
#  <chr>     <chr> <dbl>    <dbl>
#1 product a 1        10      100
#2 product a 2        20       70
#3 product b 1        30       10
#4 product b 2        35       10

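We use names_pattern to get the column names (Price, Quantity) along with the store number. The first word (\\w+) becomes the column name, and the first digit that follows it (\\d) is taken as the store number.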
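To see what the two capture groups pick up, the pattern can be tried directly on the column names. This is only a rough sketch: it uses base R's regexec() with perl = TRUE rather than whatever regex engine tidyr calls internally, and the variable names (col_names, parts, value_name, store) are made up for illustration.

```r
# Column names copied from the question's input_table
col_names <- c("Price Store 1 ($1000/unit)",
               "Quantity in Store 1 (units)",
               "Price Store 2 ($1000/unit)",
               "Quantity in Store 2 (units)")

# perl = TRUE selects PCRE, which supports \w, \d and the lazy .*?
m <- regmatches(col_names,
                regexec("(\\w+).*?(\\d)", col_names, perl = TRUE))

# Each list element holds the full match followed by the two capture groups;
# stack them into a matrix with one row per column name
parts <- do.call(rbind, m)

value_name <- parts[, 2]  # first group: mapped to ".value" (Price / Quantity)
store      <- parts[, 3]  # second group: mapped to "Store"

print(data.frame(col_names, value_name, store))
```

Because .*? is lazy, the digit captured is the first one after the leading word, so the store number is picked up before the 1000 in "($1000/unit)". Note that (\\d) matches a single digit only; if store numbers could reach 10 or more, (\\d+) would presumably be needed instead.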

Answer 1: (score: 1)

We can use names_pattern in pivot_longer to match one or more letters, followed by one or more non-digit characters, and then capture the digit:

library(tidyr)
pivot_longer(input_table, cols = -product, 
             names_to = c(".value", "Store"),
             names_pattern = "([A-Za-z]+)[^0-9]+([0-9])")
# A tibble: 4 x 4
#  product   Store Price Quantity
#  <chr>     <chr> <dbl>    <dbl>
#1 product a 1        10      100
#2 product a 2        20       70
#3 product b 1        30       10
#4 product b 2        35       10
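
Both results above return Store as a character column and keep the capitalised names Price and Quantity, whereas the desired output in the question has lowercase names and a numeric store. A minimal sketch of one way to close that gap, assuming tidyr 1.1.0 or later (for the names_transform argument) and dplyr for rename_with(); the name tidy_table is made up for illustration:

```r
library(tidyr)
library(dplyr)

# Same input as in the question
input_table <- tribble(~"product",
                       ~"Price Store 1 ($1000/unit)",
                       ~"Quantity in Store 1 (units)",
                       ~"Price Store 2 ($1000/unit)",
                       ~"Quantity in Store 2 (units)",
                       'product a', 10, 100, 20, 70,
                       'product b', 30, 10, 35, 10)

tidy_table <- input_table %>%
  pivot_longer(cols = -product,
               names_to = c(".value", "store"),
               names_pattern = "(\\w+).*?(\\d)",
               # convert the captured store number from character to numeric
               names_transform = list(store = as.numeric)) %>%
  # Price / Quantity -> price / quantity
  rename_with(tolower)

print(tidy_table)
```

Converting with as.numeric keeps store as a double, which is what tribble() produces for the literal 1 and 2 in the desired output; as.integer would work as well if an integer column is preferred.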