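I need to read an Excel file that holds several observations and values per row, with complicated column names. Once loaded, it looks like the input_table defined below.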
I would like to make it tidy with some form of gather / pivot_longer. The input table and the desired output are:
library(tidyverse)
library(janitor)
# An input table read from xlsx, with a format similar to this
input_table <- tribble(
  ~"product",
  ~"Price Store 1 ($1000/unit)",
  ~"Quantity in Store 1 (units)",
  ~"Price Store 2 ($1000/unit)",
  ~"Quantity in Store 2 (units)",
  'product a', 10, 100, 20, 70,
  'product b', 30, 10, 35, 10
)
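# Desired output
output_table <- tribble(
  ~'product', ~'store', ~'price', ~'quantity',
  'product a', 1, 10, 100,
  'product a', 2, 20, 70,
  'product b', 1, 30, 10,
  'product b', 2, 35, 10
)
Is there a simple way to get there with pivot_longer? Extracting the key number (the store, in this case) would probably need some complicated regular expression that I don't know how to write.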
Answer 0: (score: 2)
Yes, we can:
tidyr::pivot_longer(input_table,
                    cols = -product,
                    names_to = c(".value", "Store"),
                    names_pattern = "(\\w+).*?(\\d)")
# product Store Price Quantity
# <chr> <chr> <dbl> <dbl>
#1 product a 1 10 100
#2 product a 2 20 70
#3 product b 1 30 10
#4 product b 2 35 10
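We use names_pattern to obtain the column name (Price or Quantity) together with the store number. The first word (\\w+) becomes the column name, and the first digit after it (\\d) is taken as the store number. The ".value" entry in names_to tells pivot_longer that the first captured group supplies the names of the output columns.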
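To see what the two capture groups pick up, the pattern can be applied to the column names directly. The following is only a sketch in base R: it uses regexec() with perl = TRUE as a stand-in for whatever regex engine pivot_longer uses internally, and the names col_names, matches and captured are made up for illustration.

```r
# Column names from the question's input_table (everything except `product`)
col_names <- c(
  "Price Store 1 ($1000/unit)",
  "Quantity in Store 1 (units)",
  "Price Store 2 ($1000/unit)",
  "Quantity in Store 2 (units)"
)

# Same pattern as in the answer above
pattern <- "(\\w+).*?(\\d)"

# regexec() returns the full match followed by one entry per capture group
matches <- regmatches(col_names, regexec(pattern, col_names, perl = TRUE))

# Keep only the capture groups: element 2 is the first word, element 3 the digit
captured <- t(sapply(matches, function(m) m[2:3]))
colnames(captured) <- c(".value", "Store")
captured
```

Each row of captured corresponds to one original column: the first group gives the name of the output column (Price or Quantity) and the second gives the store number.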
Answer 1: (score: 1)
We can use names_pattern in pivot_longer to match one or more letters, followed by one or more non-digit characters, and then capture the digit:
library(tidyr)
pivot_longer(input_table, cols = -product,
             names_to = c(".value", "Store"),
             names_pattern = "([A-Za-z]+)[^0-9]+([0-9])")
# A tibble: 4 x 4
# product Store Price Quantity
# <chr> <chr> <dbl> <dbl>
#1 product a 1 10 100
#2 product a 2 20 70
#3 product b 1 30 10
#4 product b 2 35 10
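Both answers return Store as a character column, whereas the desired output in the question has a numeric store. One possible way to close that gap, assuming a tidyr version that supports the names_transform argument (1.1.0 or later, as far as I know), is sketched below. The name tidy_table is made up for illustration, and the column names keep their original capitalisation.

```r
library(tidyr)
library(tibble)

# Input table as given in the question
input_table <- tribble(
  ~"product",
  ~"Price Store 1 ($1000/unit)",
  ~"Quantity in Store 1 (units)",
  ~"Price Store 2 ($1000/unit)",
  ~"Quantity in Store 2 (units)",
  "product a", 10, 100, 20, 70,
  "product b", 30, 10, 35, 10
)

# Same reshaping as above; names_transform converts the captured
# store number from character to integer
tidy_table <- pivot_longer(
  input_table,
  cols = -product,
  names_to = c(".value", "Store"),
  names_pattern = "([A-Za-z]+)[^0-9]+([0-9])",
  names_transform = list(Store = as.integer)
)
tidy_table
```

If lower-case column names are wanted as in the question's output_table, they could be renamed afterwards, for example with names(tidy_table) <- tolower(names(tidy_table)).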