Subsetting a character vector column into multiple columns

Time: 2019-05-19 11:06:12

Tags: r dplyr text-mining tidyr data-manipulation

I have the following tibble:

colours = tribble(
  ~all,
  c('blue','green', 'red', 'pink', 'yellow', 'gold', 'orange', 'ivory', 'brown', 'beige'),
  c('green', 'red', 'pink', 'orange', 'ivory', 'beige')
)

I want to split the colours into multiple columns based on the colour families Cool, Warm, and Neutral, with one column per family.

I can do this with mutate and map combined with str_subset:

colours %>%
  mutate(
    'Cool' = map(all, ~str_subset(., '^(blue|green)$')),
    'Warm' = map(all, ~str_subset(., '^(red|pink|yellow|gold|orange)$')),
    'Neutral' = map(all, ~str_subset(., '^(ivory|brown|beige)$'))
  )

# A tibble: 2 x 4
  all        Cool      Warm      Neutral  
  <list>     <list>    <list>    <list>   
1 <chr [10]> <chr [2]> <chr [5]> <chr [3]>
2 <chr [6]>  <chr [1]> <chr [3]> <chr [2]>

But I'm wondering whether there is a more concise way to get the same result. I tried tidyr::extract(), but I can't seem to get the regular expression right.

I'm guessing my attempt is incorrect because the OR statements match a single word in each group, rather than splitting the string into three substrings, each containing all the matched words for that group?

1 answer:

Answer 0 (score: 0)

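I was fairly convinced that extract wouldn't work, but it does with the right regular expression. It isn't really much more "concise" than your first solution, but I think it is probably about as concise as it gets. (If you want something shorter, you could consider collapsing the colours into a two-element character vector instead of keeping a data frame with a list column.)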

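The problem with your regex pattern is the use of |. You want to target a set of words, not "x OR y OR z", which is what your pattern does and why you get only one match per row. To create a set of possible matches, use []; add * for "zero or more" matches. (Strictly speaking, [] defines a character class, so [blue green]* matches any run of the characters b, l, u, e, g, r, n and space rather than only the whole words blue and green; that is sufficient here because the strings contain nothing but the expected colour words.) Using the sample data above: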

library(tidyverse)

colours %>% 
    mutate(all = map(all, str_c, collapse = " ")) %>% 
    extract(all, c("cool", "warm", "neutral"),
            "([blue green]*) ([red pink yellow gold orange]*) ([ivory brown beige]*)",
            remove = FALSE  # keep the original `all` column
    )

#### OUTPUT ####

# A tibble: 2 x 4
  all       cool       warm                        neutral          
  <list>    <chr>      <chr>                       <chr>            
1 <chr [1]> blue green red pink yellow gold orange ivory brown beige
2 <chr [1]> green      red pink orange             ivory beige      
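Because `[]` is a character class rather than a set of words, the pattern above would also accept strings that merely reuse the same letters. A stricter variant can be sketched with grouped alternation repeated as a whole. The following is a minimal base R sketch, not the answerer's method; the helper name `word_run` and the vector `x` are made up for illustration, and it assumes every string follows the cool, warm, neutral order.

```r
cool    <- c("blue", "green")
warm    <- c("red", "pink", "yellow", "gold", "orange")
neutral <- c("ivory", "brown", "beige")

# Hypothetical helper: one capture group that matches a run of whole words
# taken from `words`, each optionally followed by whitespace.
word_run <- function(words) {
  paste0("((?:(?:", paste(words, collapse = "|"), ")\\s*)*)")
}

pattern <- paste0("^", word_run(cool), word_run(warm), word_run(neutral), "$")

x <- c("blue green red pink yellow gold orange ivory brown beige",
       "green red pink orange ivory beige")

# regexec() returns the full match plus one entry per capture group;
# drop the full match and trim the trailing space each group picks up.
# This assumes every element of x matches the pattern.
m <- regmatches(x, regexec(pattern, x, perl = TRUE))
families <- t(sapply(m, function(g) trimws(g[-1])))
colnames(families) <- c("cool", "warm", "neutral")
families

# The character class accepts any mix of its letters; the word run does not.
class_ok <- grepl("^[blue green]*$", "bugle")                           # TRUE
words_ok <- grepl("^(?:(?:blue|green)\\s*)*$", "bugle", perl = TRUE)    # FALSE
```

For the two sample strings this gives the same three columns as the `extract()` call, while rejecting input such as "bugle" that the character class would let through.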

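The main caveat is that the colour categories need to be in the right order, i.e. the string must contain the colour words grouped in the order cool → warm → neutral. If they are in random order, this won't work. In fact, if the colour words are in random order, I don't think extract is useful any more, because it cannot extract individual words and then concatenate them. You also lose the list column, if that matters to you.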

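If the order isn't guaranteed, or if words from some category may be missing, you can do the following. Using a random sample of the category words (note that I have removed the list column so that you can see what is going on):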

col_rand <- tribble(
    ~all,
    sample(c('blue','green', 'red', 'pink', 'yellow', 'gold', 'orange', 'ivory', 'brown', 'beige'), 5),
    sample(c('green', 'red', 'pink', 'orange', 'ivory', 'beige'), 4)
) %>% 
    mutate(all = map(all, str_c, collapse = " ") %>% unlist())

#### OUTPUT ####

# A tibble: 2 x 1
  all                       
  <chr>                     
1 blue yellow red beige pink
2 ivory pink beige orange   

And with the following patterns:

patts <- c(cool = "blue|green",
           warm = "red|pink|yellow|gold|orange",
           neutral = "ivory|brown|beige"
           )

You can do the following, which extracts the matches and concatenates them, returning NA if there are no matches:

library(magrittr)

unlist(col_rand$all) %>% 
    map_dfr(function(x) {str_extract_all(x, patts) %>%
            map(function(x) ifelse(length(x) == 0,
                                   NA_character_,
                                   str_c(x, collapse = " ")
                                   )
                ) %>% 
            bind_cols()}) %>% 
    set_colnames(names(patts)) %>% bind_cols(col_rand, .)

#### OUTPUT ####

# A tibble: 2 x 4
  all                        cool  warm            neutral    
  <chr>                      <chr> <chr>           <chr>      
1 blue yellow red beige pink blue  yellow red pink beige      
2 ivory pink beige orange    NA    pink orange     ivory beige
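The same order-independent result can also be reached without regular expressions, by splitting each string into words and looking them up in each family. This is a minimal base R sketch under the assumption that words are separated by single spaces; the names `families`, `strings` and `result` are illustrative and not from the original answer, and the input is hard-coded to the sampled rows shown above so that the result is reproducible.

```r
# Colour families as plain word vectors (exact matches, no regex needed).
families <- list(
  cool    = c("blue", "green"),
  warm    = c("red", "pink", "yellow", "gold", "orange"),
  neutral = c("ivory", "brown", "beige")
)

# The two sampled strings from the output above.
strings <- c("blue yellow red beige pink", "ivory pink beige orange")
words   <- strsplit(strings, " ", fixed = TRUE)

# For each family, keep the words of each row that belong to it, in their
# original order; return NA when a row has no word from that family.
res <- sapply(families, function(fam) {
  vapply(words, function(w) {
    hit <- w[w %in% fam]
    if (length(hit) == 0) NA_character_ else paste(hit, collapse = " ")
  }, character(1))
})

result <- data.frame(all = strings, res, stringsAsFactors = FALSE)
result
```

Exact word lookup also avoids a side effect of unanchored patterns such as `red|pink`, which would match inside longer words as well.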
  

Note that the magrittr library is loaded for set_colnames. If you load magrittr after tidyr / tidyverse, you need to write tidyr::extract() in the code above, because both libraries have an extract function.