Using the gsub function with multiple conditions in R

Time: 2012-11-20 11:10:45

Tags: r search copy gsub

Searching for unique values in dataframe and creating a table with them

Here is what my data looks like:

    UUID    Source
1   Jane    http//mywebsite.com44bb00?utm_source=ADW&utm_medium=banner&utm_campaign=Monk&gclid1234
2   Mike    http//mywebsite.com44bb00?utm_source=Google&utm_medium=cpc&utm_campaign=DOG&gclid1234
3   John    http//mywebsite.com44bb00?utm_source=Yahoo&utm_medium=banner&utm_campaign=DOG&gclid1234
4   Sarah   http//mywebsite.com44bb00?utm_source=Facebookdw&utm_medium=cpc&utm_campaign=CAT&gclid1234
5   Michael http//mywebsite.com44bb00?utm_source=Twitter&utm_medium=GDNr&utm_campaign=CAT&gclid1234
6   Bob     http//mywebsite.com44bb00?utm_source=ADW&utm_medium=GDN&utm_campaign=DOG&gclid1234
7   Mark    http//mywebsite.com44bb00?utm_source=Twitter&utm_medium=banner&utm_campaign=MONK&gclid1234
8   Anna    http//mywebsite.com44bb00?utm_source=Facebook&utm_medium=banner&utm_campaign=MONK&gclid1234

This is what I want to achieve:

    NAME    UTM_SOURCE  UTM_MEDIUM  UTM_CAMPAIGN
1   Jane    ADW             banner     Monk
2   Mike    Google          cpc        DOG
3   John    Yahoo           banner     DOG
4   Sarah   Faceboo         cpc        CAT
5   Michael Twitter         GDN        CAT
6   Bob     ADW             GDN        DOG
7   Mark    Twitter         banner     MONK
8   Anna    Facebook        banner     MONK

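In other words, I want to extract specific pieces of information based on criteria. Example: search the data frame for the value "utm_source=" and, once it is found, copy whatever sits between the "=" and "&" signs. For user no. 1 (Jane), if you look at the original file, her source URL contains the value "utm_source=ADW". In the output file, the "ADW" bit is extracted and inserted into a new column named "utm_source". The same principle applies to all other users and to the other dimensions (utm_medium and utm_campaign).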

I know the gsub function can help me. This is what I have tried so far:

> file1 <- read.csv("C:/Users/Dumitru Ostaciu/Desktop/Users.csv")
> file1 <- transform(file1, Source = as.character(Source))
> file2 <- gsub(".*\\?utm_source=", "", file1$Source)

This is the result I get:

  UUID  SOURCE
    1   ADW&utm_medium=banner&utm_campaign=Monk&gclid1234
    2   Google&utm_medium=cpc&utm_campaign=DOG&gclid1234
    3   Yahoo&utm_medium=banner&utm_campaign=DOG&gclid1234
    4   Facebookdw&utm_medium=cpc&utm_campaign=CAT&gclid1234
    5   Twitter&utm_medium=GDNr&utm_campaign=CAT&gclid1234
    6   ADW&utm_medium=GDN&utm_campaign=DOG&gclid1234
    7   Twitter&utm_medium=banner&utm_campaign=MONK&gclid1234
    8   Facebook&utm_medium=banner&utm_campaign=MONK&gclid1234   

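I have two questions:

1) In the output I get, the function copied everything after the value "utm_source=". How can I add another condition so that only the text between "=" and "&" is copied?

2) How can I keep the values that were originally in the first column (UUID): Jane, Mike, John, etc.?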
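
One possible direction for both questions, offered as a minimal sketch rather than a definitive answer: `sub()` with a capture group can keep only the text between `utm_source=` and the next `&`, and assigning the result to a new column leaves UUID untouched. The data frame `df` below is a hypothetical two-row stand-in for the real file.

```r
# Hypothetical two-row stand-in for the real data
df <- data.frame(
  UUID = c("Jane", "Mike"),
  Source = c(
    "http//mywebsite.com44bb00?utm_source=ADW&utm_medium=banner&utm_campaign=Monk&gclid1234",
    "http//mywebsite.com44bb00?utm_source=Google&utm_medium=cpc&utm_campaign=DOG&gclid1234"
  ),
  stringsAsFactors = FALSE
)

# ([^&]*) captures everything after "utm_source=" up to the next "&";
# the surrounding .* consume the rest, so only the capture is kept
df$UTM_SOURCE <- sub(".*[?&]utm_source=([^&]*).*", "\\1", df$Source)

df[, c("UUID", "UTM_SOURCE")]
```

Note that `sub()` returns the input unchanged when the pattern does not match, so URLs without a `utm_source` parameter would need separate handling.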

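2 Answers:

Answer 0: (score: 1)

You need to do two things:

  1. Use gsub to remove the website name from your source
  2. Use strsplit to split the remaining string at each occurrence of &

    Read in the data: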

    x <- read.table(text="
    UUID    Source
    1   Jane    http//mywebsite.com44bb00?utm_source=ADW&utm_medium=banner&utm_campaign=Monk&gclid1234
    2   Mike    http//mywebsite.com44bb00?utm_source=Google&utm_medium=cpc&utm_campaign=DOG&gclid1234
    3   John    http//mywebsite.com44bb00?utm_source=Yahoo&utm_medium=banner&utm_campaign=DOG&gclid1234
    4   Sarah   http//mywebsite.com44bb00?utm_source=Facebookdw&utm_medium=cpc&utm_campaign=CAT&gclid1234
    5   Michael http//mywebsite.com44bb00?utm_source=Twitter&utm_medium=GDNr&utm_campaign=CAT&gclid1234
    6   Bob     http//mywebsite.com44bb00?utm_source=ADW&utm_medium=GDN&utm_campaign=DOG&gclid1234
    7   Mark    http//mywebsite.com44bb00?utm_source=Twitter&utm_medium=banner&utm_campaign=MONK&gclid1234
    8   Anna    http//mywebsite.com44bb00?utm_source=Facebook&utm_medium=banner&utm_campaign=MONK&gclid1234", header=TRUE, stringsAsFactors=FALSE)
    

    Use strsplit to split the source string at each &:

    z <- matrix(
      unlist(strsplit(gsub(".*\\?", "", x$Source), "\\&")), 
      ncol=4, byrow=TRUE)
    cbind(x$UUID, gsub(".*=", "", z))
    
         [,1]      [,2]         [,3]     [,4]   [,5]       
    [1,] "Jane"    "ADW"        "banner" "Monk" "gclid1234"
    [2,] "Mike"    "Google"     "cpc"    "DOG"  "gclid1234"
    [3,] "John"    "Yahoo"      "banner" "DOG"  "gclid1234"
    [4,] "Sarah"   "Facebookdw" "cpc"    "CAT"  "gclid1234"
    [5,] "Michael" "Twitter"    "GDNr"   "CAT"  "gclid1234"
    [6,] "Bob"     "ADW"        "GDN"    "DOG"  "gclid1234"
    [7,] "Mark"    "Twitter"    "banner" "MONK" "gclid1234"
    [8,] "Anna"    "Facebook"   "banner" "MONK" "gclid1234"
    

    Then convert to a data frame and add names:

    z <- matrix(
      unlist(strsplit(gsub(".*\\?", "", x$Source), "\\&")), 
      ncol=4, byrow=TRUE)
    z <- cbind(x$UUID, gsub(".*=", "", z))
    z <- as.data.frame(z[, -5])
    names(z) <- c("UUID", "UTM_SOURCE", "UTM_MEDIUM", "UTM_CAMPAIGN")
    z
    
         UUID UTM_SOURCE UTM_MEDIUM UTM_CAMPAIGN
    1    Jane        ADW     banner         Monk
    2    Mike     Google        cpc          DOG
    3    John      Yahoo     banner          DOG
    4   Sarah Facebookdw        cpc          CAT
    5 Michael    Twitter       GDNr          CAT
    6     Bob        ADW        GDN          DOG
    7    Mark    Twitter     banner         MONK
    8    Anna   Facebook     banner         MONK
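
The matrix approach above relies on every URL having exactly four `&`-separated pieces in the same order. As a hedged alternative sketch, each query string can instead be parsed into key/value pairs and the wanted keys looked up by name. The helper name `parse_query` and the two-row sample `x` (whose second row deliberately reorders the parameters) are illustrative assumptions, not part of the original answer.

```r
# Hypothetical sample with the same shape as the question's data
x <- data.frame(
  UUID = c("Jane", "Mike"),
  Source = c(
    "http//mywebsite.com44bb00?utm_source=ADW&utm_medium=banner&utm_campaign=Monk&gclid1234",
    "http//mywebsite.com44bb00?utm_medium=cpc&utm_source=Google&utm_campaign=DOG&gclid1234"
  ),
  stringsAsFactors = FALSE
)

# Illustrative helper: return the values of the requested keys for one URL
parse_query <- function(url, keys) {
  query <- sub(".*\\?", "", url)                   # drop everything up to "?"
  parts <- strsplit(query, "&", fixed = TRUE)[[1]]
  parts <- parts[grepl("=", parts, fixed = TRUE)]  # ignore pieces without "="
  values <- sub("^[^=]*=", "", parts)              # text after the first "="
  names(values) <- sub("=.*", "", parts)           # text before the first "="
  unname(values[keys])                             # NA where a key is missing
}

keys <- c("utm_source", "utm_medium", "utm_campaign")
parsed <- t(vapply(x$Source, parse_query, character(3), keys = keys,
                   USE.NAMES = FALSE))
z <- data.frame(UUID = x$UUID, parsed, stringsAsFactors = FALSE)
names(z) <- c("UUID", toupper(keys))
z
```

Looking keys up by name means a missing parameter shows up as `NA` in that row instead of shifting the remaining values into the wrong columns.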
    

Answer 1: (score: 1)

This is what I did:

> file1 <- read.csv("C:/Users/Dumitru Ostaciu/Desktop/Users.csv")
> file1 <- transform(file1, Source = as.character(Source))
> z <- matrix(
     unlist(strsplit(gsub(".*\\?", "", file1$Source), "\\&")), 
     ncol=4, byrow=TRUE)
> file2 <- cbind(file1$UUID, gsub(".*=", "", z))

This is the result I get:

    V1  V2          V3      V4      V5
1   3   ADW         banner  Monk    gclid1234
2   7   Google      cpc     DOG     gclid1234
3   4   Yahoo       banner  DOG     gclid1234
4   8   Facebookdw  cpc     CAT     gclid1234
5   6   Twitter     GDNr    CAT     gclid1234
6   2   ADW         GDN     DOG     gclid1234
7   5   Twitter     banner  MONK    gclid1234
8   1   Facebook    banner  MONK    gclid1234

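I should point out that my real data will have 500,000 rows, each with a unique name in the first column.

How can I fix this so that the names appear in V1? What is my mistake?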
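
A likely cause, stated as an assumption since the CSV itself is not shown: before R 4.0.0, `read.csv` defaulted to `stringsAsFactors = TRUE`, so `UUID` would have been read as a factor, and `cbind` on a factor uses its integer codes rather than its labels. The numbers in V1 are consistent with that, since they match the alphabetical rank of each name (Anna = 1, Bob = 2, Jane = 3, and so on). Below is a minimal sketch of the effect and one way around it, using a made-up three-row `file1`.

```r
# Made-up stand-in for file1, with UUID deliberately stored as a factor
file1 <- data.frame(
  UUID = factor(c("Jane", "Mike", "Anna")),
  Source = c("a?utm_source=ADW", "a?utm_source=Google", "a?utm_source=Yahoo"),
  stringsAsFactors = FALSE
)
src <- gsub(".*=", "", file1$Source)

# cbind() on a factor yields its integer codes, not its labels
bad <- cbind(file1$UUID, src)

# Converting to character first keeps the names
good <- cbind(as.character(file1$UUID), src)
good
```

The `transform()` call in the attempt above converts only `Source` to character. Reading the file with `read.csv(..., stringsAsFactors = FALSE)` would avoid the factor conversion for every column.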