Using regular expressions

Time: 2016-05-09 13:28:16

Tags: regex r import gsub qualtrics

I want to remove the text between a comma and a dash in a long string of comma-separated variable labels. Here is a minimal example of my string:

myvarlabels <- ("participant number, How much do you like the following products-green tea, How much do you like the following products-beer,\"How much, if anything at all, would you be willing to pay for these products if they were ...-Japanese, Chinese, and Indian green tea\",\"How much, if anything at all, would you be willing to pay for these products if they were ...-Japanese, Chinese, and Indian beer\"")

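Importantly, the variable labels come in two different forms, which should be shortened as follows:

  • How much do you like the following products-green tea
  • should be shortened to: green tea
  • \"How much, if anything at all, would you be willing to pay for these products if they were ...-Japanese, Chinese, and Indian green tea\"
  • should be shortened to: \"Japanese, Chinese, and Indian green tea\"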

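I tried using gsub with a regular expression to identify and remove the text between the comma and the dash (i.e. replace that text with "").

Does anyone have a suggestion for how to use gsub to remove the text between the comma that marks the start of a new column and the dash that precedes the text I want to keep, while preserving the double quotes?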

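Edit 1

More precisely, the data contains four kinds of comma-separated text sequences. Each one specifies what information the corresponding variable holds:

  1. A short description of one or more words (e.g. participant number)

  2. A longer description in which the relevant information comes only after the dash (e.g. How much do you like the following products-green tea)

  3. As 2., but with a comma somewhere before the dash (e.g. How much, if anything at all, would you ...); this is why text of this type is enclosed in double quotes (otherwise it is not read in correctly)

  4. As 3., but with no comma before the dash (e.g. How much experience do you have with the following products)

All four types of text sequence are preceded and followed by a comma and can occur in any order.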

Here is a new minimal example that reflects the real data more accurately than my first one:

    (myvarlabels3 <- ("participant number,age,gender,body mass index,How much do you like the following products-green tea,How much do you like the following products-beer,outdoor temperature,season,\"How much experience do you have with the following products-Indian spices\",\"How much, if anything at all, would you be willing to pay for these products if they were ...-Japanese, Chinese, and Indian beer\",email,telephone number"))
    

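Cath's code (EDIT2) works up to a point. When I add more "simple" type 1 text sequences at the start of the string, or when I add text sequences of the kind specified under 4. in the list above, the code no longer works correctly.

However, when Cath's EDIT2 code is run in two steps, it works perfectly: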

    myvarlabels3 <- gsub("((?<=,\")[^-]*[^-]+-)|((?<=,\")[^-],*[^-]+-)", "", myvarlabels3, perl=TRUE) # step 1: shorten the text sequences specified under 3. and 4. in the list above
    
    [1] "participant number,age,gender,body mass index,How much do you like the following products-green tea,How much do you like the following products-beer,outdoor temperature,season,\"Indian spices\",\"Japanese, Chinese, and Indian beer\",email,telephone number"
    
    gsub("((?<=,)[^-\",]+-)", "", myvarlabels3, perl=TRUE) # step 2: shorten the text sequences specified as 2. in the above list
    
    [1] "participant number,age,gender,body mass index,green tea,beer,outdoor temperature,season,\"Indian spices\",\"Japanese, Chinese, and Indian beer\",email,telephone number"
    

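I think this may be possible with a single line of code, but I can't figure out how. In any case, this will greatly ease my workflow when I import messy csv files from Qualtrics.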
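As a cross-check on the two-step result above, here is a minimal sketch of a field-wise alternative: split the string into fields with scan(), using quote = "\"" so that commas inside quoted labels are kept, then drop everything up to the first dash in each field. It assumes that labels without a question prefix (type 1) never contain a dash. The names fields and short_labels are illustrative and do not come from the original post.

```r
# The example string from Edit 1.
myvarlabels3 <- "participant number,age,gender,body mass index,How much do you like the following products-green tea,How much do you like the following products-beer,outdoor temperature,season,\"How much experience do you have with the following products-Indian spices\",\"How much, if anything at all, would you be willing to pay for these products if they were ...-Japanese, Chinese, and Indian beer\",email,telephone number"

# Split on commas; quoted fields keep their embedded commas and lose their quotes.
fields <- scan(text = myvarlabels3, what = character(), sep = ",",
               quote = "\"", quiet = TRUE)

# Remove everything up to and including the first dash, if there is one.
short_labels <- sub("^[^-]*-", "", fields)
print(short_labels)
```

Unlike the gsub approach, this returns a character vector with one element per label and without the surrounding double quotes, which may be more convenient if the labels are going to be assigned to columns anyway.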

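1 answer:

Answer 0 (score: 1)

I'm not sure I understand what your desired output is, but you could try to spot the "beginning of a new column" based on "How much" and then continue until you meet a dash: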

gsub("(^[^,]+, )|(How much[^-]+-)", "", myvarlabels, perl=TRUE)
[1] "green tea, beer,\"Japanese, Chinese, and Indian green tea\",\"Japanese, Chinese, and Indian beer\""

Edit

Given your patterns, you could try the following:

gsub("((?<=, )[^-\"]+-)|((?<=,\")[^-]*,[^-]+-)", "", myvarlabels, perl=TRUE)
[1] "participant number, green tea, beer,\"Japanese, Chinese, and Indian green tea\",\"Japanese, Chinese, and Indian beer\""

I use two alternative patterns, one for each of the two forms you describe, and use lookbehinds to specify what must be present but should be kept.
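To illustrate why the lookbehind matters, here is a small sketch on a made-up string (x and the "Question" labels are hypothetical, not from the data above). Without a lookbehind the separating comma is part of the match and gets deleted; with (?<=,) the comma is required but not consumed.

```r
# Hypothetical toy input with the same comma/dash layout as the real labels.
x <- "id,Question-tea,Question-beer"

# The comma is part of the match, so the separators are lost.
without_lookbehind <- gsub(",[^-,]+-", "", x)

# The comma is only asserted by the lookbehind, so it stays in the result.
with_lookbehind <- gsub("(?<=,)[^-,]+-", "", x, perl = TRUE)

print(without_lookbehind)  # "idteabeer"
print(with_lookbehind)     # "id,tea,beer"
```

Lookbehinds are only available with perl = TRUE, which is why every call here sets it.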

EDIT2

If there is no space between the comma and a question that does not start with a quote, you can do the following:

myvarlabels_2 <- ("participant number,How much do you like the following products-green tea, How much do you like the following products-beer,\"How much, if anything at all, would you be willing to pay for these products if they were ...-Japanese, Chinese, and Indian green tea\",\"How much, if anything at all, would you be willing to pay for these products if they were ...-Japanese, Chinese, and Indian beer\"")
gsub("((?<=,)[^-\",]+-)|((?<=,\")[^-]*,[^-]+-)", "", myvarlabels_2, perl=TRUE)
[1] "participant number,green tea,beer,\"Japanese, Chinese, and Indian green tea\",\"Japanese, Chinese, and Indian beer\""
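Regarding the single-line version asked for in the question, the following is a sketch rather than a tested general solution. It relaxes the second alternative so that a quoted label no longer needs a comma before the dash (type 4 in the question's list). It assumes every quoted label contains a dash; a quoted label without one would let [^-]+ run on into the following fields. The names pattern and result are illustrative.

```r
# The example string from the question's Edit 1.
myvarlabels3 <- "participant number,age,gender,body mass index,How much do you like the following products-green tea,How much do you like the following products-beer,outdoor temperature,season,\"How much experience do you have with the following products-Indian spices\",\"How much, if anything at all, would you be willing to pay for these products if they were ...-Japanese, Chinese, and Indian beer\",email,telephone number"

# Alternative 1: unquoted label, no comma or quote allowed before the dash.
# Alternative 2: quoted label, anything except a dash allowed before the dash.
pattern <- "(?<=,)[^-\",]+-|(?<=,\")[^-]+-"

result <- gsub(pattern, "", myvarlabels3, perl = TRUE)
print(result)
```

On this example the result is the same string as the two-step version in the question produces.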