Regex to remove unwanted text from a string

Time: 2014-08-28 12:23:32

Tags: regex sed cut

I am trying to extract a small amount of information from a large string like this:

[[["좋은","good","joh-eun",""]],[["adjective",[["좋은",["good","nice","pretty","admirable","canny","tenacious"],,0.38553435]],"good",4],["adverb",["훌륭하게",["wonderfully","good","nicely","beautifully","fine","finely"],,0.00029145498],"good",4]]]

I want to extract strings like this:

좋은 - good
좋은 - good,nice,pretty,admirable,canny,tenacious (basically adjectives)
훌륭하게 - wonderfully,good,nicely,beautifully,fine,finely (adverbs)

Please help. I tried piping cut into sed, like this:

cut --delimiter='"' -f 1-2 and then use sed 's/\[\[\[\"//'

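This only gives me the first Korean word, 좋은, and I cannot extend it to get the desired result. If there is a better way to achieve this, please suggest it. Thanks in advance.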

2 Answers:

Answer 0: (score: 2)

A bit late, but here is a pure regex that fits a sed-style substitution:

Regex: \[\[\["(.*?)","(.*?)"\]\],\[\["(.*?)",\[\["(.*?)",\["(.*?)"\],.*?\]\],.*?\],\["(.*?)",\["(.*?)",\["(.*)"\],.*\]\]\]

Replacement: \1 - \2\n\4 - \5 (\3)\n\7 - \8 (\6)

demo

This assumes the original line always contains both the adjective and the adverb brackets (even if they are empty).

See the replacement in the demo to understand how the captured groups are put back together.
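To make the group numbering concrete, here is a minimal Python sketch of the same idea. It is not the answerer's exact pattern: the extra `.*?` after group 2 is an assumption added here so the romanisation fields are skipped, and the name `summarize_with_regex` is hypothetical. Python's `re` is used because lazy quantifiers such as `.*?` are not part of the POSIX regex syntax that sed uses.

```python
import re

SAMPLE = ('[[["좋은","good","joh-eun",""]],[["adjective",[["좋은",["good","nice",'
          '"pretty","admirable","canny","tenacious"],,0.38553435]],"good",4],'
          '["adverb",["훌륭하게",["wonderfully","good","nicely","beautifully",'
          '"fine","finely"],,0.00029145498],"good",4]]]')

# Adapted from the answer's pattern. Assumption: the ".*?" after group 2
# is added here to skip the remaining fields of the first bracket.
PATTERN = re.compile(
    r'\[\[\["(.*?)","(.*?)".*?\]\],'                       # 1: word, 2: translation
    r'\[\["(.*?)",\[\["(.*?)",\["(.*?)"\],.*?\]\],.*?\],'  # 3: class, 4: word, 5: list
    r'\["(.*?)",\["(.*?)",\["(.*)"\],.*\]\]\]'             # 6: class, 7: word, 8: list
)

def summarize_with_regex(line):
    """Return the three summary lines, or an empty list if nothing matches."""
    m = PATTERN.search(line)
    if m is None:
        return []
    # Captured lists still contain the inner quotes ('good","nice'); drop them.
    g = [s.replace('"', '') for s in m.groups()]
    return [
        f"{g[0]} - {g[1]}",
        f"{g[3]} - {g[4]} ({g[2]})",
        f"{g[6]} - {g[7]} ({g[5]})",
    ]

for out in summarize_with_regex(SAMPLE):
    print(out)
```

Like the original pattern, this sketch only works when both the adjective and the adverb sections are present in the line.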

Answer 1: (score: 1)

This is a bit of Ruby, but any tool equipped with PCRE can probably do something similar:

ruby -ne '
    $_.gsub(/"/,"")
      .scan(/ (\p{Hangul}+) ,\[? (.+?) \] /x) {|m| puts m[0] + " - " + m[1]}
' <<END
[[["좋은","good","joh-eun",""]],[["adjective",[["좋은",["good","nice","pretty","admirable","canny","tenacious"],,0.38553435]],"good",4],["adverb",["훌륭하게",["wonderfully","good","nicely","beautifully","fine","finely"],,0.00029145498],"good",4]]]
END
좋은 - good,joh-eun,
좋은 - good,nice,pretty,admirable,canny,tenacious
훌륭하게 - wonderfully,good,nicely,beautifully,fine,finely

Too bad the original text is not readily parseable JSON (the empty `,,` slots make it invalid).
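If the only obstacle is the empty `,,` slots, one possible workaround is to fill them with `null` and then parse the result as JSON. The sketch below is in Python and rests on two assumptions: that `,,` never occurs inside a quoted string, and that the nesting looks like the sample in the question. The names `parse_loose` and `summarize_with_json` are hypothetical.

```python
import json
import re

SAMPLE = ('[[["좋은","good","joh-eun",""]],[["adjective",[["좋은",["good","nice",'
          '"pretty","admirable","canny","tenacious"],,0.38553435]],"good",4],'
          '["adverb",["훌륭하게",["wonderfully","good","nicely","beautifully",'
          '"fine","finely"],,0.00029145498],"good",4]]]')

def parse_loose(text):
    """Parse JSON-like text in which empty array slots are written as ',,'."""
    # Assumption: ",," never appears inside a quoted string in the input.
    return json.loads(re.sub(r',(?=,)', ',null', text))

def summarize_with_json(text):
    data = parse_loose(text)
    word, translation = data[0][0][0], data[0][0][1]
    lines = [f"{word} - {translation}"]
    for pos, entries, *_ in data[1]:
        # In the sample, the adjective entries are nested one level deeper
        # than the adverb entry, so normalise both to a list of entries.
        if entries and isinstance(entries[0], str):
            entries = [entries]
        for entry in entries:
            lines.append(f"{entry[0]} - {','.join(entry[1])} ({pos})")
    return lines

for out in summarize_with_json(SAMPLE):
    print(out)
```

Unlike a single fixed regex, this walks whatever part-of-speech sections happen to be present, so it does not require both the adjective and the adverb brackets to exist.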

Thanks to this question for showing how to match Korean characters.
