正则表达式不尊重双引号

时间:2018-04-16 16:00:08

标签: javascript regex

我正在尝试拆分以下ELB条目:

2018-04-16T08:09:27.203Z cae70dd2-414c-11e8-836a-354cb4985a41 https 2018-04-15T01:20:31.092381Z app/MBM-L-Publi-V9D386A91UNR/4695f2e72859f540 128.121.50.133:59367 10.0.1.14:80 0.001 0.003 0.000 200 200 934 282 "GET https://www.domain.tld:443/__utm.gif?v=1&_v=j66&a=1866784098&t=pageview&_s=1&dl=https%3A%2F%2Fwww.domain.tld%2Fnews%2Farchived%2Fresources-archived%22001-11%2F&ul=en-us&de=UTF-8&dt=Racal%20reborn%20after%20Thales%20buyout&sd=24-bit&sr=412x732&vp=404x732&je=0&cid=1296878891.1495497600&_gid=1908154735.1495497600&_r=1&z=821631926 HTTP/1.1" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" ECDHE-RSA-AES128-GCM-SHA256 TLSv1.2 arn:aws:elasticloadbalancing:eu-west-2:123456789012:targetgroup/MBM-L-Cache-1LH0DNU489D55/167e4810f75804c3 "Root=1-5ad2a8df-021aaad5031047e7dec3f2fa" "www.domain.tld" "arn:aws:acm:eu-west-2:123456789012:certificate/1140cbb2-4d4f-44b0-a4d9-a79329c5e361" 0

使用此正则表达式:

const splitElbEntry = (elbLogEntry) => R.match(/\S+|"[^"]*"/g)(elbLogEntry.trim())

但似乎无法正常工作https://regex101.com/r/JOlrxS/1

我喜欢在双引号中保留任何内容,例如

Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)

1 个答案:

答案 0 :(得分:5)

更改选项的顺序:订单重要

为什么会这样?

正则表达式引擎将按照您提供的顺序尝试每个选项。 \S+|"[^"]*"始终首先尝试匹配\S+。如果\S+在字符串中的给定位置无法匹配,则会尝试第二个选项"[^"]*"

由于\S"匹配,因此第一个选项是与现有正则表达式匹配的唯一选项(您的第二个选项永远不会尝试),因此您可以将现有的正则表达式更改为\S+。展开下面的代码段,看看\S+|"[^"]*"\S+会产生相同的结果。

您的正则表达式\S+|"[^"]*"

var s = `2018-04-16T08:09:27.203Z cae70dd2-414c-11e8-836a-354cb4985a41 https 2018-04-15T01:20:31.092381Z app/MBM-L-Publi-V9D386A91UNR/4695f2e72859f540 128.121.50.133:59367 10.0.1.14:80 0.001 0.003 0.000 200 200 934 282 "GET https://www.domain.tld:443/__utm.gif?v=1&_v=j66&a=1866784098&t=pageview&_s=1&dl=https%3A%2F%2Fwww.domain.tld%2Fnews%2Farchived%2Fresources-archived%22001-11%2F&ul=en-us&de=UTF-8&dt=Racal%20reborn%20after%20Thales%20buyout&sd=24-bit&sr=412x732&vp=404x732&je=0&cid=1296878891.1495497600&_gid=1908154735.1495497600&_r=1&z=821631926 HTTP/1.1" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" ECDHE-RSA-AES128-GCM-SHA256 TLSv1.2 arn:aws:elasticloadbalancing:eu-west-2:123456789012:targetgroup/MBM-L-Cache-1LH0DNU489D55/167e4810f75804c3 "Root=1-5ad2a8df-021aaad5031047e7dec3f2fa" "www.domain.tld" "arn:aws:acm:eu-west-2:123456789012:certificate/1140cbb2-4d4f-44b0-a4d9-a79329c5e361" 0`
console.log(s.match(/\S+|"[^"]*"/g))

您的正则表达式已简化为\S+

var s = `2018-04-16T08:09:27.203Z cae70dd2-414c-11e8-836a-354cb4985a41 https 2018-04-15T01:20:31.092381Z app/MBM-L-Publi-V9D386A91UNR/4695f2e72859f540 128.121.50.133:59367 10.0.1.14:80 0.001 0.003 0.000 200 200 934 282 "GET https://www.domain.tld:443/__utm.gif?v=1&_v=j66&a=1866784098&t=pageview&_s=1&dl=https%3A%2F%2Fwww.domain.tld%2Fnews%2Farchived%2Fresources-archived%22001-11%2F&ul=en-us&de=UTF-8&dt=Racal%20reborn%20after%20Thales%20buyout&sd=24-bit&sr=412x732&vp=404x732&je=0&cid=1296878891.1495497600&_gid=1908154735.1495497600&_r=1&z=821631926 HTTP/1.1" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" ECDHE-RSA-AES128-GCM-SHA256 TLSv1.2 arn:aws:elasticloadbalancing:eu-west-2:123456789012:targetgroup/MBM-L-Cache-1LH0DNU489D55/167e4810f75804c3 "Root=1-5ad2a8df-021aaad5031047e7dec3f2fa" "www.domain.tld" "arn:aws:acm:eu-west-2:123456789012:certificate/1140cbb2-4d4f-44b0-a4d9-a79329c5e361" 0`
console.log(s.match(/\S+/g))

你如何解决这个问题?

更改选项的顺序会让正则表达式引擎首先尝试"[^"]*",然后,如果不匹配,则尝试\S+

See regex in use here

"[^"]*"|\S+

var s = `2018-04-16T08:09:27.203Z cae70dd2-414c-11e8-836a-354cb4985a41 https 2018-04-15T01:20:31.092381Z app/MBM-L-Publi-V9D386A91UNR/4695f2e72859f540 128.121.50.133:59367 10.0.1.14:80 0.001 0.003 0.000 200 200 934 282 "GET https://www.domain.tld:443/__utm.gif?v=1&_v=j66&a=1866784098&t=pageview&_s=1&dl=https%3A%2F%2Fwww.domain.tld%2Fnews%2Farchived%2Fresources-archived%22001-11%2F&ul=en-us&de=UTF-8&dt=Racal%20reborn%20after%20Thales%20buyout&sd=24-bit&sr=412x732&vp=404x732&je=0&cid=1296878891.1495497600&_gid=1908154735.1495497600&_r=1&z=821631926 HTTP/1.1" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" ECDHE-RSA-AES128-GCM-SHA256 TLSv1.2 arn:aws:elasticloadbalancing:eu-west-2:123456789012:targetgroup/MBM-L-Cache-1LH0DNU489D55/167e4810f75804c3 "Root=1-5ad2a8df-021aaad5031047e7dec3f2fa" "www.domain.tld" "arn:aws:acm:eu-west-2:123456789012:certificate/1140cbb2-4d4f-44b0-a4d9-a79329c5e361" 0`
console.log(s.match(/"[^"]*"|\S+/g))