我需要在JavaScript中使用正则表达式的帮助。我有以下字符串(它没有换行符):
var str = 'DetailedLog 18.11.2015 14:41:35.299 Neutral : 0,5704 Happy : 0,6698 Sad : 0,0013 Angry : 0,0040 Surprised : 0,0129 Scared : 0,0007 Disgusted : 0,0048 Valence : 0,6650 Arousal : 0,2297 Gender : Male Age : 20 - 30 Beard : None Moustache : None Glasses : Yes Ethnicity : Caucasian Y - Head Orientation : -1,7628 X - Head Orientation : 2,5652 Z - Head Orientation : -3,0980 Landmarks : 375,4739 - 121,6879 - 383,2627 - 113,6502 - 390,8202 - 110,3507 - 396,1021 - 109,7039 - 404,9615 - 110,9594 - 443,2603 - 108,9765 - 451,9454 - 106,7192 - 457,1207 - 106,8835 - 464,1162 - 109,5496 - 470,9659 - 116,8992 - 387,4940 - 132,0171 - 406,4031 - 130,4482 - 441,6239 - 128,6356 - 460,6862 - 128,1997 - 419,0713 - 161,6479 - 425,3519 - 155,1223 - 431,9862 - 160,6411 - 406,9320 - 190,3831 - 411,4790 - 188,7656 - 423,1751 - 185,6583 - 428,5339 - 185,6882 - 433,7802 - 184,8167 - 445,6192 - 186,3515 - 450,8424 - 187,2787 - 406,0796 - 191,1880 - 411,9287 - 193,5352 - 417,9666 - 193,6567 - 424,0851 - 193,4941 - 428,6678 - 193,5652 - 433,2172 - 192,7540 - 439,3548 - 192,0136 - 445,4181 - 191,1532 - 451,6007 - 187,9486 - 404,5193 - 190,6352 - 412,8277 - 185,4609 - 421,1355 - 181,2883 - 428,3182 - 181,1826 - 435,2024 - 180,2258 - 443,9292 - 183,2533 - 453,1117 - 187,2288 - 405,9689 - 193,2750 - 410,0249 - 199,8118 - 416,0457 - 203,0374 - 423,4839 - 204,1818 - 429,9247 - 204,2175 - 436,3620 - 203,1305 - 443,4268 - 200,9355 - 448,9572 - 197,1335 - 452,0746 - 190,0314 Quality : 0,8137 Mouth : Closed Left Eye : Open Right Eye : Open Left Eyebrow : Lowered Right Eyebrow : Lowered Identity : NO IDENTIFICATION';
我的目标是从这个混乱中构造一个可用的JavaScript对象,包含属性及其值。我试图使用正则表达式,因为据我所知,它们比使用custum for循环解析更快。执行此操作的代码需要很快。
对于属性名称,我尝试使用以下代码构造字符串数组:
str.match(/(\b[A-Z].*?\b)(?=(\s(:|\d)))/g);
这已经过时了:
["DetailedLog", "Neutral", "Happy", "Sad", "Angry", "Surprised", "Scared",
"Disgusted", "Valence", "Arousal", "Gender", "Male Age", "Beard", "None Moustache",
"None Glasses", "Yes Ethnicity", "Caucasian Y - Head Orientation", "X - Head Orientation",
"Z - Head Orientation", "Landmarks", "Quality", "Mouth", "Closed Left Eye",
"Open Right Eye", "Open Left Eyebrow", "Lowered Right Eyebrow", "Lowered Identity"]
在这里,我遇到了一个字符串的问题,这些字符串由两个大写单词组成,如"男性时代"或者"打开左眉毛"或"封闭左眼"。我将用于属性值的第一个单词,因此它会妨碍...
我的第一个问题是给出这个输出的正确正则表达式是什么:
["DetailedLog", "Neutral", "Happy", "Sad", "Angry", "Surprised", "Scared",
"Disgusted", "Valence", "Arousal", "Gender", "Age", "Beard", "Moustache",
"Glasses", "Ethnicity", "Y - Head Orientation", "X - Head Orientation",
"Z - Head Orientation", "Landmarks", "Quality", "Mouth", "Left Eye",
"Right Eye", "Left Eyebrow", "Right Eyebrow", "Identity"]
感谢您的帮助。
答案 0 :(得分:1)
(?:(DetailedLog) ([^ ]+ [^ ]+)|(\b[A-Z][A-Za-z -]+?) : ((?:(?:-?[\d,]+)(?: - -?[\d,]+)*|(?:(?:[A-Z ]+\b|[A-Za-z]+)))))(?:$| )
https://regex101.com/r/lP9pG2/3
这里的基本想法是因为我们不知道" key"我们开始尝试定义"值"更确切地说,当我们知道价值结束时停止捕捉。
DetailedLog
将始终跟随由空格分隔的2组字符,这些字符包括空格将被视为值。Happy
值将是以下之一:
-
分隔的一个或多个正数或负数或数字。注意最后一个"所有大写字符和空格的序列"是专门捕获Identity
的最后一部分NO IDENTIFICATION
。 Identity
的值或任何其他可能只包含字母和空格的值可能会导致问题,如果它们不是全部大写的话。
var result = {};
var myregexp = /(?:(DetailedLog) ([^ ]+ [^ ]+)|(\b[A-Z][A-Za-z -]+?) : ((?:(?:-?[\d,]+)(?: - -?[\d,]+)*|(?:(?:[A-Z ]+\b|[A-Za-z]+)))))(?:$| )/g;
var match = myregexp.exec(str);
while (match != null) {
if (match[1]) {
result[match[1]] = match[2];
} else {
result[match[3]] = match[4];
}
match = myregexp.exec(str);
}
这导致result
包含以下对象:
{
"DetailedLog": "18.11.2015 14:41:35.299",
"Neutral": "0,5704",
"Happy": "0,6698",
"Sad": "0,0013",
"Angry": "0,0040",
"Surprised": "0,0129",
"Scared": "0,0007",
"Disgusted": "0,0048",
"Valence": "0,6650",
"Arousal": "0,2297",
"Gender": "Male",
"Age": "20 - 30",
"Beard": "None",
"Moustache": "None",
"Glasses": "Yes",
"Ethnicity": "Caucasian",
"Y - Head Orientation": "-1,7628",
"X - Head Orientation": "2,5652",
"Z - Head Orientation": "-3,0980",
"Landmarks": "375,4739 - 121,6879 - 383,2627 - 113,6502 - 390,8202 - 110,3507 - 396,1021 - 109,7039 - 404,9615 - 110,9594 - 443,2603 - 108,9765 - 451,9454 - 106,7192 - 457,1207 - 106,8835 - 464,1162 - 109,5496 - 470,9659 - 116,8992 - 387,4940 - 132,0171 - 406,4031 - 130,4482 - 441,6239 - 128,6356 - 460,6862 - 128,1997 - 419,0713 - 161,6479 - 425,3519 - 155,1223 - 431,9862 - 160,6411 - 406,9320 - 190,3831 - 411,4790 - 188,7656 - 423,1751 - 185,6583 - 428,5339 - 185,6882 - 433,7802 - 184,8167 - 445,6192 - 186,3515 - 450,8424 - 187,2787 - 406,0796 - 191,1880 - 411,9287 - 193,5352 - 417,9666 - 193,6567 - 424,0851 - 193,4941 - 428,6678 - 193,5652 - 433,2172 - 192,7540 - 439,3548 - 192,0136 - 445,4181 - 191,1532 - 451,6007 - 187,9486 - 404,5193 - 190,6352 - 412,8277 - 185,4609 - 421,1355 - 181,2883 - 428,3182 - 181,1826 - 435,2024 - 180,2258 - 443,9292 - 183,2533 - 453,1117 - 187,2288 - 405,9689 - 193,2750 - 410,0249 - 199,8118 - 416,0457 - 203,0374 - 423,4839 - 204,1818 - 429,9247 - 204,2175 - 436,3620 - 203,1305 - 443,4268 - 200,9355 - 448,9572 - 197,1335 - 452,0746 - 190,0314",
"Quality": "0,8137",
"Mouth": "Closed",
"Left Eye": "Open",
"Right Eye": "Open",
"Left Eyebrow": "Lowered",
"Right Eyebrow": "Lowered",
"Identity": "NO IDENTIFICATION"
}
myregexp
),这样正则表达式只会被编译一次。以下是一个示例: http://jsperf.com/image-features-log-parsing/5
请记住,此示例每次都会在循环中编译正则表达式。
答案 1 :(得分:0)
我没有足够的声誉来发表评论,因此我将提供部分解决方案。使用正则表达式:/(\b[A-Za-z -]+?) : (.+? )/g
,然后仅使用Capture Group 1.结果如下所示:https://regex101.com/r/qJ7jU7/1
唯一的缺点是没有捕获“详细日志”。
根据我的经验,并非所有数据都适合一次性使用Regex,有时您需要将其分解为多个部分。
答案 2 :(得分:0)
我认为字符串中的空格有太多用法来使用简单的正则表达式。即使剥离关键字也会导致以下混乱,分成几个步骤以使其更清晰:
str.replace(/([0-9])( - )([0-9])/g,"$1-$3") // get rid of spaces between landmarks hyphen
.replace(/\: [^ ]+/g,",") // get rid of values
.replace(/(DetailedLog)([0-9.: ]+)/,"$1, ") // get rid of date
.replace(/(Identity)(.*)/,"$1") // get rid of value of "identity"
您提出了一个简单的解析器,但如果您事先不知道关键字,则无法使用。如果你事先知道它们:只需构建那个简单的解析器并使用关键字作为分隔符。我相信它会比任何高度复杂的正则表达式更快。您可以使用JISON来避免一些麻烦。
啊,我太晚了。试。尽管如此,这是一个非常简单,未经优化的基准测试解析器:
// That's how I made the keys-array, not actively used here
str.replace(/([0-9])( - )([0-9])/g,"$1-$3")
.replace(/\: [^ ]+/g,",")
.replace(/(DetailedLog)([0-9.: ]+)/,"$1, ")
.replace(/(Identity)(.*)/,"$1")
.replace(/([^,]+)/g,"\"$1\"" )
.replace(/\" /g,"\"")
.replace(/ \"/g,"\"");
var keys = ["DetailedLog", "Neutral" , "Happy" , "Sad" , "Angry" , "Surprised" ,
"Scared" , "Disgusted" , "Valence" , "Arousal" , "Gender" , "Age" , "Beard" ,
"Moustache" , "Glasses" , "Ethnicity" , "Y - Head Orientation" ,
"X - Head Orientation" , "Z - Head Orientation" ,
"Landmarks" , "Quality" , "Mouth" , "Left Eye" ,
"Right Eye" , "Left Eyebrow" , "Right Eyebrow" , "Identity"];
var db = {};
var value;
for(var k = 0;k < keys.length - 1;k++){
var regex = new RegExp("("+keys[k] + "[ :]+)([^:]+)(" + keys[k+1] + ")");
value = str.match(regex);
if(value){
db[keys[k]] = value[2].trim();
}
}
// last one
db[keys[keys.length -1]] = value[2].trim();
// take a look
JSON.stringify(db)
对于几百行左右,它应该足够快,特别是如果你稍微优化一下(例如,预先计算正则表达式,它在循环中执行它有点愚蠢)但是在至少你有一个可以比较的基准,因为我不认为你可以在没有一点努力的情况下慢慢做。