JavaScript regex matches some strings but fails on other, seemingly identical strings

Asked: 2015-10-21 00:10:57

Tags: javascript regex

JSFiddle

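I'm using Facebook's API to pull daily crime reports from my county police department's page. They follow a mostly standardized format; the pattern below is what I'm targeting, along with some annoying inconsistencies:

  1. A header of 3-4 lines, followed by two newline characters \n\n (the code strips it out, and it is not part of the output below)
  2. Crimes of different categories are grouped together, with the first line being an uppercase string describing the type of crime. Each category is separated from the one above it by two newline characters \n\n.
  3. The actual crimes follow the category heading described above, each (most of the time) separated by a single newline character \n
  4. As an artifact of copying and pasting, hyphens are sometimes replaced by other dash characters, including \u2013, \u2014, and \u2015
  5. All reported crimes start with the string "BEAT" or, in rare cases, "Beat"

The problem I'm having is that the code below sometimes captures the category headings described in #2 above, but in other posts (seemingly) identical strings in identical surroundings are not captured. The Angular code I'm using in my service is shown below: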

    me.parsePosts = function() {
        var posts = facebookService.getRandomPosts(); // Just a method to return 5 random reports for now
        angular.forEach(posts, function(post) {
            // Some reports are incorrectly double spaced and inconsistent
            // with spacing and capitalization
            var fixedPost = post.message
                                .replace(/^Beat/, 'BEAT') // They were a little inconsistent back in the day
                                .replace('\n\n###', '') // All posts end with a useless ###
                                .replace('\u2013', '-') // Pesky unicode characters!
                                .replace('\u2014', '-')
                                .replace('\u2015', '-')
                                .replace('\n\nARRESTED', '\nARRESTED') // would help if this was consistent
                                .replace(/(?:\\[rn ]|[\r\n ]+)BEAT/gi, '\nBEAT'), // same with the reports...
                postSplit = fixedPost.split('\n\n'), // split up the post into potential categories
                header = postSplit.splice(0,1); // I don't want the standard header of the post
    
            // Pass in postSplit .join()'d back together for debugging
            me.getCategoriesFromPost(postSplit, postSplit.join('\n\n'));
        });
    };
    
    me.getCategoriesFromPost = function(postArray, post) {
        var categoryRegexp = /[A-Z\-&\/: ]+$/,
            categories = [], uniqCategories = [];
    
        angular.forEach(postArray, function(a) {
            var split = a.split('\n'), // Extract the category from the list of crimes
                potentialCategory = split[0].trim(); // There's often an unwanted trailing space
    
            if (potentialCategory.match(categoryRegexp)) {
                categories.push(potentialCategory);
            }
        });
    
        // Every blue moon they repost a category twice, I just want one
        // and I'll merge the two together afterwards
        uniqCategories = categories.filter(function(a,b) {
            return categories.indexOf(a) == b;
        });
    
        console.log(uniqCategories); // log off all the categories in the post
        console.log(post); // Display the actual post so i can visibly verify it all worked
    };
    

For example, in one post, console.log(post); shows (original raw text as received from facebookService.getRandomPosts()):

    BURGLARY COMMERCIAL
    BEAT E1 SPRINT WIRELESS, 7300 ASSATEAGUE DR, 3/19 0426: Unknown suspect(s) gained entry to the business by breaking the glass door. The suspect(s) stole electronics. 14-25638
    BEAT D6 MONTPELIER LIQUORS, 7500 MONTPELIER RD, 3/19 0513: Unknown suspect(s) gained entry to the business by breaking the glass door. The suspect(s) stole liquor, lottery tickets, and an ATM machine. 14-25641
    BEAT D4 MACY’S, 10300 LITTLE PATUXENT PKWY, 3/19 0501: Two unknown male suspects, wearing masks, gained entry to the business by breaking the glass door. The suspects were interrupted by a store employee and fled without taking anything. 14-25642
    SUSPECT VEHICLE: black Dodge pickup 
    
    BURGLARY NON COMMERCIAL
    BEAT B3 6600 ASPERN DR, 3/17 2354: Four suspects gained entry to the residence via unknown means. No sign of forced entry. 14-25220 
    ARRESTED:
    Karlin Lamont Harris, 23, of Pirch Way in Elkridge, charged with fourth-degree burglary
    Steven Lee Hubbard, 29, of Edgewater, charged with fourth-degree burglary
    Jessie Tyler Holt, 22, of Pine Tree Rd in Jessup, charged with fourth-degree burglary
    Brittney Victoria McEnaney, 26, of Pasadena, charged with fourth-degree burglary
    BEAT C1 6900 BENDBOUGH CT, 3/18 1400: Unknown suspect(s) gained entry to the residence via the front door. No sign of forced entry. The suspect(s) stole jewelry. 14-25392
    BEAT B4 7100 DEEP FALLS WAY, 3/18 1100-1440: Unknown suspect(s) gained entry to the residence by forcing a rear basement window. The suspect(s) stole jewelry and electronics. 14-25404 
    
    VEHICLE THEFT & ATTEMPTS
    BEAT E2 7-11, 9600 WASHINGTON BLVD, 3/18 0409: 
    05 Acura Tag 1AV8629 14-25277 (Keys left in vehicle.)
    

console.log(uniqCategories); returns:

    ["BURGLARY COMMERCIAL", "BURGLARY NON COMMERCIAL", "VEHICLE THEFT & ATTEMPTS"]
    

However, in another post, console.log(post); shows (original raw text as received from facebookService.getRandomPosts()):

    ROBBERY COMMERCIAL
    BEAT B3 ZIPS DRY CLEANING, 6500 OLD WATERLOO RD, 3/22 1900: An unknown suspect entered the business through an unlocked rear door. The suspect threatened an employee and demanded cash. The employee complied. The suspect fled the business. 14-26959 
    SUSPECT: B/M, 5’8-5’9, black hoodie and pants, backpack 
    
    ROBBERY NON COMMERCIAL
    BEAT E7 7-11 PARKING LOT, 9100 MAIER RD, 03/23 1632: Suspect stole cash from an acquaintance and caused an abrasion with an unknown sharp object. Police are investigation the possibility it may be drug related. 14-27243 
    SUSPECT: B/M, 5’8, 200 lbs, dreadlocks
    
    BURGLARY COMMERCIAL
    BEAT E1 MEGATELECOM, 8600 WASHINGTON BLVD #106, 3/22 0933: Unknown suspect(s) gained entry to the business by breaking a window. The suspect(s) stole electronics. 14-26793
    BEAT F3 CATTAIL CREEK COUNTRY CLUB, 3600 CATTAIL CREEK DR, 03/22 1600- 03/23 0630: Unknown suspect(s) gained entry to a garage through an unlocked door. The suspect(s) stole golf carts. 14-27127
    
    BURGLARY NON COMMERCIAL
    BEAT E2 9300 BREAMORE CT, 03/21 1210 ATTEMPT: Two suspects attempted to gain entry via a rear slider. The resident yelled and the suspects fled, but were later caught by police. 14-26458
    ARRESTED:
    Travis Donte Mackell, 23, of Baltimore, charged with fourth-degree burglary
    Maurice Debuiel Aye, 26, of Baltimore, charged with fourth-degree burglary
    BEAT D3 5500 COLUMBIA RD, 3/21: An unknown suspect gained entry to the residence through an unlocked rear slider. The suspect woke the resident, who ultimately got the suspect to leave. It appears he may have entered the wrong residence. 14-26712 
    SUSPECT: B/M, 5’8, 200 lbs
    BEAT B4 7500 HEARTHSIDE WAY, 3/22 1700- 1800: Three unknown black male suspects stole a bicycle, which was unsecured on a bike rack. 14-27185
    BEAT E3 9100 BRYANT AVE, 3/23 2213: Unknown suspects gained entry to the residence by prying open the kitchen window. Nothing appeared to be taken. 14-27308
    BEAT B3 8000 KEETON RD, 3/23 1930- 2230: Unknown suspect(s) gained entry to the residence through an unlocked window. The suspect(s) stole a computer and jewelry. 14-27314
    BEAT A3 9000 FREDERICK RD, 3/23 0205: The suspect kicked in an acquaintance’s door after a verbal altercation and assaulted him. 14-27361 
    ARRESTED: Michael Wilson Sittig, 34, of Frederick Road in Ellicott City, charged with second-degree assault, third- and fourth-degree burglary, malicious destruction of property, and disorderly conduct
    
    VEHICLE THEFT & ATTEMPTS
    BEAT D2 5100 ELIOTS OAK DR, 03/22 2130- 3/23 0700: 
    12 Hyundai Sonata Red MD 5AN2945 14-27135
    

console.log(uniqCategories); returns only:

    ["ROBBERY COMMERCIAL", "VEHICLE THEFT & ATTEMPTS"]
    

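I would expect it to return ["ROBBERY COMMERCIAL", "ROBBERY NON COMMERCIAL", "BURGLARY COMMERCIAL", "BURGLARY NON COMMERCIAL", "VEHICLE THEFT & ATTEMPTS"]

In this case it's clear that my code matches BURGLARY COMMERCIAL and BURGLARY NON COMMERCIAL in the former post but not in the latter. What gives? Also, feel free to correct me and tell me I'm doing it wrong with the wall of .replace() calls, and whether there is a better way. Any help is much appreciated!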

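2 answers:

Answer 0: (score: 2)

String.replace with a string pattern replaces only the first occurrence. You need to change all of your String.replace calls to use regular expressions with the g flag so that every match is replaced. Something like this (although I'm not sure how unicode characters work in regular expressions):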

    // Regex literals are written without quotes; a quoted '/.../g' is just a string.
    post.message
      .replace(/^Beat/ig, 'BEAT') // They were a little inconsistent back in the day
      .replace(/\n\n###/g, '') // All posts end with a useless ###
      .replace(/\u2013/g, '-') // Pesky unicode characters!
      .replace(/\u2014/g, '-')
      .replace(/\u2015/g, '-')
      .replace(/\n\nARRESTED/g, '\nARRESTED') // would help if this was consistent
      .replace(/(?:\\[rn ]|[\r\n ]+)BEAT/gi, '\nBEAT'); // same with the reports...
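To make the difference concrete, here is a minimal sketch using made-up sample strings (not text from a real report). It shows that a string pattern stops after the first hit while a regex with the g flag replaces every hit, and that \uXXXX escapes work inside a regex literal. The combined character class is an optional shorthand, not part of the answer above.

```javascript
// Hypothetical sample text containing two en dashes (\u2013).
var sample = 'A \u2013 B \u2013 C';

// A string pattern replaces only the first occurrence.
var firstOnly = sample.replace('\u2013', '-');

// A regex literal with the g flag replaces every occurrence.
var allReplaced = sample.replace(/\u2013/g, '-');

// One character class can cover all three dash variants in a single pass.
var allDashes = 'a\u2013b\u2014c\u2015d'.replace(/[\u2013\u2014\u2015]/g, '-');

console.log(firstOnly);   // second en dash is still present
console.log(allReplaced); // A - B - C
console.log(allDashes);   // a-b-c-d
```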

Answer 1: (score: 1)

You are missing a few separator replacements before the split. That is, I added:

    post.message
    ...
    .replace( /\s*\n\s\n/g, '\n\n')
    .replace(/\s BEAT/g, 'BEAT') ...

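See the updated fiddle

TL;DR (updated per comments)

If you look at the messages after your original replace(...) calls and before the .split('\n\n'), some of them have a space at the end of a line, followed by a newline, then another whitespace character, and then a newline.

None of your original replace() calls handle this. Also, some messages have only the newline, whitespace, newline pattern (which is why the first \s in the regex has a *). Finally, some of the BEAT keywords in the messages are preceded by one or more spaces, so we remove those to make sure BEAT always comes right after a newline.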
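The effect of the separator fix can be sketched as follows. The two-category post here is hypothetical and shortened; the only thing carried over from the answer is the /\s*\n\s\n/g replacement.

```javascript
// Hypothetical post: the categories are separated by " \n \n", not a clean "\n\n".
var rawPost = 'ROBBERY COMMERCIAL\nBEAT B3 first report \n \n' +
              'ROBBERY NON COMMERCIAL\nBEAT E7 second report';

// Without normalization there is no exact "\n\n", so the post stays in one piece
// and the second category heading is buried inside the first chunk.
var chunksBefore = rawPost.split('\n\n');

// Collapse "optional whitespace, newline, one whitespace, newline" into "\n\n".
var normalizedPost = rawPost.replace(/\s*\n\s\n/g, '\n\n');
var chunksAfter = normalizedPost.split('\n\n');

// The first line of each chunk is the candidate category heading.
var headings = chunksAfter.map(function (chunk) {
    return chunk.split('\n')[0].trim();
});

console.log(chunksBefore.length); // 1
console.log(chunksAfter.length);  // 2
console.log(headings);
```

Because only the first line of each chunk is tested against the category regex, any heading that ends up in the middle of a chunk is never seen, which matches the missing categories in the question.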

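If you uncomment the logging lines in the fiddle and comment out the fixes, you will see the array of elements at each step.

In one of them you will see an array element that contains not only what we expect (one report) but also the next category, which is why you see fewer categories.

Then I just tried to see what was different about these line endings and checked whether the replace() calls handled them before the split(...) call...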
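One way to spot such differences, offered here as a suggestion rather than what the fiddle actually does, is to print the string through JSON.stringify, which escapes newlines so that every whitespace character becomes visible. The chunk below is a hypothetical example.

```javascript
// Hypothetical chunk with a trailing space and a whitespace-only "blank" line.
var suspectChunk = 'SUSPECT: B/M, 200 lbs \n \nBURGLARY COMMERCIAL';

// JSON.stringify renders each newline as the two characters \ and n,
// so stray spaces around line breaks are easy to see in the console.
var visible = JSON.stringify(suspectChunk);
console.log(visible);

// Collect every whitespace run that contains a newline: these are the
// separators actually present in the text.
var separators = suspectChunk.match(/\s*\n\s*/g) || [];
console.log(separators.map(function (s) { return JSON.stringify(s); }));
```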

Let me know if you would like me to explain this better.