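Python regex extraction with lookahead

Date: 2014-10-30 11:09:49

Tags: python regex

I have been trying to extract transport node names and location coordinate strings from a scraped web page (I have permission to scrape it). The names and locations sit inside a CDATA block of JavaScript. See here: http://pastebin.com/6Vtup2dE

Using this regular expression in Python: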

re.findall("(?:\(new\sMicrosoft\.Maps\.Location\()(.+?(?=\)\,))(?:.+?(?=new\ssimpleInfo\(\\\'))(.+?(?=\\)))", test_str)

I get:

[(u'55.86527,-4.2517133',
  u"new simpleInfo('Buchanan Bus Station','Glasgow, Buchanan Bus Station - entrance to station is situated on Killermont Street. It is a short walk from George Square and within easy reach of Glasgow?s main shopping and leisure areas. Please check the bus station passenger displays for stance information for megabus services.'"),
 (u'55.86068,-4.257852', u"new simpleInfo('Central Train Station',''"),
 (u'51.492653,-0.14765126',
  u"new simpleInfo('Victoria, Buckingham Palace Rd, Stop 10','London Victoria, Buckingham Palace Road - at the corner of Elizabeth Bridge and diagonally across from the main entrance to Victoria Coach Station. megabus Oxford Tube services leave from Stop 10.'"),
 (u'51.492596,-0.14985295',
  u"new simpleInfo('Victoria Coach Station','London Victoria Coach Station is situated on Buckingham Palace Rd at the junction with Elizabeth St. megabus services depart from Stands 15-20, located in the departures area of North West terminal '"),
 (u'51.503437,-0.112076715',
  u"new simpleInfo('Waterloo Train Station','London Waterloo - London Waterloo Station is located on Station Approach, SE1 London - just behind the London Eye. The station is a terminus for trains serving the south-west of England and Eurostar services. Waterloo is the largest station in the UK by area. Its spacious, curved concourse is lined with shops and all the modern amenities.\\n'"),
 (u'51.53062,-0.12585254',
  u"new simpleInfo('St Pancras International Train Station','For East Midlands Trains services only. London St Pancras International, London - St Pancras Station is located on Pancras Rd NW1 between the national Library and Kings Cross station. The station is the terminus for trains serving East Midlands and South Yorkshire. It is also the new London terminal for the Eurostar services to the continent. Kings Cross St Pancras tube station provides links via the London underground to other London destinations.'"),
 (u'51.52678,-0.13297649',
  u"new simpleInfo('Euston Train Station','For Virgin Trains Services Only. London Euston - The station is the main terminal for trains to London from the West Midlands and North West England. It is connected to Euston Tube Station for easy access to the London Underground network'"),
 (u'51.52953,-0.12506014',
  u"new simpleInfo('St Pancras, Coach Road','In some instances megabusplus services which operate as coach only will pick up from Coach Road, outside London St Pancras.'"),
 (u'55.86527,-4.2517133',
  u"new simpleInfo('Buchanan Bus Station','Glasgow, Buchanan Bus Station - entrance to station is situated on Killermont Street. It is a short walk from George Square and within easy reach of Glasgow?s main shopping and leisure areas. Please check the bus station passenger displays for stance information for megabus services.'"),
 (u'55.86068,-4.257852', u"new simpleInfo('Central Train Station',''")]

But all I want is:

[(u'55.86527,-4.2517133','Buchanan Bus Station'),
     (u'55.86068,-4.257852', 'Central Train Station'),
     (u'51.492653,-0.14765126','Victoria, Buckingham Palace Rd, Stop 10'),
     (u'51.492596,-0.14985295','Victoria Coach Station')....etc]

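I have written a lot of regular expressions in my time, but I have never run into a problem like this. I am trying (believe it or not) to skip everything up to and including "new simpleInfo('" and then capture the string up to the next "'", but I cannot work it out. Help!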

2 answers:

Answer 0 (score: 1)

Try this:

re.findall(r"(?:\(new\sMicrosoft\.Maps\.Location\(([^)]+)\).+?new\ssimpleInfo\(\\?'(.+?)\\?')", test_str)

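This regex matches the name whether it appears as \'Buchanan Bus Station\' or as 'Buchanan Bus Station', because the backslash before each quote is optional.

Here is a demo.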
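As a quick check of the pattern above, here is a minimal sketch that runs it against a shortened sample. SAMPLE is a hypothetical, cut-down imitation of the page's script block (it is not the real pastebin content), and the escaped variant is produced only to exercise the optional backslash.

```python
import re

# Pattern from the answer above; \\? makes the backslash before each quote optional.
PATTERN = re.compile(
    r"(?:\(new\sMicrosoft\.Maps\.Location\(([^)]+)\).+?new\ssimpleInfo\(\\?'(.+?)\\?')"
)

# Hypothetical, shortened sample modelled on the markers in the question.
SAMPLE = (
    "EmperorBing.addMarker(map, new Microsoft.Maps.Pushpin("
    "new Microsoft.Maps.Location(55.86527,-4.2517133), {width:42}),"
    "new simpleInfo('Buchanan Bus Station','Glasgow, Buchanan Bus Station'));\n"
    "EmperorBing.addMarker(map, new Microsoft.Maps.Pushpin("
    "new Microsoft.Maps.Location(51.492653,-0.14765126), {width:42}),"
    "new simpleInfo('Victoria, Buckingham Palace Rd, Stop 10','London Victoria'));\n"
)

# Same text with every quote preceded by a literal backslash.
ESCAPED = SAMPLE.replace("'", "\\'")

pairs_plain = PATTERN.findall(SAMPLE)
pairs_escaped = PATTERN.findall(ESCAPED)

print(pairs_plain)
print(pairs_escaped == pairs_plain)
```

Both calls should return the same list of (coordinates, name) tuples. Note that . does not match a newline by default, so each marker has to sit on a single line, as it does in this sample.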

Answer 1 (score: 0)

(?:\(new\sMicrosoft\.Maps\.Location\()(.+?(?=\)\,))(?:.+?).*?new\ssimpleInfo\(\\'([^'\\]+)

Try this. It should give you what you want.

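import re
# r'' replaces the Python 2-only ur'' prefix so the pattern compiles on Python 2 and 3.
# The backslash before the opening quote is optional here (\\?) so the pattern also
# matches the unescaped quotes that test_str contains once Python has parsed its \' escapes.
p = re.compile(r'(?:\(new\sMicrosoft\.Maps\.Location\()(.+?(?=\)\,))(?:.+?).*?new\ssimpleInfo\(\\?\'([^\'\\]+)')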
test_str = u"jQuery(function(){ jQuery(\'#JourneyPlanner_txtOutboundDate\').datepicker({dateFormat: \'dd/mm/yy\', firstDay: 1, beforeShowDay: function(dte){ return [((dte >= new Date(2014,9,29) && dte <= new Date(2015,0,4)) || false)]; }, minDate: new Date(2014,9,29), maxDate: new Date(2015,0,4),buttonImage: \"/images/icon_calendar.gif\", showOn: \"both\", buttonImageOnly: true}); });\njQuery(function(){ jQuery(\'#JourneyPlanner_txtReturnDate\').datepicker({dateFormat: \'dd/mm/yy\', firstDay: 1,buttonImage: \"/images/icon_calendar.gif\", showOn: \"both\", buttonImageOnly: true}); });\nEmperorBing.addCallback(function(){ var map = new Microsoft.Maps.Map(document.getElementById(\'confirm1_Map1\'), {credentials:\'Aodb7Wd7D9Kq5gKNryfW6V29yf8aw2Sbu-tXAlkH7OLJtm8zG2bQzzhDKK5zM9FE\',height: 320,width: 299, zoom: 13, mapTypeId: Microsoft.Maps.MapTypeId.auto, enableClickableLogo: false , enableSearchLogo: false , showDashboard: true, showCopyright: true, showScalebar: true, showMapTypeSelector: true});\r\nEmperorBing.addMarker(map, new Microsoft.Maps.Pushpin(new Microsoft.Maps.Location(55.86527,-4.2517133), { undefined: undefined, icon:\'/images/mapmarker.gif\', width:42, height:42, anchor: new Microsoft.Maps.Point(21,21)}),new simpleInfo(\'Buchanan Bus Station\',\'Glasgow, Buchanan Bus Station - entrance to station is situated on Killermont Street. It is a short walk from George Square and within easy reach of Glasgow?s main shopping and leisure areas. Please check the bus station passenger displays for stance information "

re.findall(p, test_str)

See the demo.

http://regex101.com/r/dP9rO4/9
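To illustrate why the second group in the question captured too much, the sketch below contrasts the question's lookahead, which runs lazily until the next ")", with the negated character class used in this answer, which stops at the closing quote. The snippet string is a hypothetical one-marker sample, and the optional backslash (\\?) is an adjustment that is not in the answer's pattern; it is there only because this sample uses unescaped quotes.

```python
import re

# Hypothetical one-marker sample with unescaped quotes.
snippet = "new simpleInfo('Central Train Station','')"

# Lookahead as in the question: capture lazily until the next ")".
# Nothing stops the capture at the closing quote of the name.
until_paren = re.search(r"(.+?(?=\)))", snippet).group(1)

# Negated class as in this answer: after the opening quote, capture
# characters that are neither a quote nor a backslash.
name = re.search(r"new\ssimpleInfo\(\\?'([^'\\]+)", snippet).group(1)

print(until_paren)
print(name)
```

The first capture reproduces the over-long value from the question's output, while the second yields only the station name.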