How to extract lat and lng values using Python or BeautifulSoup

Time: 2017-08-30 17:48:56

Tags: python regex python-2.7 beautifulsoup

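Could you help me with this small problem? I want to extract the lat and lng values from the code below, which sits in a SCRIPT tag (not in the body), using Beautiful Soup or plain Python. I am new to Python, and blogs recommend Beautiful Soup for extraction.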

I want these two values: lat: 21.25335, lng: 81.649445. I am using a regular expression. My regular expression is "^l([a-t])(:)([0-9])([^,]+)".

See this link for the regular expression and the HTML file: http://regexr.com/3glde

This regular expression finds both values, but I only want to store the lat and lng values (the numeric part) in variables.

Below is the Python code I am using:

import re
pattern = re.compile(r"^[l]([a-t])([a-t])(\:) ([0-9])([^,]+)")

for i, line in enumerate(open(r'C:\hile_text.html')):
    for match in re.finditer(pattern, line):
        print 'Found on line %s: %s' % (i+1, match.groups())

Output:

  • Found on line 3218: ('a', 't', ':', '2', '1.244791')
  • Found on line 3219: ('n', 'g', ':', '8', '1.643486')

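I only want the numeric values as output, e.g. 21.25335, 81.649445, and I want to store them in variables. Alternatively, you could suggest different code for this.

Please help me soon. Thanks in advance.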

Here is the script tag from the HTML file.

<script type="text/javascript">
    window.mapDivId = 'map0Div';
    window.map0Div = {
    lat: 21.25335,
    lng: 81.649445,
    zoom: null,
    locId: 5897747,
    geoId: 297595,
    isAttraction: false,
    isEatery: true,
    isLodging: false,
    isNeighborhood: false,
    title: "Aman Age Roll & Chicken ",
    homeIcon: true,
    url: "/Restaurant_Review-g297595-d5897747-Reviews-Aman_Age_Roll_Chicken-Raipur_Raipur_District_Chhattisgarh.html",
    minPins: [
    ['hotel', 20],
    ['restaurant', 20],
    ['attraction', 20],
    ['vacation_rental', 0]       ],
    units: 'km',
    geoMap: false,
    tabletFullSite: false,
    reuseHoverDivs: false,
    noSponsors: true    };
    ta.store('infobox_js', 'https://static.tacdn.com/js3/infobox-c-v21051733989b.js');
    ta.store("ta.maps.apiKey", "");
    (function() {
    var onload = function() {
    if (window.location.hash == "#MAPVIEW") {
    ta.run("ta.mapsv2.Factory.handleHashLocation", {}, true);
    }
    }
    if (window.addEventListener) {
    if (window.history && window.history.pushState) {
    window.addEventListener("popstate", function(e) {
    ta.run("ta.mapsv2.Factory.handleHashLocation", {}, false);
    }, false);
    }
    window.addEventListener('load', onload, false);
    }
    else if (window.attachEvent) {
    window.attachEvent('onload', onload);
    }
    })();
    ta.store("mapsv2.show_sidebar", true);
    ta.store('mapsv2_restaurant_reservation_js', ["https://static.tacdn.com/js3/ta-mapsv2-restaurant-reservation-c-v2430632369b.js"]);
    ta.store('mapsv2.typeahead_css', "https://static.tacdn.com/css2/maps_typeahead-v21940478230b.css");
    // Feature gate VR price pins on SRP map. VRC-14803
    ta.store('mapsv2.vr_srp_map_price_enabled', true);
    ta.store('mapsv2.geoName', 'Raipur');
    ta.store('mapsv2.map_addressnotfound', "Address not found");     ta.store('mapsv2.map_addressnotfound3', "We couldn\'t find that location near {0}.  Please try another search.");     ta.store('mapsv2.directions', "Directions from {0} to {1}");     ta.store('mapsv2.enter_dates', "Enter dates for best prices");     ta.store('mapsv2.best_prices', "Best prices for your stay");     ta.store('mapsv2.list_accom', "List of accommodations");     ta.store('mapsv2.list_hotels', "List of hotels");     ta.store('mapsv2.list_vrs', "List of holiday rentals");     ta.store('mapsv2.more_accom', "More accommodations");     ta.store('mapsv2.more_hotels', "More hotels");      ta.store('mapsv2.more_vrs', "More Holiday Homes");     ta.store('mapsv2.sold_out_on_1', "SOLD OUT on 1 site");     ta.store('mapsv2.sold_out_on_y', "SOLD OUT on 2 sites");   </script>

1 Answer:

Answer 0 (score: 0)

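Your regex is a bit off. ^l means you are trying to match an 'l' that is the very first character of the line, so an indented line never matches.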
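A small sketch of that point, assuming the line is indented with spaces as in the script tag from the question. The variable names `anchored` and `relaxed` are only illustrative.

```python
import re

# One line from the script tag in the question, including its indentation.
line = "    lat: 21.25335,"

# '^l' requires 'l' to be the first character, so the leading spaces block the match.
anchored = re.search(r"^l", line)

# Allowing optional whitespace after '^' lets the same line match.
relaxed = re.search(r"^\s*l", line)

print(anchored is None)      # True: the anchored pattern finds nothing
print(relaxed is not None)   # True: the relaxed pattern matches
```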

^\s+(l[an][gt])(:\s+)(\d+\.\d+) would work better. Check a regex analyzer tool, for example http://www.myezapp.com/apps/dev/regexp/show.ws, to get a breakdown of what is happening.

Here is the breakdown:

Sequence: match all of the followings in order
    BeginOfLine
    Repeat
        WhiteSpaceCharacter
        one or more times
    CapturingGroup GroupNumber:1
        Sequence: match all of the followings in order
            l
            AnyCharIn[ a n]
            AnyCharIn[ g t]
    CapturingGroup GroupNumber:2
        Sequence: match all of the followings in order
            :
            Repeat
                WhiteSpaceCharacter
                one or more times
    CapturingGroup GroupNumber:3
        Sequence: match all of the followings in order
            Repeat
                Digit
                one or more times
            .
            Repeat
                Digit
                one or more times
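
To store the numbers in variables, the pattern above can be applied roughly as follows. This is a minimal sketch, not a definitive solution: `script_text` is a shortened copy of the script from the question, the names `coords`, `lat` and `lng` are illustrative, and the code uses Python 3 print syntax. It also assumes positive coordinates, since the pattern has no optional minus sign.

```python
import re

# A shortened copy of the script text from the question (assumption: the real
# text would come from the HTML file or from the contents of the script tag).
script_text = """
    window.map0Div = {
    lat: 21.25335,
    lng: 81.649445,
    zoom: null,
    locId: 5897747,
"""

# re.MULTILINE makes '^' match at the start of every line, not only at the
# start of the whole string.
pattern = re.compile(r"^\s+(l[an][gt])(:\s+)(\d+\.\d+)", re.MULTILINE)

# Group 1 is the key ('lat' or 'lng'); group 3 is the numeric part.
coords = {m.group(1): float(m.group(3)) for m in pattern.finditer(script_text)}

lat = coords["lat"]
lng = coords["lng"]
print(lat, lng)
```

Only group 3 is converted to a number, which is why the output no longer contains the separate letter and colon groups that the original pattern produced.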