Python - Searching for a specific "var" using bs4

Time: 2018-12-26 10:13:16

Tags: python regex beautifulsoup

I have been trying to learn some web scraping, and I managed to get a site to return a lot of different var values, for example:

var FancyboxI18nClose = 'Close';
var FancyboxI18nNext = 'Next';
var FancyboxI18nPrev = 'Previous';
var PS_CATALOG_MODE = false;
var added_to_wishlist = '.';
var ajax_allowed = true;
var ajaxsearch = true;
var attribute_anchor_separator = '-';
var attributesCombinations = [{"id_attribute":"100","id_attribute_group":"1","attribute":"38_5"},{"id_attribute":"101","id_attribute_group":"1","attribute":"39"},{"id_attribute":"103","id_attribute_group":"1","attribute":"40"},{"id_attribute":"104","id_attribute_group":"1","attribute":"40_5"},{"id_attribute":"105","id_attribute_group":"1","attribute":"41"},{"id_attribute":"107","id_attribute_group":"1","attribute":"42"},{"id_attribute":"108","id_attribute_group":"1","attribute":"42_5"},{"id_attribute":"109","id_attribute_group":"1","attribute":"43"},{"id_attribute":"111","id_attribute_group":"1","attribute":"44"},{"id_attribute":"112","id_attribute_group":"1","attribute":"44_5"},{"id_attribute":"132","id_attribute_group":"1","attribute":"45"},{"id_attribute":"113","id_attribute_group":"1","attribute":"46"}];

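There are of course many more, all declared with var. What I want is to scrape only one of these values, var attributesCombinations. In other words, I just want to extract that one value so that I can pass it to json.loads and work with it as JSON, which would be much easier.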

This is what I tried:

try:
    product_li_tags = bs4.find_all(text=re.compile('attributesCombinations'))
except Exception:
    product_li_tags = []

But this returns every var from the first one up to and including attributesCombinations:

['var CUSTOMIZE_TEXTFIELD = 1;\nvar FancyboxI18nClose = \'Close\';\nvar FancyboxI18nNext = \'Next\';\nvar FancyboxI18nPrev = \'Previous\';\nvar PS_CATALOG_MODE = false;\nvar added_to_wishlist = \'The product was successfully added to your wishlist.\';\nvar ajax_allowed = true;\nvar ajaxsearch = true;\nvar allowBuyWhenOutOfStock = false;\nvar attribute_anchor_separator = \'-\';\nvar attributesCombinations = [{"id_attribute":"100","id_attribute_group":"1","att...........

How can I make it print only var attributesCombinations?

2 answers:

Answer 0: (score: 1)

A regular expression that extracts exactly the part from attributesCombinations to the end of the statement is

var attributesCombinations = (\[.*?\])

In Python, you can compile this regular expression with

re.compile(r'var attributesCombinations = (\[.*?\])')
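
As a minimal sketch of how the compiled pattern might then be applied, the example below searches a shortened copy of the script text from the question and hands the captured group to json.loads. The names script_text and combinations are illustrative and not from the original post.

```python
import json
import re

# Sample script text, shortened from the question's output (illustrative only).
script_text = (
    "var ajaxsearch = true;\n"
    "var attribute_anchor_separator = '-';\n"
    'var attributesCombinations = [{"id_attribute":"100","id_attribute_group":"1","attribute":"38_5"},'
    '{"id_attribute":"101","id_attribute_group":"1","attribute":"39"}];\n'
)

pattern = re.compile(r'var attributesCombinations = (\[.*?\])')

match = pattern.search(script_text)
# search() returns None when the variable is absent, so guard before using it.
combinations = json.loads(match.group(1)) if match else []

for combo in combinations:
    print(combo["id_attribute"], combo["attribute"])
```

Because .*? is non-greedy, the capture stops at the first ] it meets. That is enough for the flat array shown in the question, but it would cut the value short if the array contained nested arrays or a ] inside a string.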

Answer 1: (score: 1)

Don't pass re.compile to bs4; run the regex directly against the HTML string.

import json
import re
# htmlString holds the page's raw HTML as a str
match = re.compile(r'var\s*attributesCombinations\s*=\s*(\[.*?\])').findall(htmlString)
attributesCombinations = json.loads(match[0])
print(attributesCombinations)
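
If the array could ever contain nested brackets, one possible variation (not part of the original answer) is to use the regex only to locate the assignment and let json.JSONDecoder.raw_decode read the value from that position. The helper name extract_js_array and the sample data below are hypothetical, chosen only to illustrate the idea.

```python
import json
import re


def extract_js_array(html, name):
    """Return the JSON array assigned to `var <name>` in html, or None if absent.

    Hypothetical helper for illustration; assumes the value is valid JSON.
    """
    # Match up to (but not including) the opening bracket of the array.
    m = re.search(r'var\s+' + re.escape(name) + r'\s*=\s*(?=\[)', html)
    if m is None:
        return None
    # raw_decode parses one JSON value starting at the given index
    # and ignores whatever follows it (here, the trailing semicolon).
    value, _end = json.JSONDecoder().raw_decode(html, m.end())
    return value


# Made-up sample with a nested array, which a non-greedy \[.*?\] would truncate.
html = 'var ajaxsearch = true;\nvar sizes = [{"id":"1","tags":["a","b"]}];\nvar x = 1;'
print(extract_js_array(html, "sizes"))
```

This keeps the regex simple and leaves the bracket matching to the JSON parser, at the cost of raising json.JSONDecodeError if the assigned value is not valid JSON.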