I have some pages that I parse with Beautiful Soup. But they contain JS code like this:
<script type="text/javascript">
var utag_data = {
customer_id : "_PHL2883198554",
customer_type : "New",
loyalty_id : "N",
declined_loyalty_interstitial : "false",
site_version : "Desktop Site",
site_currency: "de_DE_EURO",
site_region: "uk",
site_language: "en-GB",
customer_address_zip : "",
customer_email_hash : "",
referral_source : "",
page_type : "product",
product_category_name : ["Lingerie"],
product_category_id :[jQuery("meta[name=defaultParent]").attr("content")],
product_id : ["5741462261401"],
product_image_url : ["http://images.urbanoutfitters.com/is/image/UrbanOutfitters/5741462261401_001_b?$detailmain$"],
product_brand : ["Pretty Polly"],
product_selling_price : ["20.0"],
promo_id : "6",
product_referral : ["WOMENS-SHAPEWEAR-LINGERIE-SOLUTIONS-EU"],
product_name : ["Pretty Polly Shape It Up Tummy Shaping Camisole"],
is_online_only : true,
is_back_in_stock : false
}
</script>
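How can I get certain values out of this input? Should I treat it as plain text, that is, store it in a variable, split it, and then pick out the data I need?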
Thanks
Answer 0 (score: 4)
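Once you have the script text, for example with
js_text = soup.find('script', type="text/javascript").text
you can use a regular expression to find the data. I'm sure there is an easier way to do this, but the regex shouldn't be hard either.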
import re
# capture "name : value" on each line; an optional trailing comma is left out of the value
regex = re.compile(r'^[ \t]*(\w+)[ \t]*:[ \t]*(.*?),?\s*$', re.MULTILINE)  # compile regex
pairs = re.findall(regex, js_text)  # one (name, value) tuple per matching line
js_text = [part.strip() for pair in pairs for part in pair]  # flatten and strip away extra white space
This returns a list of names and values in the form name | value | name2 | value2 ..., which you can work with later or convert into a dictionary.
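The conversion to a dictionary mentioned above could look roughly like the sketch below. It assumes the object literal keeps one name : value pair per line, as in the question. The sample text is a shortened stand-in for the real script contents, and the names PAIR_RE, parse_value and extract_utag_data are made up for illustration. Decoding the values with json.loads is an addition that is not part of the original answer; anything that is not valid JSON, such as the jQuery(...) call, is kept as raw text.

```python
import json
import re

# Shortened stand-in for the script text (in practice, the js_text obtained from the page).
js_text = '''
var utag_data = {
customer_id : "_PHL2883198554",
site_region: "uk",
product_category_id :[jQuery("meta[name=defaultParent]").attr("content")],
product_image_url : ["http://images.urbanoutfitters.com/is/image/UrbanOutfitters/5741462261401_001_b?$detailmain$"],
product_selling_price : ["20.0"],
is_online_only : true,
is_back_in_stock : false
}
'''

# One "name : value" pair per line; the optional trailing comma stays out of the value.
PAIR_RE = re.compile(r'^[ \t]*(\w+)[ \t]*:[ \t]*(.*?),?\s*$', re.MULTILINE)


def parse_value(raw):
    """Decode a value as JSON when possible, otherwise keep the raw text."""
    try:
        return json.loads(raw)
    except ValueError:
        # For example the jQuery(...) call, which is JavaScript code rather than data.
        return raw


def extract_utag_data(text):
    """Return a dict mapping each name in the object literal to its value."""
    return {name: parse_value(raw) for name, raw in PAIR_RE.findall(text)}


data = extract_utag_data(js_text)
print(data["product_selling_price"])  # ['20.0']
print(data["is_online_only"])         # True
```

A line-based regex like this breaks as soon as a value spans several lines or two pairs share a line, so it is best treated as a quick fix for this particular page rather than a general JavaScript parser.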