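I am trying to extract the price from the example page below using Python and a regex, but I am not getting any results.
What is the best way to capture the price (I don't care about the cents, only the dollar amount)?
http://www.walmart.com/store/2516/search?dept=4044&dept_name=Home&query=43888060
Relevant HTML: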
<div class="price-display csTile-price">
<span class="sup">$</span>
299
<span class="currency-delimiter">.</span>
<span class="sup">00</span>
</div>
What regex would capture "299", or is there an easier way to get it? Thanks!
Answer 0 (score: 0)
Answer 1 (score: 0)
Don't use a regex; use an HTML parser such as bs4:
from bs4 import BeautifulSoup
h = """<div class="price-display csTile-price">
<span class="sup">$</span>
299
<span class="currency-delimiter">.</span>
<span class="sup">00</span>
</div>"""
soup = BeautifulSoup(h, "html.parser")  # name the parser explicitly to avoid the bs4 warning
amount = soup.select_one("div.price-display.csTile-price span.sup").next_sibling.strip()
Which gives you:
299
Or select the currency-delimiter span and take the element just before it:
amount = soup.select_one("span.currency-delimiter").previous_element.strip()
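Which gives you the same result. Note also that the HTML in your question is generated dynamically by JavaScript, so you won't be able to fetch it with urllib.urlopen; it simply isn't in the returned source.
You need to use selenium, or mimic the following ajax call with requests.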
import requests
import json
js = requests.post("http://www.walmart.com/store/ajax/search",
data={"searchQuery":"store=2516&size=18&dept=4044&query=43888060"} ).json()
data = json.loads(js['searchResults'])
from pprint import pprint as pp
pp(data)
That gives you some JSON:
{u'algo': u'polaris',
u'blacklist': False,
u'cluster': {u'apiserver': {u'hostname': u'dfw-iss-api8.stg0',
u'pluginVersion': u'2.3.0'},
u'searchengine': {u'hostname': u'dfw-iss-esd.stg0.mobile.walmart.com'}},
u'count': 1,
u'offset': 0,
u'performance': {u'enrichment': {u'inventory': 70}},
u'query': {u'actualQuery': u'43888060',
u'originalQuery': u'43888060',
u'suggestedQueries': []},
u'queryTime': 181,
u'results': [{u'department': {u'name': u'Home', u'storeDeptId': -1},
u'images': {u'largeUrl': u'http://i5.walmartimages.com/asr/7b8fd3b1-8eed-4b68-971b-81188ddb238c_1.a181800cade4db9d42659e72fa31469e.jpeg?odnHeight=180&odnWidth=180',
u'thumbnailUrl': u'http://i5.walmartimages.com/asr/7b8fd3b1-8eed-4b68-971b-81188ddb238c_1.a181800cade4db9d42659e72fa31469e.jpeg?odnHeight=180&odnWidth=180'},
u'inventory': {u'isRealTime': True,
u'quantity': 1,
u'status': u'In Stock'},
u'isWWWItem': True,
u'location': {u'aisle': [], u'detailed': []},
u'name': u'Dyson Ball Multi-Floor Bagless Upright Vacuum, 206900-01',
u'price': {u'currencyUnit': u'USD',
u'isRealTime': True,
u'priceInCents': 29900},
u'productId': {u'WWWItemId': u'43888060',
u'productId': u'2FY1C7B7RMM4',
u'upc': u'88560900430'},
u'ratings': {u'rating': u'4.721',
u'ratingUrl': u'http://i2.walmartimages.com/i/CustRating/4_7.gif'},
u'reviews': {u'reviewCount': u'1436'},
u'score': u'0.507073'}],
u'totalCount': 1}
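This gives you all the information you are likely to need. All you are doing is posting the parameters from your URL, plus the store number, to http://www.walmart.com/store/ajax/search.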
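As a rough illustration of how the searchQuery string relates to the page URL, here is a minimal Python 3 sketch that rebuilds it with the standard library. The helper name build_search_query is hypothetical, the size=18 value is simply copied from the request above, and the assumption that the store number is the second path segment is based only on the one URL in the question.

```python
from urllib.parse import urlparse, parse_qs, urlencode


def build_search_query(page_url, size=18):
    """Rebuild the searchQuery payload from a store search page URL.

    Assumes a path of the form /store/<store number>/search, as in the
    question's URL; size=18 is copied from the request shown above.
    """
    parts = urlparse(page_url)
    store = parts.path.strip("/").split("/")[1]  # second path segment
    params = parse_qs(parts.query)
    fields = [
        ("store", store),
        ("size", size),
        ("dept", params["dept"][0]),
        ("query", params["query"][0]),
    ]
    return urlencode(fields)


url = "http://www.walmart.com/store/2516/search?dept=4044&dept_name=Home&query=43888060"
search_query = build_search_query(url)
print(search_query)
```

The resulting string would then be passed as data={"searchQuery": search_query} in the requests.post call.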
To get the price and name:
In [22]: import requests
In [23]: import json
In [24]: js = requests.post("http://www.walmart.com/store/ajax/search",
....: data={"searchQuery":"store=2516&size=18&dept=4044&query=43888060"} ).json()
In [25]: data = json.loads(js['searchResults'])
In [26]: res = data["results"][0]
In [27]: print(res["name"])
Dyson Ball Multi-Floor Bagless Upright Vacuum, 206900-01
In [28]: print(res["price"])
{u'priceInCents': 29900, u'isRealTime': True, u'currencyUnit': u'USD'}
In [29]: print(res["price"]["priceInCents"])
29900
In [30]: print(res["price"]["priceInCents"] // 100)
299
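If you want the dollars and cents separately, the conversion can be sketched with divmod. The helper name split_price is hypothetical, and the sample dict just mirrors the price entry printed above, so no network call is needed to try it.

```python
def split_price(price):
    """Split a price dict holding 'priceInCents' into (dollars, cents)."""
    dollars, cents = divmod(price["priceInCents"], 100)
    return dollars, cents


# Sample data copied from the 'price' entry in the response shown above.
sample = {"currencyUnit": "USD", "isRealTime": True, "priceInCents": 29900}
dollars, cents = split_price(sample)
print(dollars)
```

Integer division keeps the result an int on both Python 2 and Python 3, which a plain / would not do on Python 3.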
Answer 2 (score: -1)
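OK, just search for the digits (I added $ and . to the character class) and join the results into a single string (I used ''.join()).
>>> import re
>>> txt = """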
<div class="price-display csTile-price">
<span class="sup">$</span>
299
<span class="currency-delimiter">.</span>
<span class="sup">00</span>
</div>
"""
>>> ''.join(re.findall('[0-9$.]',txt.replace("\n","")))
'$299.00'
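Since the question only asks for the dollar amount, a variation on this regex idea might look like the sketch below. It assumes the price is laid out as in the snippet from the question (a $ sign followed by the whole-dollar digits), and the helper name dollars_from_snippet is hypothetical. It only helps once you actually have that HTML, which a plain HTTP fetch of the page will not return.

```python
import re

html = """<div class="price-display csTile-price">
<span class="sup">$</span>
299
<span class="currency-delimiter">.</span>
<span class="sup">00</span>
</div>"""


def dollars_from_snippet(snippet):
    """Return the whole-dollar amount after the first '$', or None."""
    text = re.sub(r"<[^>]+>", "", snippet)  # crude tag removal, fine for this snippet
    match = re.search(r"\$\s*(\d[\d,]*)", text)  # digits (and commas) after the $
    if match is None:
        return None
    return int(match.group(1).replace(",", ""))


dollars = dollars_from_snippet(html)
print(dollars)
```

Stripping tags with a regex is fragile on real pages, so an HTML parser remains the safer choice for anything beyond a small, known snippet.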