Question

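I am trying to use Python and regex to extract the price from the example site below, but I am not getting any results.

How can I best capture the price (I don't care about the cents, only the dollar amount)?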

http://www.walmart.com/store/2516/search?dept=4044&dept_name=Home&query=43888060

Relevant HTML:

<div class="price-display csTile-price">
       <span class="sup">$</span>
       299
       <span class="currency-delimiter">.</span>
       <span class="sup">00</span>
</div>

Would a regular expression capture "299", or is there an easier way to get this? Thanks!

Answer 1

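With regex it can be a little tricky, depending on how precise your pattern needs to be. I quickly typed something up here: https://regex101.com/r/lF5vF2/1

You should be able to take the idea and modify it to fit your actual needs.

Kind regards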
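The linked pattern is not reproduced here, so the following is only a minimal sketch of one way a regex could pull out the dollar amount. It assumes the markup looks exactly like the snippet in the question; the pattern and the name `dollars_from_html` are illustrative and not taken from the link.

```python
import re

# Sample markup copied from the question.
html = """<div class="price-display csTile-price">
       <span class="sup">$</span>
       299
       <span class="currency-delimiter">.</span>
       <span class="sup">00</span>
</div>"""

# Assumed pattern: the dollar amount sits between the "$" span and the
# currency-delimiter span, with only whitespace around it.
PRICE_RE = re.compile(
    r'<span class="sup">\$</span>\s*([\d,]+)\s*<span class="currency-delimiter">'
)


def dollars_from_html(markup):
    """Return the whole-dollar amount as an int, or None if nothing matches."""
    match = PRICE_RE.search(markup)
    if match is None:
        return None
    # Drop thousands separators such as "1,299" before converting.
    return int(match.group(1).replace(",", ""))


print(dollars_from_html(html))  # prints 299
```

A pattern this specific breaks as soon as the site changes an attribute or reorders the spans, which is the trade-off between precision and robustness mentioned above.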

Answer 2

Don't use a regex; use an HTML parser like bs4:

from bs4 import BeautifulSoup
h = """<div class="price-display csTile-price">
       <span class="sup">$</span>
       299
       <span class="currency-delimiter">.</span>
       <span class="sup">00</span>
</div>"""
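# Name the parser explicitly so bs4 does not warn or pick a different one per machine.
soup = BeautifulSoup(h, "html.parser")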

amount = soup.select_one("div.price-display.csTile-price span.sup").next_sibling.strip()

Which gives you the string 299.

Or select the currency-delimiter span and get the previous element:

amount = soup.select_one("span.currency-delimiter").previous_element.strip()

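Which gives you the same result. The HTML in your question is also generated dynamically by JavaScript, so you won't get it with urllib.urlopen; it simply isn't in the returned source.

You need to use selenium, or mimic the following ajax call with requests.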

import requests
import json
js = requests.post("http://www.walmart.com/store/ajax/search",
                    data={"searchQuery":"store=2516&size=18&dept=4044&query=43888060"} ).json()

data = json.loads(js['searchResults'])

from pprint import pprint as pp
pp(data)

That gives you some JSON:

{u'algo': u'polaris',
 u'blacklist': False,
 u'cluster': {u'apiserver': {u'hostname': u'dfw-iss-api8.stg0',
                             u'pluginVersion': u'2.3.0'},
              u'searchengine': {u'hostname': u'dfw-iss-esd.stg0.mobile.walmart.com'}},
 u'count': 1,
 u'offset': 0,
 u'performance': {u'enrichment': {u'inventory': 70}},
 u'query': {u'actualQuery': u'43888060',
            u'originalQuery': u'43888060',
            u'suggestedQueries': []},
 u'queryTime': 181,
 u'results': [{u'department': {u'name': u'Home', u'storeDeptId': -1},
               u'images': {u'largeUrl': u'http://i5.walmartimages.com/asr/7b8fd3b1-8eed-4b68-971b-81188ddb238c_1.a181800cade4db9d42659e72fa31469e.jpeg?odnHeight=180&odnWidth=180',
                           u'thumbnailUrl': u'http://i5.walmartimages.com/asr/7b8fd3b1-8eed-4b68-971b-81188ddb238c_1.a181800cade4db9d42659e72fa31469e.jpeg?odnHeight=180&odnWidth=180'},
               u'inventory': {u'isRealTime': True,
                              u'quantity': 1,
                              u'status': u'In Stock'},
               u'isWWWItem': True,
               u'location': {u'aisle': [], u'detailed': []},
               u'name': u'Dyson Ball Multi-Floor Bagless Upright Vacuum, 206900-01',
               u'price': {u'currencyUnit': u'USD',
                          u'isRealTime': True,
                          u'priceInCents': 29900},
               u'productId': {u'WWWItemId': u'43888060',
                              u'productId': u'2FY1C7B7RMM4',
                              u'upc': u'88560900430'},
               u'ratings': {u'rating': u'4.721',
                            u'ratingUrl': u'http://i2.walmartimages.com/i/CustRating/4_7.gif'},
               u'reviews': {u'reviewCount': u'1436'},
               u'score': u'0.507073'}],
 u'totalCount': 1}

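This gives you all the information you are likely to need. All you are doing is posting the parameters from your URL, plus the store number, to http://www.walmart.com/store/ajax/search.

To get the price and name: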
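As a rough sketch of how the posted parameters relate to the page URL, the helper below rebuilds the searchQuery string from the URL in the question. The function name `build_search_query` is hypothetical, and `size=18` is simply copied from the request above (its meaning is an assumption). The sketch only builds the payload; it does not check whether the endpoint still accepts it.

```python
from urllib.parse import urlsplit, parse_qs


def build_search_query(page_url, size=18):
    """Build the searchQuery string from a store search page URL.

    size=18 is copied from the request in the answer; what it controls
    is an assumption (presumably the number of results returned).
    """
    parts = urlsplit(page_url)
    # The path is assumed to look like /store/<store number>/search.
    segments = [s for s in parts.path.split("/") if s]
    store = segments[1]
    params = parse_qs(parts.query)
    return "store={}&size={}&dept={}&query={}".format(
        store, size, params["dept"][0], params["query"][0]
    )


url = "http://www.walmart.com/store/2516/search?dept=4044&dept_name=Home&query=43888060"
payload = {"searchQuery": build_search_query(url)}
print(payload)
```

The resulting dict can be passed as the data argument of requests.post, as in the call above.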

In [22]: import requests

In [23]: import json

In [24]: js = requests.post("http://www.walmart.com/store/ajax/search",
   ....:                     data={"searchQuery":"store=2516&size=18&dept=4044&query=43888060"} ).json()

In [25]: data = json.loads(js['searchResults'])

In [26]: res = data["results"][0]

In [27]: print(res["name"])
Dyson Ball Multi-Floor Bagless Upright Vacuum, 206900-01

In [28]: print(res["price"])
{u'priceInCents': 29900, u'isRealTime': True, u'currencyUnit': u'USD'}
In [29]: print(res["price"]["priceInCents"])
29900

In [30]: print(res["price"]["priceInCents"] // 100)
299
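Since only the dollar amount matters, the cents can be dropped with integer arithmetic. In Python 3 the `/` operator always returns a float, so `divmod` or `//` is the safer choice. This is a minimal sketch; the helper name `whole_dollars` is made up for illustration, and the dict is copied from the output above.

```python
def whole_dollars(price):
    """Return the dollar part of a price dict shaped like res["price"]."""
    dollars, _cents = divmod(price["priceInCents"], 100)
    return dollars


# Price dict copied from the response shown above.
price = {"priceInCents": 29900, "isRealTime": True, "currencyUnit": "USD"}
print(whole_dollars(price))  # prints 299
```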

Answer 3

OK, just search for the digits (I added $ and .) and join the results into a single string (I used ''.join()).

>>> import re
>>> txt = """
      <div class="price-display csTile-price">
          <span class="sup">$</span>
            299
          <span class="currency-delimiter">.</span>
          <span class="sup">00</span>
      </div>
      """


>>> ''.join(re.findall('[0-9$.]',txt.replace("\n","")))
'$299.00'
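To keep only the dollar amount, the joined string can be split on the decimal point. The following is a small sketch of that extra step, assuming the fragment contains no other digits, dollar signs, or dots (any such character elsewhere in the markup would end up in the result).

```python
import re

txt = """
      <div class="price-display csTile-price">
          <span class="sup">$</span>
            299
          <span class="currency-delimiter">.</span>
          <span class="sup">00</span>
      </div>
      """

# The character class never matches newlines, so no replace() is needed here.
price_text = "".join(re.findall(r"[0-9$.]", txt))

# Strip the leading "$" and keep everything before the decimal point.
dollars = int(price_text.lstrip("$").split(".")[0])

print(price_text, dollars)  # prints $299.00 299
```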

Using regex or something else to capture website data

3 answers: