Question

I am using a web scraper to fetch some data. I store the data in the variable price. The type of price is:

<class 'bs4.element.NavigableString'>

The type of each element of price is:

<type 'unicode'>

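Basically, price contains some spaces and newlines followed by: $520. I want to strip all the extra symbols and recover only the number 520. I have already written a naive solution: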

def reducePrice(price):
    key = 0      # becomes 1 once the '$' sign has been seen
    string = ""
    for i in price:
        if key == 1:
            # keep every character that follows the '$'
            string = string + i
        if i == '$':
            key = 1
    return string

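But I would like a more elegant solution: convert price to str and then manipulate it with str methods. I have searched a lot on the web and in other forum posts. The best I could come up with is: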

p = "".join(price)

With that I can generate one large unicode value. I would appreciate any hint (I am using Python 2.7 on Ubuntu).

Edit: I am adding my spider in case you need it:

import requests
from bs4 import BeautifulSoup

def spider(max_pages):
    page = 1
    while page <= max_pages:
        url = "http://www.lider.cl/walmart/catalog/product/productDetails.jsp?cId=CF_Nivel2_000021&productId=PROD_5913&skuId=5913&pId=CF_Nivel1_000004&navAction=jump&navCount=12"
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text)
        title = ""
        price = ""
        # product name
        for link in soup.findAll('span', {'itemprop': 'name'}):
            title = link.string
        # price: iterate over the children of the <em> element
        for link in soup.find('em', {'class': 'oferLowPrice fixPriceOferUp  '}):
            price = link.string

        print(title + '=' + str(reducePrice(price)))
        page += 1

spider(1)

Edit 2: Thanks to Martin and mASOUD, I was able to produce a solution using str methods:

def reducePrice(price):
   return int((("".join(("".join(price)).split())).replace("$","")).encode())

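This method returns an int. That was not my original question, but it is the next step in my project. I added it because we could not cast the unicode to int directly, whereas after first producing a str with encode() we can.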
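For comparison, here is a shorter sketch of the same idea using only string methods. It assumes the price always looks like optional whitespace, a "$", then digits (as in the "$520" example above); the name reduce_price_simple is made up for illustration. As far as I know, int() also accepts a unicode string of digits in Python 2.7, so the encode() step may not be needed.

```python
def reduce_price_simple(price):
    # Hypothetical helper, not part of the original post.
    # Drop surrounding spaces/newlines, then the leading dollar sign.
    digits = price.strip().lstrip("$")
    # int() parses the remaining digit characters.
    return int(digits)


print(reduce_price_simple(u"\n    $520  \n"))
```

If the scraped text can contain thousands separators or other symbols, this sketch would raise a ValueError, so a stricter extraction would be needed in that case.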

Answer 1

Use a regex to extract the price from the Unicode string:

import re

def reducePrice(price):
    # find the first run of digits, e.g. in u'  $500  '
    match = re.search(r'\d+', price)
    price = match.group()  # returns u"500"
    price = str(price)  # convert the unicode u"500" to a plain str
    return price

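Even though this function converts the Unicode to a "regular" string as you asked, is there any reason you want that? Unicode strings work the same way as regular strings. That is, u"500" is almost the same as "500".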
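A small sketch of that point, assuming the text holds only ASCII digits as in the example (the variable names are illustrative only):

```python
price_text = u"500"

# A unicode string of ASCII characters compares equal to the plain literal.
same_as_plain = (price_text == "500")

# It can be converted to a number without an explicit str() step.
as_number = int(price_text)

# The usual string methods are available on it as well.
padded = price_text.zfill(6)

print(same_as_plain, as_number, padded)
```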

Python: converting a unicode variable to a string variable
