Question

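I'm new to Scrapy (scrapy.org) and a beginner in Python. The brand and title properties (in Java OOP terms) of my item hold values that contain tabs, spaces, and newline characters. How can I trim them so that the following two properties hold these plain string values?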

item['brand'] = "KORAL ACTIVEWEAR"
item['title'] = "Boom Leggings"

Here is the data structure:

{'store_id': 870, 'sale_price_low': [], 'brand': [u'\n                KORAL ACTIVEWEAR\n              '], 'currency': 'AUD', 'retail_price': [u'$140.00'], 'category': [u'Activewear'], 'title': [u'\n                Boom Leggings\n              '], 'url': [u'/boom-leggings-koral-activewear/vp/v=1/1524019474.htm?folderID=13331&fm=other-shopbysize-viewall&os=false&colorId=68136'], 'sale_price_high': [], 'image_url': [u'  https://images-na.sample-store.com/images/G/01/samplestore/p/prod/products/kacti/kacti3025868136/kacti3025868136_q1_2-0._SH20_QL90_UY365_.jpg\n'], 'category_link': 'https://www.samplestore.com/clothing-activewear/br/v=1/13331.htm?baseIndex=500', 'store': 'SampleStore'}

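Using a regular expression search, I was able to reduce the price to just the digits and the decimal point, although I suspect this will go wrong once prices contain comma separators.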

price = re.compile('[0-9\.]+')
item['retail_price'] = filter(price.search, item['retail_price'])

Answer 1

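It looks like all you need to do, at least for this example, is strip the whitespace off both ends of the brand and title values. You don't need a regular expression for that; just call the strip method.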

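However, your brand isn't a single string; it's a list of strings (even if the list holds only one string). So if you try to strip it directly, or run a regular expression on it, you will get an AttributeError or TypeError from treating that list as a string.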
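To make the failure concrete, here is a small sketch in Python 3 syntax. It uses a shortened copy of the sample item, and shows that the list itself has no strip method while the string inside it does.

```python
# Shortened copy of the sample item from the question.
item = {'brand': ['\n                KORAL ACTIVEWEAR\n              ']}

try:
    item['brand'].strip()            # a list has no .strip() method
except AttributeError as exc:
    error = str(exc)

first_brand = item['brand'][0].strip()   # the string inside the list does
print(error)
print(first_brand)
```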

To fix this, you need to map strip over all of the strings, using either a list comprehension or the map function:

item['brand'] = [brand.strip() for brand in item['brand']]
# Python 2: the values are unicode objects, so use unicode.strip rather than str.strip
item['title'] = map(unicode.strip, item['title'])

...whichever of the two you find easier to understand.

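If you have other examples with runs of whitespace embedded inside the string, and you want to turn every such run into exactly one space character, you need to use the sub method with a regular expression: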

# Python 2 syntax; requires "import re"
item['brand'] = [re.sub(ur'\s+', u' ', brand.strip()) for brand in item['brand']]

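Notice the u prefix. In Python 2, you need a u prefix to make a unicode literal instead of a str (encoded bytes) literal. And it is important to use Unicode patterns against Unicode strings, even if the pattern itself doesn't care about any non-ASCII characters. (If all of this seems like a pointless pain and a bug magnet, well, it is; that is the main reason Python 3 exists.)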
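For readers on Python 3, where every str is already Unicode and the ur prefix is no longer valid syntax, the same idea might look like the sketch below. The helper name collapse_whitespace and the sample input are made up for illustration.

```python
import re

def collapse_whitespace(values):
    """Strip each string and squeeze internal whitespace runs to one space."""
    return [re.sub(r'\s+', ' ', value.strip()) for value in values]

# Hypothetical input with whitespace both around and inside the text.
brands = collapse_whitespace(['\n   KORAL \t\n ACTIVEWEAR\n  '])
print(brands)
```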

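As for retail_price, the same basic observations apply. Again, it's a list of strings, not just a string. And again, you probably don't need regular expressions. Assuming the price is always a $ (or another single-character currency marker) followed by a number, just slice off the $ and call float or Decimal on the rest: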

item['retail_price'] = [float(price[1:]) for price in item['retail_price']]
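As a variation on the slice above, here is a sketch that uses Decimal and also drops comma separators, which the question was worried about. It assumes Python 3, that the comma is a thousands separator rather than a decimal mark, and that the marker is exactly one character; the name parse_price and the second sample price are invented for illustration.

```python
from decimal import Decimal

def parse_price(text):
    """Drop a one-character currency marker and any thousands separators."""
    # Assumption: ',' groups thousands and '.' is the decimal point.
    return Decimal(text.strip()[1:].replace(',', ''))

prices = [parse_price(p) for p in ['$140.00', '$1,299.50']]
print(prices)
```

Decimal keeps the cents exact, which is usually preferable to float when the values are money.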

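However, if your examples look different, with arbitrary extra characters on either side of the price, you can use re.search here, but you will still need to map it over the list and use a Unicode pattern.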

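You also need to pull the matching group out of the search result and handle empty or invalid strings in some way (for those, the search returns None, which you can't convert to a float). You have to decide what to do about them, but from your attempt with filter it looks like you just want to skip them. This is complicated enough that I would do it in multiple steps: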

prices = item['retail_price']
matches = (re.search(r'[0-9.]+', price) for price in prices)
# keep only the strings that matched, and take the matched text
groups = (match.group() for match in matches if match)
item['retail_price'] = map(float, groups)

...or maybe wrap that up in a function.
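One possible way to wrap those steps in a function is sketched below, in Python 3 syntax. The names PRICE_RE and extract_prices and the sample inputs are hypothetical. Strings without a match are skipped, mirroring the filter attempt in the question.

```python
import re

# Same character class as in the steps above, compiled once.
PRICE_RE = re.compile(r'[0-9.]+')

def extract_prices(raw_prices):
    """Return a float for every string that contains a price-like number."""
    matches = (PRICE_RE.search(raw) for raw in raw_prices)
    # Strings with no match yield None and are skipped.
    return [float(match.group()) for match in matches if match]

prices = extract_prices(['  $140.00 each', '', 'n/a'])
print(prices)
```

Note that this pattern also matches a lone dot, for which float would raise ValueError, so a stricter pattern may be needed for messier input.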

Answer 2

You can define a function like the one below, which takes an object and returns it with all of its leaves normalized.

import six

def normalize(obj):
    if isinstance(obj, six.string_types):
        return ' '.join(obj.split())
    elif isinstance(obj, list):
        return [normalize(x) for x in obj]
    elif isinstance(obj, dict):
        return {k:normalize(v) for k,v in obj.items()}
    return obj

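This is a recursive function that doesn't modify the original object but returns a normalized copy. You can use it to normalize plain strings too.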
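Both properties can be checked with a short sketch. It assumes Python 3, so str stands in for six.string_types and the six import is not needed; the sample data is a shortened, made-up version of the item.

```python
def normalize(obj):
    """Python 3 variant of the function above: str replaces six.string_types."""
    if isinstance(obj, str):
        return ' '.join(obj.split())
    if isinstance(obj, list):
        return [normalize(x) for x in obj]
    if isinstance(obj, dict):
        return {k: normalize(v) for k, v in obj.items()}
    return obj

original = {'title': ['\n   Boom \t Leggings\n  '], 'store_id': 870}
cleaned = normalize(original)                  # returns a new structure
plain = normalize('  KORAL\n ACTIVEWEAR ')     # works on a bare string too
print(cleaned, plain)
```

Because split() with no arguments splits on any run of whitespace, internal tabs and newlines collapse to a single space as well.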

For your example item:

>> item = {'store_id': 870, 'sale_price_low': [], 'brand': [u'\n                KORAL ACTIVEWEAR\n              '], 'currency': 'AUD', 'retail_price': [u'$140.00'], 'category': [u'Activewear'], 'title': [u'\n                Boom Leggings\n              '], 'url': [u'/boom-leggings-koral-activewear/vp/v=1/1524019474.htm?folderID=13331&fm=other-shopbysize-viewall&os=false&colorId=68136'], 'sale_price_high': [], 'image_url': [u'  https://images-na.sample-store.com/images/G/01/samplestore/p/prod/products/kacti/kacti3025868136/kacti3025868136_q1_2-0._SH20_QL90_UY365_.jpg\n'], 'category_link': 'https://www.samplestore.com/clothing-activewear/br/v=1/13331.htm?baseIndex=500', 'store': 'SampleStore'}

>> print(normalize(item))
{'category': [u'Activewear'], 'store_id': 870, 'sale_price_low': [], 'title': [u'Boom Leggings'], 'url': [u'/boom-leggings-koral-activewear/vp/v=1/1524019474.htm?folderID=13331&fm=other-shopbysize-viewall&os=false&colorId=68136'], 'brand': [u'KORAL ACTIVEWEAR'], 'currency': 'AUD', 'image_url': [u'https://images-na.sample-store.com/images/G/01/samplestore/p/prod/products/kacti/kacti3025868136/kacti3025868136_q1_2-0._SH20_QL90_UY365_.jpg'], 'category_link': 'https://www.samplestore.com/clothing-activewear/br/v=1/13331.htm?baseIndex=500', 'sale_price_high': [], 'retail_price': [u'$140.00'], 'store': 'SampleStore'}

Python - Removing tabs and newlines in an object
