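Cleaning messy strings with Python

Time: 2014-10-03 21:02:11

Tags: python regex

I have a messy inventory list (about 10K items) to clean up, and I am using regular expressions in Python to tackle it. Here is a small sample of my list: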

product_pool=["#101 BUMP STOPPER RAZOR BUMP TREATMENT SENSITIVE SKIN FORMULA", 
              "#W65066CS - Cell phone, Triangle wand & 5 sections lip gloss", 
              "(Archived)S.O.S. Steel Wool Soap Pads", 
              "(ARCHIVED) HTH Spa pH Increaser",
              "****GLUE STICKS",
              "-20°F Splash Windshield Washer Fluid",
              "01127 – Fing’rs Mighty Drop, 3g",
              "10-01130-Brush On Nail Glue (Three Bond TB1743)",
              "Aveeno® Continuous Protection Sunblock Spray Products"]

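Ideally, I would like to remove symbols such as #, *, ®, –, °F, numbers such as 101, 10-01130-, 01127, and the words inside parentheses such as (Archived) and (Three Bond TB1743). The final output would look like this: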

product_pool=["BUMP STOPPER RAZOR BUMP TREATMENT SENSITIVE SKIN FORMULA", 
              "Cell phone, Triangle wand 5 sections lip gloss", 
              "S.O.S. Steel Wool Soap Pads", 
              "HTH Spa pH Increaser",
              "GLUE STICKS",
              "Splash Windshield Washer Fluid",
              "Fing'rs Mighty Drop",
              "Brush On Nail Glue",
              "Aveeno Continuous Protection Sunblock Spray Products"]

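My approach was to split each product name on the symbols I don't want to keep, and then keep only the tokens that consist entirely of letters. This doesn't work very well, though, so I would appreciate any suggestions!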

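import re

for product in product_pool:
    # Split on spaces, ", ", and any of the characters ) | * -
    product_split = re.split(' |, |[) |* |-]', product)
    # Keep only the tokens made up entirely of letters
    print(' '.join(ch for ch in product_split if ch.isalpha()))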

The output looks like this:

BUMP STOPPER RAZOR BUMP TREATMENT SENSITIVE SKIN FORMULA
Cell phone Triangle wand sections lip gloss
Steel Wool Soap Pads (S.O.S. is missing)
HTH Spa pH Increaser
GLUE STICKS
Splash Windshield Washer Fluid
Mighty Drop (Fing'rs is missing)
Brush On Nail Glue Bond
Continuous Protection Sunblock Spray Products (Aveeno is missing)

2 answers:

Answer 0 (score: 3)

product_pool=["#101 BUMP STOPPER RAZOR BUMP TREATMENT SENSITIVE SKIN FORMULA", 
              "#W65066CS - Cell phone, Triangle wand & 5 sections lip gloss", 
              "(Archived)S.O.S. Steel Wool Soap Pads", 
              "(ARCHIVED) HTH Spa pH Increaser",
              "****GLUE STICKS",
              "-20°F Splash Windshield Washer Fluid",
              "01127 – Fing’rs Mighty Drop, 3g",
              "10-01130-Brush On Nail Glue (Three Bond TB1743)",
              "Aveeno® Continuous Protection Sunblock Spray Products"]

This leaves some extra spaces behind, but it might be one way to do it.

import string
goodChars = string.ascii_letters + '.' + ' '
cleaned = [''.join(i for i in word if i in goodChars) for word in product_pool]

>>> cleaned
[' BUMP STOPPER RAZOR BUMP TREATMENT SENSITIVE SKIN FORMULA',
 'WCS  Cell phone Triangle wand   sections lip gloss',
 'ArchivedS.O.S. Steel Wool Soap Pads',
 'ARCHIVED HTH Spa pH Increaser',
 'GLUE STICKS',
 'F Splash Windshield Washer Fluid',
 '  Fingrs Mighty Drop g',
 'Brush On Nail Glue Three Bond TB',
 'Aveeno Continuous Protection Sunblock Spray Products']

You can experiment with which characters you want to keep; take a look at string.punctuation, string.ascii_letters, and so on.
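The extra spaces left in the output above can be collapsed by splitting on whitespace and re-joining. The following is only a minimal sketch of that idea on top of the same whitelist; the helper name keep_allowed and the two-item sample are made up for illustration.

```python
import string

# Characters to keep: ASCII letters, the period, and the space.
GOOD_CHARS = set(string.ascii_letters + ". ")

def keep_allowed(text):
    """Hypothetical helper: drop characters outside GOOD_CHARS, then tidy spacing."""
    filtered = "".join(ch for ch in text if ch in GOOD_CHARS)
    # str.split() with no argument splits on runs of whitespace and ignores
    # leading/trailing whitespace, so re-joining normalizes the spacing.
    return " ".join(filtered.split())

sample = [
    "#W65066CS - Cell phone, Triangle wand & 5 sections lip gloss",
    "01127 – Fing’rs Mighty Drop, 3g",
]
cleaned_sample = [keep_allowed(s) for s in sample]
print(cleaned_sample)
```

Leftover fragments such as WCS or the trailing g are untouched by this step; it only fixes the spacing.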

Answer 1 (score: 1)

You can use a regex with re.sub instead:

import re

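# Raw string avoids the invalid-escape warning for \s; the global flag (?i)
# has to come first, since Python 3.11+ rejects it anywhere else.
pattern = r'(?i)[^a-zA-Z\s]|archived'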
results = [re.sub(pattern, '', s).strip() for s in product_pool]
# ['BUMP STOPPER RAZOR BUMP TREATMENT SENSITIVE SKIN FORMULA',
#  'WCS  Cell phone Triangle wand   sections lip gloss',
#  'SOS Steel Wool Soap Pads',
#  'HTH Spa pH Increaser',
#  'GLUE STICKS',
#  'F Splash Windshield Washer Fluid',
#  'Fingrs Mighty Drop g',
#  'Brush On Nail Glue Three Bond TB',
#  'Aveeno Continuous Protection Sunblock Spray Products']

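The regex pattern [^...] matches anything other than the characters listed in place of the dots. You can then use re.sub to replace all of those matches with an empty string, which effectively removes them. The second alternative in the pattern matches the word archived, and (?i) tells the regex to ignore case when matching.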
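As the output above shows, a single character-class pattern still leaves code fragments such as WCS and TB behind and strips the periods from S.O.S. One possible refinement is to apply several smaller substitutions in sequence. The sketch below is not part of the original answer: the function name clean_product and every heuristic in it (what counts as a leading code, the shape of a trailing quantity, the set of characters kept) are assumptions tuned to the nine sample strings from the question.

```python
import re

product_pool = [
    "#101 BUMP STOPPER RAZOR BUMP TREATMENT SENSITIVE SKIN FORMULA",
    "#W65066CS - Cell phone, Triangle wand & 5 sections lip gloss",
    "(Archived)S.O.S. Steel Wool Soap Pads",
    "(ARCHIVED) HTH Spa pH Increaser",
    "****GLUE STICKS",
    "-20°F Splash Windshield Washer Fluid",
    "01127 – Fing’rs Mighty Drop, 3g",
    "10-01130-Brush On Nail Glue (Three Bond TB1743)",
    "Aveeno® Continuous Protection Sunblock Spray Products",
]

def clean_product(name):
    """Hypothetical helper: a chain of heuristics tuned to the sample list."""
    # Normalize the curly apostrophe so it survives the final filter.
    name = name.replace("\u2019", "'")
    # Drop anything inside parentheses, e.g. "(Archived)".
    name = re.sub(r"\([^)]*\)", " ", name)
    # Drop degree units such as "°F" (assumed to be the only unit symbols).
    name = re.sub(r"°[FC]", " ", name)
    # Drop a leading run of tokens that contain a digit, plus their separators,
    # e.g. "#101 ", "#W65066CS - ", "10-01130-".
    name = re.sub(r"^\W*(?:\w*\d\w*\W+)+", "", name)
    # Drop a trailing ", <number><short unit>" such as ", 3g" (assumed shape).
    name = re.sub(r",\s*\d+(?:\.\d+)?\s*[A-Za-z]{1,3}\s*$", "", name)
    # Replace every remaining character that is not on the keep list.
    name = re.sub(r"[^A-Za-z0-9\s.,']", " ", name)
    # Collapse runs of whitespace and trim both ends.
    return " ".join(name.split())

for cleaned_name in map(clean_product, product_pool):
    print(cleaned_name)
```

These rules are deliberately narrow. For example, the leading-code rule would also strip a brand name that starts with a digit, so on the full 10K list it would be worth spot-checking the results before trusting them.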