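Data cleaning: handling many different formats from user input

Time: 2018-08-06 16:28:15

Tags: python python-3.x list-comprehension data-cleaning

I have some dirty data that comes from user input, so it is inconsistent. Every entry is either a single number or a range of numbers.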

import re

number_ranges = [
    '11.6', '665.690, 705.715', '740.54-830.18ABC;900-930ABC', '1200',
    '2100 / 2200; 2320 / 2350', '2300-2400 / 2500-2560 / 2730-2740'
]

# Merge everything into one comma-separated string.
number_ranges = ','.join(number_ranges)

# Drop spaces and stray letters, then treat ';' as another ',' separator.
number_ranges = number_ranges.replace(' ', '')
number_ranges = re.sub(r"[a-zA-Z]+", "", number_ranges)
number_ranges = re.sub(r"[;]+", ",", number_ranges)

number_ranges = number_ranges.split(',')

This is the resulting list:

[
    '11.6', '665.690', '705.715', '740.54-830.18', '900-930', '1200', '2100/2200',
    '2320/2350', '2300-2400/2500-2560/2730-2740'
]

From here, I know roughly the logic I want:

for i in number_ranges:
    if (len(i) >5) and ('.' in i) and ('-' not in i):
        i = i.replace('.','-')

for i in number_ranges:
    if ('-' in i) and ('/' in i):
        i = i.split('/')

for i in number_ranges:
    if len(i) < 3:
        i = str(int(i) * 1000)

I also tried this approach:

for n, i in enumerate(number_ranges):
    if (len(i) >5) and ('.' in i) and ('-' not in i):
        number_ranges[n] = i.replace('.','-')
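The two attempts above differ in one important way, which a small sketch can make concrete: assigning to the loop variable `i` only rebinds a local name, while assigning to `number_ranges[n]` writes back into the list. The two-element sample list and the names `unchanged` and `changed` are made up for illustration.

```python
number_ranges = ['665.690', '900-930']

# Rebinding the loop variable only changes the local name `i`;
# the list element itself is left untouched.
for i in number_ranges:
    if len(i) > 5 and '.' in i and '-' not in i:
        i = i.replace('.', '-')
unchanged = list(number_ranges)

# Assigning through the index writes the new string back into the list.
for n, i in enumerate(number_ranges):
    if len(i) > 5 and '.' in i and '-' not in i:
        number_ranges[n] = i.replace('.', '-')
changed = list(number_ranges)

print(unchanged)  # ['665.690', '900-930']
print(changed)    # ['665-690', '900-930']
```

A list comprehension that builds a new list would work as well; the index assignment is shown only because it mirrors the second attempt.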

665.690 should be 665-690, 740.54-830.18ABC should be 741-830, 2100/2200 should be 2100-2200, and 11.6 should be 11600.

The final result should be a list of ranges, each a tuple of integers, so:

[(11600,), (665, 690), (705, 715), (741, 830), (900, 930), (1200,), (2100, 2200), (2320, 2350), (2300, 2400), (2500, 2560), (2730, 2740)]
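One possible way to get from the raw strings to this list is sketched below. It is only an illustration under stated assumptions, not a definitive implementation: the names `parse_ranges`, `parse_one` and `thousands_below` are hypothetical, the cut-off of 30 for "a dotted value that really means thousands" is borrowed from the `float(i) < 30` check in the edit further down, and negative numbers are not handled.

```python
import re

def parse_ranges(raw_items, thousands_below=30):
    """Turn messy range strings into tuples of ints (illustrative sketch)."""
    result = []
    for item in raw_items:
        # Remove letters and whitespace, then split on ';' or ','.
        cleaned = re.sub(r'[A-Za-z\s]+', '', item)
        for chunk in re.split(r'[;,]', cleaned):
            if not chunk:
                continue
            # With both '-' and '/', assume '/' separates whole ranges.
            if '-' in chunk and '/' in chunk:
                parts = chunk.split('/')
            else:
                parts = [chunk]
            for part in parts:
                result.append(parse_one(part, thousands_below))
    return result

def parse_one(text, thousands_below):
    """Parse a single number or a single range into a tuple of ints."""
    if '-' in text or '/' in text:
        # '-' or '/' on its own is the range separator.
        bounds = re.split(r'[-/]', text)
    elif '.' in text:
        if float(text) < thousands_below:
            # Assumption: small dotted values such as 11.6 mean 11600.
            bounds = [float(text) * 1000]
        else:
            # Assumption: otherwise the dot is a range separator.
            bounds = text.split('.')
    else:
        bounds = [text]
    return tuple(round(float(b)) for b in bounds)

raw = [
    '11.6', '665.690, 705.715', '740.54-830.18ABC;900-930ABC', '1200',
    '2100 / 2200; 2320 / 2350', '2300-2400 / 2500-2560 / 2730-2740'
]
parsed = parse_ranges(raw)
print(parsed)
```

Keeping the per-range rules in one small function makes it easier to add or reorder rules as new input formats turn up.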

If needed, I can use this here:

# Build a new list: appending to number_ranges while iterating over it never terminates.
formatted = ["-".join(str(v) for v in pair) for pair in number_ranges]

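I know the logic, but not how to implement it.

I think what I am trying to figure out is how to replace characters or otherwise manipulate strings based on specific conditions.

These are the most common formats, so they are the ones I want to account for. I know I can never predict everything someone will type in, but I think I can handle more than 95% of cases.

Apologies in advance if I have left out any necessary information. I will provide it as soon as possible.

Thanks.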

Edit: I got it working with the following code:

import re

number_ranges = ','.join(number_ranges)

number_ranges = number_ranges.replace(' ', '')

number_ranges = re.sub(r"[a-zA-Z]+", "", number_ranges)

number_ranges = re.sub(r"[;]+", ",", number_ranges)

number_ranges = number_ranges.split(',')

for n, i in enumerate(number_ranges):
    if ('-' in i) and ('/' in i):
        number_ranges[n] = i.replace('/',',')

for n, i in enumerate(number_ranges):
    if ('-' not in i) and ('/' in i):
        number_ranges[n] = i.replace('/','-')

for n, i in enumerate(number_ranges):
    if ('-' not in i) and ('.' in i) and (len(i)>4):
        number_ranges[n] = i.replace('.','-')

for n, i in enumerate(number_ranges):
    if ('.' in i) and (len(i) <= 4) and (float(i) < 30):
        number_ranges[n] = str(round(float(i) * 1000))

number_ranges = [i.split(',') for i in number_ranges]

1 Answer:

Answer 0 (score: 0)

I tried to find a "pythonic" way to write this set of rules. Maybe it can give you some ideas, and it can certainly be improved.

number_ranges = [
    '11.6', '665.690, 705.715', '740.54-830.18ABC;900-930ABC', '1200',
    '2100 / 2200; 2320 / 2350', '2300-2400 / 2500-2560 / 2730-2740', '433.454', '345-654'
]

import re

def outer_split(rangetext):
    '''Split the input text to individual range text.'''
    # Rule:
    # if both characters are present, use the second one to split
    # and switch the first one to '-'

    # Each pair is (inner separator, outer separator).
    doubleseparators = ['-/', '.,', '-;', '/;']

    for c in doubleseparators:
        if c[0] in rangetext and c[1] in rangetext:
            outersplit = rangetext.split(c[1])
            outersplit = [s.replace(c[0], '-') for s in outersplit]
            break
    else:
        # No break: no separator pair matched, so keep the text as one range.
        outersplit = [rangetext, ]

    return outersplit


def inner_split(rangetext):
    '''Clean the range text and Split to [left, right] boundaries.'''

    rangetext = re.sub(r'[a-zA-Z ]+', '', rangetext)

    sep = '-'
    if sep in rangetext:
        innersplit = rangetext.split(sep)
    else:
        innersplit = [rangetext,]

    # The special '.' case:
    if len(innersplit)==1 and '.' in innersplit[0]:
        l, r = innersplit[0].split('.')
        if len(l)>2 or len(r)>2:
            innersplit = [l, r]
        else:
            innersplit = [str(float(innersplit[0])*1000), ]

    return innersplit


individualinputs = [individualinput for text in number_ranges
                    for individualinput in outer_split(text)]

[inner_split(textrange) for textrange in individualinputs]

The output is:

[['11600.0'],
 ['665', '690'],
 ['705', '715'],
 ['740.54', '830.18'],
 ['900', '930'],
 ['1200'],
 ['2100', '2200'],
 ['2320', '2350'],
 ['2300', '2400'],
 ['2500', '2560'],
 ['2730', '2740'],
 ['433', '454'],
 ['345', '654']]
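The output above is still a list of lists of strings, while the question asks for tuples of integers. A minimal sketch of that last step follows, assuming that rounding to the nearest integer is what is wanted (740.54 becomes 741). The names `nested` and `as_tuples` are made up here, and `nested` holds only a few rows copied from the output above.

```python
# A few rows copied from the output above.
nested = [['11600.0'], ['665', '690'], ['740.54', '830.18'], ['1200']]

# Parse each bound as a float first, so '11600.0' and '740.54' are accepted,
# then round to the nearest integer and freeze each range as a tuple.
as_tuples = [tuple(round(float(v)) for v in bounds) for bounds in nested]

print(as_tuples)  # [(11600,), (665, 690), (741, 830), (1200,)]
```

Note that Python 3's `round` rounds exact halves to the nearest even integer, which may matter if inputs such as 740.5 are expected.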