How to clean data by removing invalid values from a dictionary

Time: 2018-10-05 03:40:01

Tags: python python-3.x

ID,age,salary,suburb,language
P1,eighty two,60196.0,Toorak,English
P2,49,-16945514.0,St. Kilda,Chinese
P3,54,49775.0,Neverland,Italian

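I have the above dictionary. In the age column, some ages are written out in words. I want to replace them with None.

Similarly, salaries in the salary column that are negative or greater than the maximum salary need to be replaced with None, and invalid suburb names also need to be changed to None.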


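2 answers:

Answer 0: (score: 0)

Splitting each line and then working on each field is straightforward. There are many small errors to catch (for example, a salary that is not a number), but below is a simple example of this kind of processing.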

ok_suburbs = ['Toorak', 'St. Kilda', 'Redfern']

# Read the lines of the data file into <people>
with open("people_data.txt", "rt") as f:
    people = f.readlines()
del people[0]  # remove the header

for row in people:
    try:
        # strip() removes the trailing newline before splitting
        person_id, age, salary, suburb, language = row.strip().split(",")
    except ValueError:  # too few or too many fields
        print("Invalid data: " + row)
        continue

    try:
        age = str(int(age))
    except ValueError:  # age is not a whole number, e.g. "eighty two"
        age = None
    try:
        salary = float(salary)
        if salary < 0:
            salary = None
    except ValueError:  # salary is not a number
        salary = None
    if suburb not in ok_suburbs:
        suburb = None
    # TODO - rebuild the row from parts

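You should handle edge conditions, for example: malformed numbers, extra whitespace around fields, mixed case in suburb names (SuBUrB NamE), too few fields, too many fields, and so on.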
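
As a rough sketch of how those edge conditions might be handled, the function below trims whitespace, checks the field count, and matches suburbs case-insensitively. The names clean_row, VALID_SUBURBS and EXPECTED_FIELDS are made up for this illustration, and the suburb list is an assumption, not something given in the question.

```python
# Hypothetical list of accepted suburbs, lower-cased for case-insensitive matching
VALID_SUBURBS = {"toorak", "st. kilda", "redfern"}
EXPECTED_FIELDS = 5


def clean_row(line):
    """Return a cleaned list of five fields, or None if the row is malformed."""
    # Trim whitespace (and the trailing newline) around every field
    fields = [field.strip() for field in line.split(",")]
    if len(fields) != EXPECTED_FIELDS:
        return None  # too few or too many fields
    person_id, age, salary, suburb, language = fields

    # Age must parse as a non-negative integer
    try:
        age = int(age)
        if age < 0:
            age = None
    except ValueError:
        age = None

    # Salary must parse as a non-negative number
    try:
        salary = float(salary)
        if salary < 0:
            salary = None
    except ValueError:
        salary = None

    # Compare suburb names without regard to case
    if suburb.lower() not in VALID_SUBURBS:
        suburb = None

    return [person_id, age, salary, suburb, language]


print(clean_row("P1, eighty two ,60196.0, TOORAK ,English\n"))
print(clean_row("P2,49,-16945514.0,St. Kilda,Chinese"))
print(clean_row("P3,54"))
```

Returning None for a malformed row leaves it to the caller to decide whether to skip the row or log it. An upper salary limit is not checked here because the question does not say what the maximum is.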

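Answer 1: (score: 0)

It is not clear to me how this data is stored, since each row has 5 entries and a dictionary normally consists of key-value pairs. I will assume that the ID is used as the key and that the other four entries are stored as attributes of an object, which serves as the value. I will call this dictionary dict (note that this name shadows Python's built-in dict type, so a different name would be safer in real code). If you expect ages to be whole numbers of years, and the maximum salary is stored in max_salary, then the following should work: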

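for person in dict.values():
  # Age must be a non-negative integer
  if not isinstance(person.age, int) or person.age < 0:
    person.age = None
  # Salary must be a number between 0 and max_salary
  salary = person.salary
  if not isinstance(salary, (int, float)) or salary < 0 or salary > max_salary:
    person.salary = None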
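
The check above does not cover the suburb, which the question also wants validated. Here is a self-contained sketch that runs all three checks on the sample rows from the question. The MAX_SALARY value, the VALID_SUBURBS set and the function name invalidate_bad_fields are assumptions made up for the example, and the small PersonData class is only a stand-in for whatever object actually holds the values.

```python
class PersonData:
    """Minimal stand-in for the object stored as each dictionary value."""

    def __init__(self, age, salary, suburb, language):
        self.age = age
        self.salary = salary
        self.suburb = suburb
        self.language = language


MAX_SALARY = 1_000_000.0  # hypothetical upper bound
VALID_SUBURBS = {"Toorak", "St. Kilda", "Redfern"}  # hypothetical list


def invalidate_bad_fields(records, max_salary, valid_suburbs):
    """Replace invalid ages, salaries and suburbs with None, in place."""
    for person in records.values():
        if not isinstance(person.age, int) or person.age < 0:
            person.age = None
        salary = person.salary
        if not isinstance(salary, (int, float)) or not 0 <= salary <= max_salary:
            person.salary = None
        if person.suburb not in valid_suburbs:
            person.suburb = None


records = {
    "P1": PersonData("eighty two", 60196.0, "Toorak", "English"),
    "P2": PersonData(49, -16945514.0, "St. Kilda", "Chinese"),
    "P3": PersonData(54, 49775.0, "Neverland", "Italian"),
}
invalidate_bad_fields(records, MAX_SALARY, VALID_SUBURBS)
for ID, person in records.items():
    print(ID, person.age, person.salary, person.suburb, person.language)
```

With these assumed limits, P1 loses its age, P2 its salary and P3 its suburb, while every other field is left alone.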

If you are starting from a list of lines in a file, you can open the file and read it into such a dictionary like this (the first part is borrowed from the other answer):

class PersonData(object):
  def __init__(self, age, salary, suburb, language):
    self.age = age
    self.salary = salary
    self.suburb = suburb
    self.language = language

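file = open("people_data.txt", "r+")  # "rwt" is not a valid mode; "r+" reads and writes
dict = {}
header = file.readline()  # keep the header so it is not parsed as a record
for row in file:
  try:
    ID, age, salary, suburb, language = row.strip().split(",")
  except ValueError:  # too few or too many fields
    print("Invalid data: " + row)
    continue
  try:
    age = int(age)
  except ValueError:
    pass  # leave non-numeric ages as text so the check can reject them
  try:
    salary = float(salary)
  except ValueError:
    pass  # same for salaries that are not numbers
  dict[ID] = PersonData(age, salary, suburb, language)

After the checks, the file can then be overwritten with the new data:

file.seek(0)  # go back to the beginning of the file
file.write(header)
for ID, person in dict.items():
  fields = [ID, person.age, person.salary, person.suburb, person.language]
  file.write(",".join(str(field) for field in fields) + "\n")
file.truncate()  # drop leftover text if the new content is shorter than the old
file.close()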
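
If overwriting the input file in place feels risky, one alternative is to write the cleaned rows to a separate output using the standard csv module. This is only a sketch under the same assumptions as before: the function names clean_csv and to_number, the max_salary value and the suburb set are all invented for the example. It uses in-memory text streams so it can be run without a file; real code would pass file objects opened with newline="". Note that the csv writer emits None as an empty field rather than the text None.

```python
import csv
import io


def to_number(text, convert):
    """Return convert(text), or None if the text is missing or not numeric."""
    try:
        return convert(text)
    except (TypeError, ValueError):
        return None


def clean_csv(source, target, max_salary, valid_suburbs):
    """Copy CSV rows from source to target, blanking out invalid values."""
    reader = csv.DictReader(source)
    writer = csv.DictWriter(target, fieldnames=reader.fieldnames,
                            extrasaction="ignore", lineterminator="\n")
    writer.writeheader()
    for row in reader:
        age = to_number(row["age"], int)
        row["age"] = age if age is not None and age >= 0 else None
        salary = to_number(row["salary"], float)
        in_range = salary is not None and 0 <= salary <= max_salary
        row["salary"] = salary if in_range else None
        if row["suburb"] not in valid_suburbs:
            row["suburb"] = None
        writer.writerow(row)


raw = (
    "ID,age,salary,suburb,language\n"
    "P1,eighty two,60196.0,Toorak,English\n"
    "P2,49,-16945514.0,St. Kilda,Chinese\n"
    "P3,54,49775.0,Neverland,Italian\n"
)
cleaned = io.StringIO()
clean_csv(io.StringIO(raw), cleaned, max_salary=1_000_000.0,
          valid_suburbs={"Toorak", "St. Kilda", "Redfern"})
print(cleaned.getvalue())
```

Keeping the original file untouched means a bug in the cleaning rules cannot destroy the source data.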