Skip an operation on a row if the value is non-numeric in a pandas DataFrame

Time: 2017-08-09 16:59:41

Tags: python pandas dataframe

I have a dataframe:

import pandas as pd
df = pd.DataFrame({'start' : [5, 10, '$%%', 20], 'stop' : [10, 20, 30, 40]})
df['length_of_region'] = pd.Series([0 for i in range(0, len(df['start']))])

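I want to compute the region length only for rows whose start value is a non-zero number, and skip rows with an incorrect value while recording an error note for them. Here is what I have so far: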

df['Notes'] = pd.Series(["" for i in range(0, len(df['region_name']))])

for i in range(0, len(df['start'])):
    if pd.isnull(df['start'][i]) == True:
        df['Notes'][i] += 'Error: Missing value for chromosome start at region %s, required value;' % (df['region_name'][i])
        df['critical_error'][i] = True
        num_error = num_error+1
    else:
        try:
            #print (df['start'][i]).isnumeric()
            start = int(df['start'][i])
            #print start
            #print df['start'][i]
            if start == 0:
                raise ValueError
        except:
            df['Notes'][i] += 'Error: Chromosome start should be a non zero number at region %s; ' % (df['region_name'][i])
            #print df['start'][i]
            df['critical_error'][i] = True
            num_error = num_error+1
for i in range(0, len(df['start'])):
    if df['critical_error'][i] == True:
        continue
    df['length_of_region'][i] = (df['stop'][i] - df['start'][i]) + 1.0

However, pandas treats the values in df['start'] as str, and even though I cast them with int, I get the following error:

df['length_of_region'][i] = (df['stop'][i] - df['start'][i]) + 1.0
  

TypeError: unsupported operand type(s) for -: 'numpy.int64' and 'str'

What am I missing here? Thanks for your time!

2 Answers:

Answer 0: (score: 1)

You can define a custom function that does the calculation and then apply that function to each row.

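def calculate_region_length(x):
    # x is one row holding the 'start' and 'stop' values
    start_val = x['start']
    stop_val = x['stop']
    try:
        # float() raises ValueError for strings such as '$%%'
        # and TypeError for None
        start_val = float(start_val)
        return (stop_val - start_val) + 1.0
    except (ValueError, TypeError):
        return None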

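The custom function takes one row (a list-like of the start and stop values) as input. It tests whether the start value can be converted to a float and returns None if it cannot. This way, a '1' stored as a string can still be converted to a float and is not skipped, while '$%%' from your example cannot be converted and returns None.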
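To make the conversion rule concrete, here is a minimal sketch of how float() treats the kinds of values discussed above. The helper name to_float_or_none and the sample values are illustrative assumptions, not part of the original answer.

```python
def to_float_or_none(value):
    """Return value as a float, or None when it cannot be converted."""
    try:
        return float(value)
    except (ValueError, TypeError):
        # ValueError: text such as '$%%'; TypeError: None
        return None

# An int, a numeric string, a non-numeric string, and a missing value
converted = [to_float_or_none(v) for v in [5, '1', '$%%', None]]
print(converted)  # [5.0, 1.0, None, None]
```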

Next, call the custom function on each row:

df['length_of_region'] = df[['start', 'stop']].apply(calculate_region_length, axis=1)

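This creates a new column that holds (stop - start) + 1.0 where start can be converted to a number, and a missing value (None) where start is a string that cannot be converted.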
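Put together with the sample data from the question, the approach might look like the sketch below. It looks the row values up by column label and also catches TypeError so that a None start is skipped. Both choices are assumptions made here for robustness rather than something the answer states.

```python
import pandas as pd

df = pd.DataFrame({'start': [5, 10, '$%%', 20], 'stop': [10, 20, 30, 40]})

def calculate_region_length(row):
    """Return (stop - start) + 1.0, or None when start is not numeric."""
    try:
        return (row['stop'] - float(row['start'])) + 1.0
    except (ValueError, TypeError):
        return None

# axis=1 passes one row at a time to the function
df['length_of_region'] = df[['start', 'stop']].apply(calculate_region_length, axis=1)
lengths = df['length_of_region'].tolist()
print(df)
```

The row with '$%%' ends up with a missing value in length_of_region, while the other rows hold 6.0, 11.0 and 21.0.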

You can then update the Notes field for the rows that returned None, to identify the regions whose start value is missing or invalid:

df.loc[df['length_of_region'].isnull(), 'Notes'] = df['region_name']
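As an alternative to the row-wise function (this is a different technique, not the one the answer uses), the same result can be sketched with pd.to_numeric and errors='coerce', which turns unparseable values into NaN. The region_name values and the wording of the note are hypothetical. The extra zero row is added only to show the non-zero check that the question asks for.

```python
import pandas as pd

df = pd.DataFrame({
    'region_name': ['r1', 'r2', 'r3', 'r4'],  # hypothetical region names
    'start': [5, 0, '$%%', 20],               # 0 and '$%%' are both invalid
    'stop': [10, 20, 30, 40],
})

# Non-numeric entries become NaN instead of raising an error
start = pd.to_numeric(df['start'], errors='coerce')
valid = start.notna() & (start != 0)

# Compute the length everywhere, then blank out the invalid rows
df['length_of_region'] = (df['stop'] - start + 1.0).where(valid)

df['Notes'] = ''
df.loc[~valid, 'Notes'] = 'Error: invalid start at region ' + df['region_name']
print(df)
```

This avoids the explicit loop and the critical_error flag, since the boolean mask valid plays the same role.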

Answer 1: (score: 0)

After staring at the code for a while, I found a simple and elegant fix: assign the converted start back to df['start'][i] inside the try-except, as shown below:

for i in range(0, len(df['start'])):
    if pd.isnull(df['start'][i]) == True:
        df['Notes'][i] += 'Error: Missing value for chromosome start at region %s, required value;' % (df['region_name'][i])
        df['critical_error'][i] = True
        num_error = num_error+1
    else:
        try:
            start = int(df['start'][i])
            df['start'][i] = start
            if start == 0:
                raise ValueError
        except:
            df['Notes'][i] += 'Error: Chromosome start should be a non zero number at region %s; ' % (df['region_name'][i])
            #print df['start'][i]
            df['critical_error'][i] = True
            num_error = num_error+1
for i in range(0, len(df['start'])):
    if df['critical_error'][i] == True:
        continue
    df['length_of_region'][i] = (df['stop'][i] - df['start'][i]) + 1.0

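Assigning the converted start value back into the dataframe stores it as an int, so that length_of_region can be computed for the numeric rows only.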
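The reason the write-back matters can be sketched as follows: a column that mixes numbers and text has dtype object, and int(...) returns a new value without changing what is stored in the dataframe. The value '7' below is assumed sample data standing in for a number that was read in as text. The sketch uses .loc for the assignment, which is a choice made here to avoid chained indexing and not something the answer uses.

```python
import pandas as pd

# '7' stands in for a numeric value that arrived as text (assumed data)
df = pd.DataFrame({'start': [5, 10, '7', 20], 'stop': [10, 20, 30, 40]})
column_dtype = str(df['start'].dtype)  # 'object', because the column is mixed

# Converting a copy does not touch the stored value
start = int(df.loc[2, 'start'])
still_text = isinstance(df.loc[2, 'start'], str)  # True

# Writing the converted value back replaces the str in the dataframe
df.loc[2, 'start'] = start
now_text = isinstance(df.loc[2, 'start'], str)  # False

length = (df.loc[2, 'stop'] - df.loc[2, 'start']) + 1.0  # 24.0
print(column_dtype, still_text, now_text, length)
```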