I have a DataFrame:
import pandas as pd
df = pd.DataFrame({'start' : [5, 10, '$%%', 20], 'stop' : [10, 20, 30, 40]})
df['length_of_region'] = pd.Series([0 for i in range(0, len(df['start']))])
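I want to calculate the region length only for rows whose values are non-zero numbers, and skip rows with invalid values while recording an error note for them. Here is what I have so far: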
df['Notes'] = pd.Series(["" for i in range(0, len(df['region_name']))])
for i in range(0, len(df['start'])):
    if pd.isnull(df['start'][i]) == True:
        df['Notes'][i] += 'Error: Missing value for chromosome start at region %s, required value;' % (df['region_name'][i])
        df['critical_error'][i] = True
        num_error = num_error+1
    else:
        try:
            #print (df['start'][i]).isnumeric()
            start = int(df['start'][i])
            #print start
            #print df['start'][i]
            if start == 0:
                raise ValueError
        except:
            df['Notes'][i] += 'Error: Chromosome start should be a non zero number at region %s; ' % (df['region_name'][i])
            #print df['start'][i]
            df['critical_error'][i] = True
            num_error = num_error+1
for i in range(0, len(df['start'])):
    if df['critical_error'][i] == True:
        continue
    df['length_of_region'][i] = (df['stop'][i] - df['start'][i]) + 1.0
However, pandas stores the values in df['start'] as str, and even though I cast them with int, I still get the following error:
df['length_of_region'][i] = (df['stop'][i] - df['start'][i]) + 1.0
TypeError: unsupported operand type(s) for -: 'numpy.int64' and 'str'
What am I missing here? Thanks for your time!
Answer 0: (score: 1)
You can define a custom function to do the calculation and then apply that function to each row.
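def calculate_region_length(x):
    # x is one row holding the 'start' and 'stop' values
    start_val = x['start']
    stop_val = x['stop']
    try:
        start_val = float(start_val)
        return (stop_val - start_val) + 1.0
    except (ValueError, TypeError):
        # start could not be converted to a number
        return None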
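The custom function takes one row (the start and stop values) as input. It tests whether the start value can be converted to a float. If it cannot, it returns None. This way, a '1' stored as a string can still be converted to a float and is not skipped, whereas the '$%%' in your example cannot be converted and returns None.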
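As a quick check of that behaviour, here is a minimal sketch that calls the function on two hand-built rows. The function is repeated so the snippet runs on its own, and the two rows are made-up examples rather than data from the question.

```python
import pandas as pd

def calculate_region_length(x):
    # Same idea as above: try to treat start as a number, otherwise give up.
    try:
        return (x['stop'] - float(x['start'])) + 1.0
    except (ValueError, TypeError):
        return None

# Hypothetical rows, for illustration only.
numeric_string_row = pd.Series({'start': '1', 'stop': 10})
garbage_row = pd.Series({'start': '$%%', 'stop': 30})

length_ok = calculate_region_length(numeric_string_row)  # '1' converts to 1.0
length_bad = calculate_region_length(garbage_row)        # float('$%%') raises ValueError
print(length_ok, length_bad)
```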
Next, call the custom function on every row:
df['length_of_region'] = df[['start', 'stop']].apply(calculate_region_length, axis=1)
This creates a new column holding (stop - start) + 1.0 wherever start can be converted to a number, and None wherever start is a string that cannot be converted to a number.
You can then update the Notes field for the rows that returned None, to identify the regions whose start value is missing or invalid:
df.loc[df['length_of_region'].isnull(), 'Notes'] = df['region_name']
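For what it's worth, the same result can probably also be reached without a row-wise function. The sketch below swaps in a different technique, pd.to_numeric with errors='coerce', instead of the apply approach above. The region_name values and the wording of the note are invented here, since the question does not show that column.

```python
import pandas as pd

df = pd.DataFrame({
    'region_name': ['r1', 'r2', 'r3', 'r4'],  # assumed values, not in the question
    'start': [5, 10, '$%%', 20],
    'stop': [10, 20, 30, 40],
})

# Unparseable entries such as '$%%' become NaN instead of raising.
start_num = pd.to_numeric(df['start'], errors='coerce')

# NaN propagates through the arithmetic, so bad rows get a NaN length.
df['length_of_region'] = (df['stop'] - start_num) + 1.0

# Flag the rows whose start value could not be used.
df['Notes'] = ''
bad = df['length_of_region'].isnull()
df.loc[bad, 'Notes'] = 'Error: invalid start at region ' + df.loc[bad, 'region_name']

print(df)
```

Note that this sketch does not treat a start of 0 as an error; if that rule is needed, a condition such as start_num == 0 would have to be added to the mask.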
Answer 1: (score: 0)
After staring at the code for a while, I found a simple and elegant fix: assign the converted start back to df['start'][i] inside the try-except, as follows:
for i in range(0, len(df['start'])):
    if pd.isnull(df['start'][i]) == True:
        df['Notes'][i] += 'Error: Missing value for chromosome start at region %s, required value;' % (df['region_name'][i])
        df['critical_error'][i] = True
        num_error = num_error+1
    else:
        try:
            start = int(df['start'][i])
            df['start'][i] = start
            if start == 0:
                raise ValueError
        except:
            df['Notes'][i] += 'Error: Chromosome start should be a non zero number at region %s; ' % (df['region_name'][i])
            #print df['start'][i]
            df['critical_error'][i] = True
            num_error = num_error+1
for i in range(0, len(df['start'])):
    if df['critical_error'][i] == True:
        continue
    df['length_of_region'][i] = (df['stop'][i] - df['start'][i]) + 1.0
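Writing the converted value back stores start in the column as an int, which makes it possible to compute length_of_region for the numeric rows only.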
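To illustrate why the write-back matters, here is a small sketch of my own. It assumes the start values arrive as strings (for example, read from a text file) and uses df.loc rather than the chained indexing above. int(...) only returns a new value, so the column keeps the original str until the converted value is stored.

```python
import pandas as pd

# Assumption: the start values arrive as strings, e.g. read from a text file.
df = pd.DataFrame({'start': ['5', '10', '$%%', '20'], 'stop': [10, 20, 30, 40]})

start = int(df.loc[0, 'start'])         # local variable only
type_before = type(df.loc[0, 'start'])  # still str: the column was not modified

# Make sure the column can hold mixed Python objects before writing an int.
df['start'] = df['start'].astype(object)
df.loc[0, 'start'] = start              # write the converted value back
type_after = type(df.loc[0, 'start'])

# The subtraction now sees a number instead of a str.
length = (df.loc[0, 'stop'] - df.loc[0, 'start']) + 1.0
print(type_before, type_after, length)
```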