Question

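I am writing a script to perform LLoD analysis of qPCR assays for my lab. I import the relevant columns from the instrument's .csv data using pandas.read_csv() with the usecols parameter, list the unique values of the RNA quantity/concentration column, and then need to determine the detection rate/hit rate at each given concentration. If the target was detected, the result is a number; if not, it is listed as "TND" or "Undetermined" or some other non-numeric string (depending on the instrument). So I wrote a function that (should) take a quantity and the dataframe of results, and return the probability of detection for that quantity. However, when I run the script, I get the following error: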

Traceback (most recent call last):
  File "C:\Python\llod_custom.py", line 34, in <module>
    prop[idx] = hitrate(val, data)
  File "C:\Python\llod_custom.py", line 29, in hitrate
    df = pd.to_numeric(list[:,1], errors='coerce').isna()
  File "C:\Users\wmacturk\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\core\frame.py", line 3024, in __getitem__
    indexer = self.columns.get_loc(key)
  File "C:\Users\wmacturk\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\core\indexes\base.py", line 3080, in get_loc
    return self._engine.get_loc(casted_key)
  File "pandas\_libs\index.pyx", line 70, in pandas._libs.index.IndexEngine.get_loc
  File "pandas\_libs\index.pyx", line 75, in pandas._libs.index.IndexEngine.get_loc
TypeError: '(slice(None, None, None), 1)' is an invalid key

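The idea in the line that throws the error (df = pd.to_numeric(list[:,1], errors='coerce').isna()) is to change any non-numeric values in the column to NaN, then get a boolean array telling me whether the entry for a given row is NaN, so I can later use df.sum() to count the number of numeric entries. I'm sure this should be obvious to anyone who works with pandas/dataframes, but I haven't used dataframes in Python before, so I'm at a loss. I'm also more familiar with C and JavaScript, so something less rigid like Python can actually be a bit confusing precisely because it is so flexible. Any help would be greatly appreciated.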
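To illustrate the coerce-then-isna idea described above in isolation, here is a minimal sketch on a standalone Series. The values are made up for illustration and are not instrument data. Note that `isna().sum()` counts the non-numeric (not detected) entries, so the number of detections is the total minus that sum.

```python
import pandas as pd

# Hypothetical result column: numbers mean "detected", strings mean "not detected"
results = pd.Series([31.2, "TND", 28.7, "Undetermined"])

# Non-numeric strings become NaN instead of raising an error
numeric = pd.to_numeric(results, errors="coerce")

# Boolean Series: True where the entry was NOT a number
not_detected = numeric.isna()

n_missed = int(not_detected.sum())               # count of NaN entries
n_detected = int(not_detected.size - n_missed)   # count of numeric entries
print(n_missed, n_detected)
```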

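Note: the conc column will contain 5 to 10 different values, each repeated 5-10 times (i.e. 5-10 replicates at each of 5-10 concentrations); the detect column will contain either a number or a string in each row. A number indicates success and a string indicates failure. For my purposes the value of the number is irrelevant; I only need to know whether the target was detected in a given replicate. My script (so far) is as follows: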

import os
import pandas as pd
import numpy as np
import statsmodels as sm
from scipy.stats import norm
from tkinter import filedialog
from tkinter import *

# initialize tkinter
root = Tk()
root.withdraw()


# prompt for data file and column headers, then read those columns into a dataframe
print("In the directory prompt, select the .csv file containing data for analysis")
path = filedialog.askopenfilename()

conc = input("Enter the column header for concentration/number of copies: ")
detect = input("Enter the column header for target detection: ")
tnd = input("Enter the value listed when a target is not detected (e.g. \"TND\", \"Undetected\", etc.): ")

data = pd.read_csv(path, usecols=[conc, detect])

# create list of unique values for quantity of RNA, initialize vectors of same length
# to store probabilies and probit scores for each
qtys = data[conc].unique()
prop = probit = [0] * len(qtys)

# Function to get the hitrate/probability of detection for a given quantity
def hitrate(qty, dataFrame):
    list = dataFrame[dataFrame.iloc[:,0] == qty]
    df = pd.to_numeric(list[:,1], errors='coerce').isna()
    return (len(df) - (len(df)-df.sum()))/len(df)

# iterate over quantities to calculate the corresponding probability of Detection
# and its associate probit score
for idx, val in enumerate(qtys):
    prop[idx] = hitrate(val, data)
    probit[idx] = norm.ppf(hitrate(val, data))

# create an array of the quantities with their associated probabilities & Probit scores
hitTable = vstack([qtys,prop,probit])

An example dataframe can be created with:

d = {'qty':[1,1,1,1,1, 10,10,10,10,10, 20,20,20,20,20, 50,50,50,50,50, 100,100,100,100,100], 'result':['TND','TND','TND',5,'TND', 'TND',5,'TND',5,'TND', 5,'TND',5,'TND',5, 5,6,5,5,'TND', 5,5,5,5,5]}
exData = pd.DataFrame(data=d)

Then just use exData as the dataframe data in the original code.

Edit: I solved this by slightly tweaking Loic RW's answer. The function hitrate should be

def hitrate(qty, df):
    t_s = df[df.qty == qty].result
    t_s = t_s.apply(pd.to_numeric, args=('coerce',)).isna()
    return (len(t_s)-t_s.sum())/len(t_s)
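
If hit rates are needed for every quantity at once, one possible alternative to calling hitrate in a loop is to coerce the whole result column and then group by quantity. This is only a sketch on the example dataframe from the question, not something proposed in the original thread; the names `detected` and `hit_rates` are illustrative.

```python
import pandas as pd

# Example dataframe from the question
d = {'qty': [1,1,1,1,1, 10,10,10,10,10, 20,20,20,20,20,
             50,50,50,50,50, 100,100,100,100,100],
     'result': ['TND','TND','TND',5,'TND', 'TND',5,'TND',5,'TND',
                5,'TND',5,'TND',5, 5,6,5,5,'TND', 5,5,5,5,5]}
exData = pd.DataFrame(data=d)

# True where the result parsed as a number (target detected)
detected = pd.to_numeric(exData['result'], errors='coerce').notna()

# The mean of a boolean Series is the fraction of True values,
# so grouping by quantity gives one hit rate per concentration
hit_rates = detected.groupby(exData['qty']).mean()
print(hit_rates)
```

The result is a Series indexed by quantity. Be aware that a hit rate of exactly 0 or 1 maps to an infinite probit score under norm.ppf, so those rows may need separate handling.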

Answer 1

Does the following achieve what you want? I made some assumptions about the structure of your data.

def hitrate(qty, df):
    target_subset = df[df.qty == qty].target
    target_subset = target_subset.apply(pd.to_numeric, args=('coerce',)).isna()
    return 1-((target_subset.sum())/len(target_subset))

If I run the following:

data = pd.DataFrame({'qty': [1,2,2,2,3],
                     'target': [.5, .8, 'TND', 'Undetermined', .99]})
hitrate(2, data)

I get: 0.33333333333333337
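
As for the TypeError itself: indexing a DataFrame with `[:, 1]` passes the tuple `(slice(None), 1)` to `DataFrame.__getitem__`, which looks up column labels rather than positions, hence the "invalid key" message. Positional selection goes through `.iloc`. Below is a minimal sketch of a position-based variant, assuming the quantity is the first column and the result is the second; the name `hitrate_by_position` is illustrative and not from the original script.

```python
import pandas as pd

def hitrate_by_position(qty, frame):
    """Fraction of rows at the given quantity whose result is numeric.

    Assumes column 0 holds the quantity and column 1 holds the result.
    """
    # Keep only the rows for this quantity (boolean mask on the first column)
    subset = frame[frame.iloc[:, 0] == qty]
    # .iloc[:, 1] selects the second column by position; strings become NaN
    is_missing = pd.to_numeric(subset.iloc[:, 1], errors='coerce').isna()
    # The mean of the boolean mask is the miss rate, so 1 - mean is the hit rate
    return 1 - is_missing.mean()

data = pd.DataFrame({'qty': [1, 2, 2, 2, 3],
                     'target': [.5, .8, 'TND', 'Undetermined', .99]})
print(hitrate_by_position(2, data))
```

Because it selects by position, this variant does not depend on the column headers, which may help when the header names come from user input as in the question's script.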

Pandas TypeError when trying to count NaNs in a subset of a dataframe column
