I need to add a new column to a pandas DataFrame based on conditions.
Name C2Mean C1Mean
a 2 0
b 4 2
c 6 2.5
These are the conditions:
if C1Mean = 0; log2FC = log2([C2Mean=2])
if C1Mean > 0; log2FC = log2([C2Mean=4]/[C1Mean=2])
Based on these conditions, I want to add a new column 'log2FC' like this:
Name C2Mean C1Mean log2FC
a 2 0 1
b 4 2 1
c 6 2.5 1.2630344058
The code I tried:
import pandas as pd
import numpy as np
import os
def induced_genes(rsem_exp_data):
    pwd = os.getcwd()
    data = pd.read_csv(rsem_exp_data,header=0,sep="\t")
    data['log2FC'] = [np.log2(data['C2Mean']/data['C1Mean'])\
                      if data['C2Mean'] > 0] else np.log2(data['C2Mean'])]
    print(data.head(5))
induced_genes('induced.genes')
Answer 0 (score: 2)
You can use the following code:
df = pd.DataFrame({"Name":["a", "b", "c"], "C2Mean":[2,4,6], "C1Mean":[0, 2, 2.5]})
df.head()
Name C2Mean C1Mean
a 2 0.0
b 4 2.0
c 6 2.5
df["log2FC"] = df.apply(lambda x: np.log2(x["C2Mean"]/x["C1Mean"]) if x["C1Mean"]> 0 else np.log2(x["C2Mean"]), axis=1)
df.head()
Name C2Mean C1Mean log2FC
a 2 0.0 1.000000
b 4 2.0 1.000000
c 6 2.5 1.263034
axis=1 means the function is applied to each row.
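To make the role of axis concrete, here is a small sketch on the same three-row frame (the names row_sums and col_sums are only for illustration). With axis=1 the lambda receives one row at a time as a Series keyed by column name; with axis=0 it receives one column at a time.

```python
import pandas as pd

df = pd.DataFrame({"C2Mean": [2, 4, 6], "C1Mean": [0, 2, 2.5]})

# axis=1: the function is called once per row; x is a Series keyed by column name
row_sums = df.apply(lambda x: x["C2Mean"] + x["C1Mean"], axis=1)

# axis=0 (the default): the function is called once per column
col_sums = df.apply(lambda col: col.sum(), axis=0)

print(row_sums.tolist())   # one value per row
print(col_sums.to_dict())  # one value per column
```

This is why the lambda in the answer can index x["C1Mean"] and x["C2Mean"] directly: each x is a single row.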
Answer 1 (score: 2)
This should work, and it is faster than apply:
import pandas as pd
import numpy as np
df = pd.DataFrame({"Name":["a", "b", "c"], "C2Mean":[2,4,6], "C1Mean":[0, 2, 2.5]})
df["log2FC"] = np.where(df["C1Mean"]==0,
                        np.log2(df["C2Mean"]),
                        np.log2(df["C2Mean"]/df["C1Mean"]))
UPDATE: timings
N = 10000
df = pd.DataFrame({"C2Mean":np.random.randint(0,10,N),
                   "C1Mean":np.random.randint(0,10,N)})
%%timeit -n10
a = np.where(df["C1Mean"]==0,
             np.log2(df["C2Mean"]),
             np.log2(df["C2Mean"]/df["C1Mean"]))
1.06 ms ± 112 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%%timeit -n10
b = df.apply(lambda x: np.log2(x["C2Mean"]/x["C1Mean"]) if x["C1Mean"]> 0
             else np.log2(x["C2Mean"]), axis=1)
248 ms ± 5.2 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
That is roughly 234 times faster (248 ms / 1.06 ms).
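The %%timeit cells only run in IPython/Jupyter. A rough equivalent for a plain script is sketched below, assuming the standard-library timeit module; the helper names with_where and with_apply and the fixed seed are made up for this example. It also checks that the two approaches agree. Unlike the benchmark data, C2Mean is drawn from 1..9 here so that log2(0) never appears in the comparison.

```python
import timeit

import numpy as np
import pandas as pd

N = 10000
rng = np.random.default_rng(0)  # fixed seed, chosen arbitrarily
df = pd.DataFrame({"C2Mean": rng.integers(1, 10, N),   # 1..9, assumption: no zeros
                   "C1Mean": rng.integers(0, 10, N)})  # 0..9, zeros allowed

def with_where(frame):
    # np.where computes both branches first; keep any divide warnings local and silent
    with np.errstate(divide="ignore", invalid="ignore"):
        return np.where(frame["C1Mean"] == 0,
                        np.log2(frame["C2Mean"]),
                        np.log2(frame["C2Mean"] / frame["C1Mean"]))

def with_apply(frame):
    # Row-by-row version: the division only runs when C1Mean > 0
    return frame.apply(lambda x: np.log2(x["C2Mean"] / x["C1Mean"]) if x["C1Mean"] > 0
                       else np.log2(x["C2Mean"]), axis=1)

# Both approaches should produce the same values
same = np.allclose(with_where(df), with_apply(df))

# Average seconds per call; fewer repetitions for the slower apply version
t_where = timeit.timeit(lambda: with_where(df), number=10) / 10
t_apply = timeit.timeit(lambda: with_apply(df), number=3) / 3
print(same, t_where, t_apply)
```

Absolute timings depend on the machine and the pandas version, so only the ratio between the two is worth comparing.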
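UPDATE 2: suppressing the runtime warning
Because np.where computes both branches for every row before selecting, zero values can trigger a NumPy RuntimeWarning (such as a divide-by-zero warning). To silence it, add this at the beginning: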
import warnings
warnings.filterwarnings("ignore", category=RuntimeWarning)
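Filtering every RuntimeWarning for the whole program can also hide unrelated problems. Two narrower options are sketched here as suggestions, not as part of the original answer: scope the suppression with np.errstate, or avoid the zero denominator altogether by swapping 0 for 1, since log2(x / 1) equals log2(x). The names scoped and denominator are only for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Name": ["a", "b", "c"], "C2Mean": [2, 4, 6], "C1Mean": [0, 2, 2.5]})

# Option 1: silence floating-point warnings only inside this block
with np.errstate(divide="ignore", invalid="ignore"):
    scoped = np.where(df["C1Mean"] == 0,
                      np.log2(df["C2Mean"]),
                      np.log2(df["C2Mean"] / df["C1Mean"]))

# Option 2: use 1 wherever C1Mean is 0, so no division by zero happens at all
denominator = df["C1Mean"].mask(df["C1Mean"] == 0, 1)
df["log2FC"] = np.log2(df["C2Mean"] / denominator)

print(df)
```

Option 2 relies on the rule from the question that a zero C1Mean means log2FC = log2(C2Mean); if that rule changes, the replacement value has to change with it.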