I have a data frame that looks like this:
Head CHR Start End Trans Num
A 1 29554 30039 ENST473358 1
A 1 30564 30667 ENST473358 2
A 1 30976 31097 ENST473358 3
B 1 36091 35267 ENST417324 1
B 1 35491 34544 ENST417324 2
B 1 35184 35711 ENST417324 3
B 1 36083 35235 ENST461467 1
B 1 35491 120765 ENST461467 2
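I need to change the Start and End columns based on the Trans and Num columns. The Trans column contains repeated values, and the Num column numbers the rows within each Trans. For every row, I want to set Start to End + 10, and End to the Start of the next row (with the same Trans) - 10. So my goal is to get output like the following: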
Head CHR Start End Trans Num
A 1 30564 30667 ENST473358 1
A 1 30976 31097 ENST473358 2
A 1 30267 NA ENST473358 3
B 1 35277 35481 ENST417324 1
B 1 34554 35174 ENST417324 2
B 1 35721 NA ENST417324 3
B 1 35245 35481 ENST461467 1
B 1 120775 NA ENST461467 2
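I would really appreciate any help. I can do this without taking Trans into account using the following script, but I cannot get the desired output.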
start = df['Start'].copy()
df['Start'] = df.End + 10
df['End'] = ((start.shift(-1) - 10))
df.iloc[-1, df.columns.get_loc('Start')] = ''
df.iloc[-1, df.columns.get_loc('End')] = ''
print (df)
Answer 0 (score: 2)
You may want to consider re-indexing your data, depending on how you want to use it.
You can index the frame on the columns "Trans" and "Num" like this:
#Change how we index the frame
df.set_index(["Trans", "Num"], inplace=True)
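To make the effect of that two-level index concrete, here is a minimal sketch on a small made-up frame (the names `demo` and `t1_rows` and the values are hypothetical, not from the question). It shows that selecting on the first level returns one transcript's rows indexed by Num, and that a (Trans, Num) tuple addresses a single row.

```python
import pandas as pd

# Small hypothetical frame, only to show the index structure.
demo = pd.DataFrame({
    "Trans": ["T1", "T1", "T2"],
    "Num": [1, 2, 1],
    "Start": [100, 200, 300],
    "End": [150, 250, 350],
})
demo = demo.set_index(["Trans", "Num"])

# Selecting on the first level returns that transcript's rows, indexed by Num.
t1_rows = demo.loc["T1"]
print(t1_rows)

# A (Trans, Num) tuple addresses a single row; adding a column label gives one cell.
print(demo.loc[("T1", 2), "End"])
```

This is the access pattern the loop further down relies on.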
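Next, we grab each unique index value so that we can update them all. (I'm fairly sure this part and the iteration below can be done in bulk, but I wrote this quickly. If efficiency is a concern, consider how to avoid looping over every index.)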
#Get only unique indexes
unique_trans = list(set(df.index.get_level_values('Trans')))
Then we can iterate and apply the transformation you want.
# Access each index
for trans in unique_trans:
    # Get the highest number in "Num" for each group so we know which row to set to NaN
    max_num = max(df.loc[trans].index.values)
    # Copy the Start column as a temp variable
    start = df.loc[trans]["Start"].copy()
    # Apply the transform to the Start column (equal to End + 10)
    df.loc[trans, "Start"] = np.array(df.loc[trans]["End"]) + 10
    # Apply the transform to the End column
    df.loc[trans, "End"] = np.array(start.shift(-1) - 10)
    # By passing a tuple as the row index, we get the element that is both in trans and has the max number,
    # which is the one you want to set to NaN
    df.loc[(trans, max_num), "End"] = np.nan
print(df)
The result I get when running this on your data is:
Head Chr Start End
Trans Num
ENST473358 1 A 1 30049.0 30554.0
2 A 1 30677.0 30966.0
3 A 1 31107.0 NaN
ENST417324 1 B 1 35277.0 35481.0
2 B 1 34554.0 35174.0
3 B 1 35721.0 NaN
ENST461467 1 B 1 35245.0 35481.0
2 B 1 120775.0 NaN
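The answer mentions that the per-transcript loop could probably be done in bulk. One possible sketch of that idea uses `groupby(...).shift(-1)` to fetch the next row's Start within each Trans. It assumes the rows are already ordered by Num inside each Trans, as they are in the sample data; the names `next_start` and `result` are only illustrative.

```python
import pandas as pd

# Same sample data as in the question.
df = pd.DataFrame({
    "Head": ["A", "A", "A", "B", "B", "B", "B", "B"],
    "CHR": [1] * 8,
    "Start": [29554, 30564, 30976, 36091, 35491, 35184, 36083, 35491],
    "End": [30039, 30667, 31097, 35267, 34544, 35711, 35235, 120765],
    "Trans": ["ENST473358"] * 3 + ["ENST417324"] * 3 + ["ENST461467"] * 2,
    "Num": [1, 2, 3, 1, 2, 3, 1, 2],
})

# Start of the next row within the same Trans; NaN on each group's last row.
next_start = df.groupby("Trans", sort=False)["Start"].shift(-1)

# Both new columns are computed from the original values, so no temp copy is needed.
result = df.assign(Start=df["End"] + 10, End=next_start - 10)
print(result)
```

If the rows might not be ordered, sorting with `df.sort_values(["Trans", "Num"])` first would be a reasonable precaution. The End column comes out as float because it contains NaN.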
The full code I used to generate the test case is:
import pandas as pd
import numpy as np
# Setup your dataframe
df = pd.DataFrame(columns=["Head", "Chr", "Start", "End", "Trans", "Num"])
df["Head"] = ["A", "A", "A", "B", "B", "B", "B", "B"]
df["Chr"] = [1]*8
df["Start"] = [29554, 30564, 30976, 36091, 35491, 35184, 36083, 35491]
df["End"] = [30039, 30667, 31097, 35267, 34544, 35711, 35235, 120765]
df["Trans"] = ["ENST473358", "ENST473358", "ENST473358",
"ENST417324", "ENST417324", "ENST417324",
"ENST461467","ENST461467"]
df["Num"] = [1, 2, 3, 1, 2, 3, 1, 2]
# Change how we index the frame
df.set_index(["Trans", "Num"], inplace=True)
# Get only unique indexes
unique_trans = list(set(df.index.get_level_values('Trans')))
# Access each index
for trans in unique_trans:
    max_num = max(df.loc[trans].index.values)
    start = df.loc[trans]["Start"].copy()
    df.loc[trans, "Start"] = np.array(df.loc[trans]["End"]) + 10
    df.loc[trans, "End"] = np.array(start.shift(-1) - 10)
    df.loc[(trans, max_num), "End"] = np.nan
print(df)
Answer 1 (score: 1)
You can put your existing code into a function, then group by Trans and apply the function:
def func(df):
    start = df['Start'].copy()
    df['Start'] = df.End + 10
    df['End'] = ((start.shift(-1) - 10))
    df.iloc[-1, df.columns.get_loc('Start')] = ''
    df.iloc[-1, df.columns.get_loc('End')] = ''
    return df

df.groupby('Trans').apply(func)
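The function above uses `iloc[-1]` to blank the last row of each group with an empty string, which mixes strings into numeric columns. As a hedged alternative, the last row of every Trans can be flagged for the whole frame at once with `cumcount(ascending=False)` and blanked with NaN through `mask`. This is a sketch and not part of the original answer; `is_last` and `new_start` are illustrative names, and only the columns needed for the demonstration are included.

```python
import pandas as pd

df = pd.DataFrame({
    "Trans": ["ENST473358"] * 3 + ["ENST417324"] * 3 + ["ENST461467"] * 2,
    "Num": [1, 2, 3, 1, 2, 3, 1, 2],
    "End": [30039, 30667, 31097, 35267, 34544, 35711, 35235, 120765],
})

# cumcount(ascending=False) numbers rows from the end of each group,
# so 0 marks the last row of every Trans.
is_last = df.groupby("Trans", sort=False).cumcount(ascending=False) == 0

# mask() replaces the flagged values with NaN and upcasts to float as needed.
new_start = (df["End"] + 10).mask(is_last)
print(new_start)
```

Whether Start should be blanked on the last row at all depends on the output you want: the function above blanks both columns, while the desired output in the question only shows NA in End.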