Change columns in a DataFrame relative to values in other columns

Time: 2016-05-24 14:56:16

Tags: python numpy pandas dataframe

I have a data frame that looks like this:

Head    CHR Start   End Trans   Num 
A   1   29554   30039   ENST473358  1 
A   1   30564   30667   ENST473358  2 
A   1   30976   31097   ENST473358  3 
B   1   36091   35267   ENST417324  1 
B   1   35491   34544   ENST417324  2 
B   1   35184   35711   ENST417324  3 
B   1   36083   35235   ENST461467  1 
B   1   35491   120765  ENST461467  2

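I need to change the Start and End columns based on the Trans and Num columns. The Trans column contains repeated values, and Num numbers the rows within each Trans. For every row, I want the new Start to be End + 10 and the new End to be the Start of the next row (with the same Trans) - 10. So my goal is to get an output like the following: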

 Head  CHR   Start  End       Trans    Num 
    A   1   30049   30554   ENST473358  1
    A   1   30677   30966   ENST473358  2
    A   1   31107   NA      ENST473358  3
    B   1   35277   35481   ENST417324  1
    B   1   34554   35174   ENST417324  2
    B   1   35721   NA      ENST417324  3
    B   1   35245   35481   ENST461467  1
    B   1   120775  NA      ENST461467  2

I would really appreciate any help. I can do this with the following script when I ignore Trans, but I cannot get the desired output.

start = df['Start'].copy()
df['Start'] = df.End + 10
df['End'] = ((start.shift(-1) - 10))
df.iloc[-1, df.columns.get_loc('Start')] = ''
df.iloc[-1, df.columns.get_loc('End')] = ''
print (df)
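To make the problem with the script above concrete, here is a minimal sketch on the first four rows of the sample data (the frame is rebuilt here only for illustration). An ungrouped shift(-1) takes the Start of the row below even when that row belongs to a different Trans.

```python
import pandas as pd

# First four rows of the sample data: three rows of one transcript,
# then the first row of the next one.
df = pd.DataFrame({
    "Start": [29554, 30564, 30976, 36091],
    "End": [30039, 30667, 31097, 35267],
    "Trans": ["ENST473358"] * 3 + ["ENST417324"],
    "Num": [1, 2, 3, 1],
})

# Ungrouped shift: every row looks at the row below it, whatever its Trans.
naive_end = df["Start"].shift(-1) - 10

# Row 2 is the last row of ENST473358, yet it receives 36091 - 10,
# a value taken from the first row of ENST417324.
print(pd.DataFrame({"Trans": df["Trans"], "naive_end": naive_end}))
```

The last row of ENST473358 should end up with a missing End, so the shift has to be restricted to rows that share the same Trans.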

2 Answers:

Answer 0 (score: 2)

You may want to consider re-indexing your data based on how you intend to use it.

You can index by the "Trans" and "Num" columns like this:

#Change how we index the frame
df.set_index(["Trans", "Num"], inplace=True)
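As a rough illustration of what this two-level index gives you, the sketch below uses the last five rows of the sample data. A single label selects all rows of one Trans, and a (Trans, Num) tuple selects one row. The variable names are only for illustration.

```python
import pandas as pd

# Last five rows of the sample data (two transcripts).
df = pd.DataFrame({
    "Start": [36091, 35491, 35184, 36083, 35491],
    "End": [35267, 34544, 35711, 35235, 120765],
    "Trans": ["ENST417324"] * 3 + ["ENST461467"] * 2,
    "Num": [1, 2, 3, 1, 2],
})
df = df.set_index(["Trans", "Num"])

# A single label selects on the first level (Trans) and returns that
# transcript's rows, indexed by the remaining level (Num).
one_trans = df.loc["ENST417324"]

# A tuple selects on both levels at once and pins down a single cell.
one_cell = df.loc[("ENST461467", 2), "End"]

print(one_trans)
print(one_cell)
```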

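Next, we grab each unique Trans value from the index so that we can update the groups one at a time. (I'm fairly sure this part and the iteration below could be done in bulk, but I wrote it quickly. If efficiency is a concern, consider how to avoid looping over every index value.)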

#Get only unique indexes
unique_trans = list(set(df.index.get_level_values('Trans')))
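One small aside: a plain set does not promise any particular order. That does not matter for the loop here, since each Trans is updated independently. If you do want the values in order of first appearance, Index.unique() is one option. The standalone index below is an assumption standing in for df.index.get_level_values('Trans').

```python
import pandas as pd

# Trans values in the order they appear in the sample data.
trans = pd.Index(
    ["ENST473358"] * 3 + ["ENST417324"] * 3 + ["ENST461467"] * 2,
    name="Trans",
)

# Index.unique() keeps first-appearance order, unlike a plain set.
unique_trans = list(trans.unique())
print(unique_trans)
```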

Then we can iterate over them and apply the change you want.

# Access each index
for trans in unique_trans:

    # Get the highest number in "Num" for each so we know which to set to NaN
    # (.ix was removed in pandas 1.0; .loc does the same label lookup here)
    max_num = max(df.loc[trans].index.values)

    # Copy your start column as a temp variable
    start = df.loc[trans]["Start"].copy()

    # Apply the transform to the start column (equal to end + 10)
    df.loc[trans, "Start"] = np.array(df.loc[trans]["End"]) + 10

    # Apply the transform to the end column
    df.loc[trans, "End"] = np.array(start.shift(-1) - 10)

    # By passing a tuple as a row index, we get the element that is both in trans and the max number, 
    #which is the one you want to set to NaN
    df.loc[(trans, max_num), "End"] = np.nan

print(df)

The result I get when running this on your data is:

                Head  Chr     Start      End
Trans      Num                             
ENST473358 1      A    1   30049.0  30554.0
           2      A    1   30677.0  30966.0
           3      A    1   31107.0      NaN
ENST417324 1      B    1   35277.0  35481.0
           2      B    1   34554.0  35174.0
           3      B    1   35721.0      NaN
ENST461467 1      B    1   35245.0  35481.0
           2      B    1  120775.0      NaN

The complete code I used to generate the test case is:

import pandas as pd
import numpy as np
# Setup your dataframe
df = pd.DataFrame(columns=["Head", "Chr", "Start", "End", "Trans", "Num"])
df["Head"] = ["A", "A", "A", "B", "B", "B", "B", "B"]
df["Chr"] = [1]*8
df["Start"] = [29554, 30564, 30976, 36091, 35491, 35184, 36083, 35491]
df["End"] = [30039, 30667, 31097, 35267, 34544, 35711, 35235, 120765]
df["Trans"] = ["ENST473358", "ENST473358", "ENST473358",
               "ENST417324", "ENST417324", "ENST417324",
               "ENST461467","ENST461467"]
df["Num"] = [1, 2, 3, 1, 2, 3, 1, 2]

# Change how we index the frame
df.set_index(["Trans", "Num"], inplace=True)

# Get only unique indexes
unique_trans = list(set(df.index.get_level_values('Trans')))

# Access each index
for trans in unique_trans:
    # .ix was removed in pandas 1.0, so use .loc for the label lookups
    max_num = max(df.loc[trans].index.values)

    start = df.loc[trans]["Start"].copy()
    df.loc[trans, "Start"] = np.array(df.loc[trans]["End"]) + 10
    df.loc[trans, "End"] = np.array(start.shift(-1) - 10)
    df.loc[(trans, max_num), "End"] = np.nan

print(df)
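The remark above that the loop could probably be done in bulk can be sketched with a grouped shift. This is one possible approach, not necessarily what the answerer had in mind. It assumes the rows are already ordered by Num within each Trans, as they are in the sample data, and it keeps the default integer index rather than the (Trans, Num) index.

```python
import pandas as pd

df = pd.DataFrame({
    "Head": ["A", "A", "A", "B", "B", "B", "B", "B"],
    "Chr": [1] * 8,
    "Start": [29554, 30564, 30976, 36091, 35491, 35184, 36083, 35491],
    "End": [30039, 30667, 31097, 35267, 34544, 35711, 35235, 120765],
    "Trans": ["ENST473358"] * 3 + ["ENST417324"] * 3 + ["ENST461467"] * 2,
    "Num": [1, 2, 3, 1, 2, 3, 1, 2],
})

# Next row's original Start within the same Trans.
# The last row of each group has no next row, so it gets NaN.
next_start = df.groupby("Trans")["Start"].shift(-1)

# Build both new columns from the original values before overwriting anything.
new_start = df["End"] + 10
new_end = next_start - 10
df["Start"] = new_start
df["End"] = new_end

print(df)
```

Because the shift never crosses a Trans boundary, no separate step is needed to blank out the last End of each group.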

Answer 1 (score: 1)

You can put your existing code in a function, then group by Trans and apply the function:

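def func(group):
    # Work on a copy so the frame handed in by groupby is not modified
    group = group.copy()
    start = group['Start'].copy()
    # New Start is this row's End + 10
    group['Start'] = group['End'] + 10
    # New End is the next row's original Start - 10; the last row of each
    # group has no next row, so shift(-1) leaves NaN there
    group['End'] = start.shift(-1) - 10
    return group

# group_keys=False keeps the original row index on the result
df = df.groupby('Trans', group_keys=False).apply(func)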
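To make the split, apply, and combine steps explicit, the sketch below runs a per-group function by iterating over the groupby object and concatenating the pieces. The helper name shift_within_group is made up for illustration, and the frame holds only two of the sample transcripts. The helper leaves the last End of each group as NaN, which is what shift(-1) produces.

```python
import pandas as pd

df = pd.DataFrame({
    "Start": [29554, 30564, 30976, 36083, 35491],
    "End": [30039, 30667, 31097, 35235, 120765],
    "Trans": ["ENST473358"] * 3 + ["ENST461467"] * 2,
    "Num": [1, 2, 3, 1, 2],
})

def shift_within_group(group):
    # Hypothetical helper: same arithmetic as the question's script,
    # applied to the rows of a single Trans value.
    group = group.copy()
    start = group["Start"].copy()
    group["Start"] = group["End"] + 10
    group["End"] = start.shift(-1) - 10  # last row of the group becomes NaN
    return group

# Iterating a groupby yields (key, sub-frame) pairs; sort=False keeps the
# groups in order of first appearance.
pieces = [shift_within_group(g) for _, g in df.groupby("Trans", sort=False)]
result = pd.concat(pieces)

print(result)
```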