Question

I have a data frame with genes:

    pName     genotype  feture
    person_1    TT    feature_1 
    person_1    TY    feature_2 
    person_1    YY    feature_3
    person_1    TY    feature_4 
    person_2    TT    feature_1 
    person_2    TT    feature_2 
    person_2    YY    feature_3 
    person_2    YY    feature_4

I have some diseases. Most of them are based on a single genotype, for example:

IF feature 1 == YY interpretation =  RED
IF feature 1 == TY interpretation =  BLUE
IF feature 1 == TT interpretation =  Green

I wrote pandas code for this:

data.loc[(data['feture'] == 'feature_1') & (data['genotype'] == 'YY'),'interpretation'] = "RED"
data.loc[(data['feture'] == 'feature_1') & (data['genotype'] == 'TY'),'interpretation'] = "BLUE"
data.loc[(data['feture'] == 'feature_1') & (data['genotype'] == 'TT'),'interpretation'] = "Green"
etc. (3 x 10 features)
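The thirty `data.loc[...]` assignments above could also be expressed as one lookup table plus a merge. The following is only a sketch under assumptions: the `rules` table is a hypothetical helper that is not part of the question, and it contains just the three feature_1 rules the question spells out.

```python
import pandas as pd

# Sample data from the question (the column name 'feture' is kept as written there).
data = pd.DataFrame({
    'pName': ['person_1'] * 4 + ['person_2'] * 4,
    'genotype': ['TT', 'TY', 'YY', 'TY', 'TT', 'TT', 'YY', 'YY'],
    'feture': ['feature_1', 'feature_2', 'feature_3', 'feature_4'] * 2,
})

# Hypothetical rule table: one row per (feature, genotype) pair.
# Only the feature_1 rules are stated in the question.
rules = pd.DataFrame({
    'feture': ['feature_1'] * 3,
    'genotype': ['YY', 'TY', 'TT'],
    'interpretation': ['RED', 'BLUE', 'Green'],
})

# A left merge keeps every row of data; pairs without a rule get NaN.
data = data.merge(rules, on=['feture', 'genotype'], how='left')
```

Adding a rule then means adding a row to `rules` rather than another `data.loc[...]` line.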

So I get:

     pName     genotype  feture  interpretation
    person_1    TT    feature_1  Green
    person_1    TY    feature_2  ...
    person_1    YY    feature_3  
    person_1    TY    feature_4 
    person_2    TT    feature_1  Green
    person_2    TT    feature_2  ...
    person_2    YY    feature_3 
    person_2    YY    feature_4

But I have a problem with features that depend on two genes. For example:

IF feature_3 == YY interpretation =  RED
IF feature_4 == TT interpretation =  BLUE

But additionally:

(IF feature_3 == YY) & (IF feature_4 == TT) interpretation =  R/B

As you can see, I need to add a new row for each person who has both feature_3 and feature_4.

The final DataFrame should look like this:

     pName     genotype  feture  interpretation
    person_1    TT    feature_1  Green
    person_1    TY    feature_2  ...
    person_1    YY    feature_3  RED
    person_1    TY    feature_4  BLUE
    person_1    YYTY  new_feature_34     R/W    #new feature based on two others
    person_2    TT    feature_1  Green
    person_2    TT    feature_2  ...
    person_2    YY    feature_3  BLUE
    person_2    YY    feature_4  BLUE
    person_2    YYYY  new_feature_34     W/W    #new feature based on two others

So if:

(IF feature_3 == YY) & (IF feature_4 == TY)

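I add a new row containing the person, both genotypes, the new feature name, and the interpretation, as in the example.

I don't know how to do this in pandas. I tried to find a solution but could not find one.

I solved my problem in pure Python: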

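1) Create a list of persons.

2) Iterate over the df and check both features for each person.

3) Add a new row to the data frame: person + CAT(genotype1, genotype2) + newFeatureXY + interpretation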
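The three steps above can be sketched roughly as follows. This is an illustrative reconstruction, not the original code: the variable names are assumptions, and the `YY`/`TY` condition and the `R/W` label are taken from the example table above.

```python
import pandas as pd

# Sample data from the question (the column name 'feture' is kept as written there).
data = pd.DataFrame({
    'pName': ['person_1'] * 4 + ['person_2'] * 4,
    'genotype': ['TT', 'TY', 'YY', 'TY', 'TT', 'TT', 'YY', 'YY'],
    'feture': ['feature_1', 'feature_2', 'feature_3', 'feature_4'] * 2,
})

new_rows = []
for person in data['pName'].unique():            # step 1: list of persons
    rows = data[data['pName'] == person]         # step 2: rows of this person
    g3 = rows.loc[rows['feture'] == 'feature_3', 'genotype']
    g4 = rows.loc[rows['feture'] == 'feature_4', 'genotype']
    if (not g3.empty and not g4.empty
            and g3.iloc[0] == 'YY' and g4.iloc[0] == 'TY'):
        new_rows.append({                        # step 3: build the new row
            'pName': person,
            'genotype': g3.iloc[0] + g4.iloc[0],
            'feture': 'new_feature_34',
            'interpretation': 'R/W',
        })

result = pd.concat([data, pd.DataFrame(new_rows)], ignore_index=True)
```

Each pass through the loop filters the whole frame again, so the cost grows with the number of persons times the number of rows.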

But with more than 1000 persons this is too slow. Is it possible to do this in pandas?

Answer 1

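You can use groupby and apply here to build the new rows and then append them to the dataframe. But since the function that builds a new row is not trivial, I will declare it explicitly:

import pandas as pd

def feat34(x):
    # x holds all rows of one person
    y = (x['feture'] == 'feature_3') & (x['genotype'] == 'YY')
    z = (x['feture'] == 'feature_4') & (x['genotype'] == 'TY')
    if y.any() and z.any():
        # columns are named explicitly so the result does not depend on
        # whether the grouping column is passed to the function
        return pd.DataFrame([['YYTY', 'new_feature_34', 'R/B']],
                            columns=['genotype', 'feture', 'interpretation'])
    else:
        return None

Applied to the sample data:

# DataFrame.append was removed in pandas 2.0, so pd.concat is used instead
new_rows = data.groupby('pName').apply(feat34).reset_index(level=0)
data = pd.concat([data, new_rows]).sort_values('pName')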
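As a possible alternative that avoids calling a Python function once per person, the same rule can be written as a self-merge. This is a sketch under assumptions, not part of the answer above: the names `f3`, `f4`, `pairs` and `hit` are illustrative, and it assumes each person has at most one feature_3 row and one feature_4 row.

```python
import pandas as pd

# Sample data from the question (the column name 'feture' is kept as written there).
data = pd.DataFrame({
    'pName': ['person_1'] * 4 + ['person_2'] * 4,
    'genotype': ['TT', 'TY', 'YY', 'TY', 'TT', 'TT', 'YY', 'YY'],
    'feture': ['feature_1', 'feature_2', 'feature_3', 'feature_4'] * 2,
})

# One row per person for each of the two features involved.
f3 = data.loc[data['feture'] == 'feature_3', ['pName', 'genotype']]
f4 = data.loc[data['feture'] == 'feature_4', ['pName', 'genotype']]

# The inner merge keeps only persons that have both features.
pairs = f3.merge(f4, on='pName', suffixes=('_3', '_4'))

# Same combination as in feat34: feature_3 == YY and feature_4 == TY.
hit = pairs[(pairs['genotype_3'] == 'YY') & (pairs['genotype_4'] == 'TY')]

new_rows = pd.DataFrame({
    'pName': hit['pName'],
    'genotype': hit['genotype_3'] + hit['genotype_4'],
    'feture': 'new_feature_34',
    'interpretation': 'R/B',
})

# A stable sort keeps each person's original rows ahead of the new row.
result = (pd.concat([data, new_rows], ignore_index=True)
            .sort_values('pName', kind='stable'))
```

Because the merge and the comparisons run on whole columns, this should scale better than a per-group function, though I have not timed it.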

Answer 2

You can generate a new column 'feature_genotype' and use groupby and apply, as suggested by Serge Ballesta:

import pandas as pd

n = 20_000
name = ['person_']*4*n
name = [p + str(i//4) for i, p in enumerate(name)]
df = pd.DataFrame({'pName': name,
                   'genotype': ['TT', 'TY', 'YY', 'TY']*n,
                   'feature': ['feature_1', 'feature_2', 'feature_3', 'feature_4']*n,
                   'interpretation': ['Green', '...', 'RED', 'BLUE']*n})

def fill_values(x, new):
    # collect one new row per person that matches both conditions
    v = x.feature_genotype.values
    if 'feature_3_YY' in v and 'feature_4_TY' in v:
        new.append({'pName': x.name,
                    'genotype': 'YYTY',
                    'feature': 'new_feature_34',
                    'interpretation': 'R/W'})

df['feature_genotype'] = df.feature + '_' + df.genotype
new = []
# %time is an IPython magic; drop it when running as a plain script
%time df.groupby('pName').apply(lambda x: fill_values(x, new))

Wall time: 1.19 s

So for a dataset of 80000 rows, the process takes 1.19 seconds. It is also important to drop duplicates, because apply sometimes processes the first group twice:

new = pd.DataFrame(new)
new = new.drop_duplicates()
# DataFrame.append was removed in pandas 2.0, so pd.concat is used instead
df = pd.concat([df, new]).drop('feature_genotype', axis=1).sort_values('pName')

But actually, I would suggest that it is more comfortable to work with a df that has one row for each unique pName and one column for each unique value of the other columns.
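The example that originally accompanied this suggestion is not preserved here. A minimal sketch of such a wide layout, assuming `DataFrame.pivot` is an acceptable way to build it (the names `wide` and `mask` are illustrative), could look like this:

```python
import pandas as pd

# Small sample with the same columns as the df above.
df = pd.DataFrame({
    'pName': ['person_1'] * 4 + ['person_2'] * 4,
    'genotype': ['TT', 'TY', 'YY', 'TY', 'TT', 'TT', 'YY', 'YY'],
    'feature': ['feature_1', 'feature_2', 'feature_3', 'feature_4'] * 2,
})

# One row per pName, one column per feature, genotypes as values.
wide = df.pivot(index='pName', columns='feature', values='genotype')

# The two-gene rule becomes a plain boolean mask over two columns.
mask = (wide['feature_3'] == 'YY') & (wide['feature_4'] == 'TY')

# Combined genotype where the rule matches, NaN elsewhere.
wide['new_feature_34'] = (wide['feature_3'] + wide['feature_4']).where(mask)
```

In this layout every rule that involves several features is a column-wise expression, so no rows have to be appended. Note that `pivot` raises an error if a person has the same feature more than once.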

Getting values from two rows as a new row in pandas
