Question

My DataFrame is as follows:

<div id="example1">
zoom_sensitivity "2"
sensitivity "99"
m_rawinput "0"
m_righthand "0"
</div>

<div id="example2">
sensitivity"99"m_rawinput"0"zoom_sensitivity"2"m_righthand"0"
</div>

<div id="example3">
sensitivity"99" m_rawinput "0"
m_righthand "0"
zoom_sensitivity"2"
</div>

The data:

 member_id  |   loan_amnt   |  Age   | Marital_status
 AK219      |    49539.09   |  34    |  Married
 AK314      |    1022454.00 |  37    |  NA
 BN204      |    75422.00   |  34    |  Single

I want to create an output file in the following format:

 Columns        | Null Values | Duplicate
 member_id      | N           | N
 loan_amnt      | N           | N
 Age            | N           | Y
 Marital Status | Y           | N

I know there is a Python package for this kind of profiling, but I want to build it in the above manner so that I can enhance the code for my dataset.

Answer 1

Use something like this:

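import pandas as pd
m = df.apply(lambda x: x.duplicated())  # True where a value repeats an earlier one in its column
n = df.isna()                           # True where a value is missing
# One row per column, flagged Y/N for "has any nulls" and "has any duplicates"
df_new = (pd.concat([pd.Series(n.any(), name='Null_Values'), pd.Series(m.any(), name='Duplicates')], axis=1)
            .replace({True: 'Y', False: 'N'}))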
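
To make this easier to try, here is a minimal self-contained sketch of the same idea, run against the three sample rows from the question. The helper name `profile_columns` and the use of `np.where` for the Y/N flags are my own choices for illustration, not part of the snippet above, and NA is assumed to mean a missing value (`np.nan`).

```python
import numpy as np
import pandas as pd

# Sample rows copied from the question; "NA" is assumed to be a missing value.
df = pd.DataFrame({
    "member_id": ["AK219", "AK314", "BN204"],
    "loan_amnt": [49539.09, 1022454.00, 75422.00],
    "Age": [34, 37, 34],
    "Marital_status": ["Married", np.nan, "Single"],
})


def profile_columns(frame):
    """Hypothetical helper: one row per column with Y/N flags."""
    # True for each column that contains at least one missing value.
    has_null = frame.isna().any()
    # True for each column in which some value repeats an earlier one.
    has_dup = frame.apply(lambda col: col.duplicated().any())
    summary = pd.DataFrame(
        {
            "Null_Values": np.where(has_null, "Y", "N"),
            "Duplicates": np.where(has_dup, "Y", "N"),
        },
        index=frame.columns,
    )
    return summary.rename_axis("Columns")


summary = profile_columns(df)
print(summary)
```

Calling `summary.to_csv(...)` with a path of your choice would write the result to a file. Note that `duplicated()` treats repeated missing values as duplicates of each other, so a column with two or more NaN entries would be flagged in both columns.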

Answer 2

Here is a Python one-liner:

O(NlogN)

Answer 3

Actually, Pandas_Profiling provides several options that let you determine whether duplicate values exist.

Data analysis using Python
