Question

df1:

  product   product_Id   Price
0 Mobile      G67129     4500
1 Earphone    H56438     8900
2 Heater      K12346     fgdht
3 Kitchen     566578     4500
4 4359        Gh1907     5674
5 plastic     G67129     Dfz67

df2:

  Column_Name   Expected_Dtype
0 product          String
1 product_Id       String
2 Price            int

I need to find the values in df1 that have the wrong data type, using the expected data type of each column as listed in df2.

Output:

   column_Name  Value   Exp_dtype  index
0  product      4359    String     4
1  product_Id   566578  String     3
2  Price        fgdht   int        2
3  Price        Dfz67   int        5

Answer 1

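Because the values in each column are of mixed types, every column ends up with object dtype, so the only approach I can think of is to use the .str accessor with regex patterns to pick out the wrongly typed values.

Here is my solution:

First, find the rows whose values have the wrong type: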

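import pandas as pd
# String columns: flag values that start with digits. Price: flag values that do not.
bad_product = df1['product'].loc[df1['product'].str.match(r'[0-9.]+')]
bad_product_ID = df1['product_Id'].loc[df1['product_Id'].str.match(r'[0-9.]+')]
bad_price = df1['Price'].loc[~df1['Price'].str.match(r'[0-9.]+')]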
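One caveat with the checks above: str.match only anchors the pattern at the start of each string, so a value such as "45abc" would still count as numeric. A minimal sketch of a stricter variant, assuming every value in df1 is stored as a string and using str.fullmatch instead (the NUMERIC constant and the rebuilt df1 are only there to make the example self-contained):

```python
import pandas as pd

# Sample data from the question; every value is stored as a string (object dtype).
df1 = pd.DataFrame({
    "product": ["Mobile", "Earphone", "Heater", "Kitchen", "4359", "plastic"],
    "product_Id": ["G67129", "H56438", "K12346", "566578", "Gh1907", "G67129"],
    "Price": ["4500", "8900", "fgdht", "4500", "5674", "Dfz67"],
})

NUMERIC = r"[0-9.]+"

# str.match only anchors at the start, so "45abc" counts as a match.
print(pd.Series(["45abc"]).str.match(NUMERIC).tolist())      # [True]
# str.fullmatch requires the whole string to match the pattern.
print(pd.Series(["45abc"]).str.fullmatch(NUMERIC).tolist())  # [False]

# The same three checks, written with fullmatch.
bad_product = df1.loc[df1["product"].str.fullmatch(NUMERIC), "product"]
bad_product_id = df1.loc[df1["product_Id"].str.fullmatch(NUMERIC), "product_Id"]
bad_price = df1.loc[~df1["Price"].str.fullmatch(NUMERIC), "Price"]

print(bad_product.to_dict())     # {4: '4359'}
print(bad_product_id.to_dict())  # {3: '566578'}
print(bad_price.to_dict())       # {2: 'fgdht', 5: 'Dfz67'}
```

On the question's data both variants flag the same four values; they only differ for strings that start with digits and then continue with other characters.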

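Then combine all the bad rows:

# Align the Series on the row index, then stack into one row per bad value (NaN cells dropped).
df3 = pd.concat([bad_product, bad_product_ID, bad_price], axis=1).sort_index().stack().dropna().reset_index()
df3.columns = ['index', 'Column_Name', 'value']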
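To see what this reshaping step does, here is a small sketch that uses hard-coded stand-ins for the three filtered Series (values copied from the question's data). The names wide and long are mine, and the explicit sort_index() and dropna() calls are there because, as far as I know, row order and NaN handling in concat and stack can differ between pandas versions:

```python
import pandas as pd

# Hypothetical stand-ins for the three filtered Series from the previous step.
bad_product = pd.Series({4: "4359"}, name="product")
bad_product_id = pd.Series({3: "566578"}, name="product_Id")
bad_price = pd.Series({2: "fgdht", 5: "Dfz67"}, name="Price")

# axis=1 aligns the Series on the row index; cells without a bad value are NaN.
wide = pd.concat([bad_product, bad_product_id, bad_price], axis=1).sort_index()
print(wide)

# stack() moves the column labels into an inner index level, giving one row
# per (row index, column name) pair; dropna() discards any leftover NaN cells.
long = wide.stack().dropna().reset_index()
long.columns = ["index", "Column_Name", "value"]
print(long)
```

The wide frame has one column per checked column and mostly NaN cells; the long frame keeps only the four offending values together with their row index and column name.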

Finally, merge with df2:

# Left-join on the shared Column_Name column to attach the expected dtype to each bad value.
result = pd.merge(df3, df2, how='left', on='Column_Name')

Result:

   index  Column_Name  value   Expected_Dtype
0  2      Price        fgdht   int
1  3      product_Id   566578  String
2  4      product      4359    String
3  5      Price        Dfz67   int
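The steps above hard-code one check per column. If the checks should be driven by df2 instead, something like the following might work. It is only a sketch: the VIOLATES rule table, the find_dtype_errors helper, and the digits-only pattern for int are my own assumptions, and it treats "String" as "not purely numeric text", in the same spirit as the regex checks above.

```python
import pandas as pd

df1 = pd.DataFrame({
    "product": ["Mobile", "Earphone", "Heater", "Kitchen", "4359", "plastic"],
    "product_Id": ["G67129", "H56438", "K12346", "566578", "Gh1907", "G67129"],
    "Price": ["4500", "8900", "fgdht", "4500", "5674", "Dfz67"],
})
df2 = pd.DataFrame({
    "Column_Name": ["product", "product_Id", "Price"],
    "Expected_Dtype": ["String", "String", "int"],
})

# Hypothetical rule table: for each expected dtype, a function that returns
# a boolean mask which is True where a value violates that dtype.
VIOLATES = {
    "String": lambda s: s.str.fullmatch(r"[0-9.]+"),  # purely numeric text
    "int": lambda s: ~s.str.fullmatch(r"[0-9]+"),     # anything but digits
}

def find_dtype_errors(data, spec):
    """Return one row per value in `data` that violates its dtype in `spec`."""
    frames = []
    for col, expected in zip(spec["Column_Name"], spec["Expected_Dtype"]):
        values = data[col].astype(str)  # compare everything as text
        bad = values[VIOLATES[expected](values)]
        # Turn the row labels into an "index" column next to the bad values.
        frame = bad.rename("Value").rename_axis("index").reset_index()
        frames.append(frame.assign(column_Name=col, Exp_dtype=expected))
    out = pd.concat(frames, ignore_index=True)
    return out[["column_Name", "Value", "Exp_dtype", "index"]]

result = find_dtype_errors(df1, df2)
print(result)
```

Because the loop follows the row order of df2, the rows come out in the same order as the output requested in the question. Supporting another expected dtype would mean adding one more entry to the rule table.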

When you don't know where to start, try breaking the problem down into small tasks. Hope this helps.

How do I find data type errors in a pandas DataFrame?
