How do I select rows from a DataFrame where part of the row index is in another DataFrame's index and meets a specific condition?

Time: 2017-03-03 10:56:50

Tags: pandas dataframe multi-index

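I have two DataFrames. df holds a large amount of data. test_df records whether certain tests have passed. I need to select only those rows of df whose tests did not fail, by looking that information up in test_df. So far I can reduce test_df to passed_tests. What remains is to select only the rows of df where the relevant part of the row index is in passed_tests. How can I do this?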

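Update

  • test_df does not have unique rows. If there are duplicate rows (and there may be more than one duplicate), the most positive test result takes precedence, i.e. True > Ok > False.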

My code:

import pandas as pd
import numpy as np


# test_df: two-level index, one row per test run (duplicates are possible)
index = [np.array(['foo', 'foo', 'foo', 'foo', 'qux', 'qux', 'qux']), np.array(['a', 'a', 'b', 'b', 'a', 'b', 'b'])]
data = np.array(['False', 'True', 'False', 'False', 'False', 'Ok', 'False'])
columns = ["Passed?"]
test_df = pd.DataFrame(data, index=index, columns=columns)
print(test_df)

# df: three-level index; the first two levels match the index of test_df
index = [np.array(['foo', 'foo', 'foo', 'foo', 'qux', 'qux', 'qux', 'qux']),
         np.array(['a', 'a', 'b', 'b', 'a', 'a', 'b', 'b']),
         np.array(['1', '2', '1', '2', '1', '2', '1', '2'])]
data = np.random.randn(8, 2)
columns = ["X", "Y"]
df = pd.DataFrame(data, index=index, columns=columns)
print(df)

# keep only the tests that did not fail
passed_tests = test_df.loc[test_df['Passed?'].isin(['True', 'Ok'])]
print(passed_tests)

df

                X         Y
foo a 1  0.589776 -0.234717
      2  0.105161  1.937174
    b 1 -0.092252  0.143451
      2  0.939052 -0.239052
qux a 1  0.757239  2.836032
      2 -0.445335  1.352374
    b 1  2.175553 -0.700816
      2  1.082709 -0.923095

test_df

     Passed?
foo a   False
    a    True
    b   False
    b   False
qux a   False
    b      Ok
    b   False

passed_tests

      Passed?
foo a    True
qux b      Ok

Required solution

                X         Y
foo a 1  0.589776 -0.234717
      2  0.105161  1.937174
qux b 1  2.175553 -0.700816
      2  1.082709 -0.923095

1 answer:

Answer 0 (score: 1)

You need reindex with method='ffill', then check the values with isin, and finally use boolean indexing:

print (test_df.reindex(df.index, method='ffill'))
        Passed?
foo a 1    True
      2    True
    b 1   False
      2   False
qux a 1   False
      2   False
    b 1      Ok
      2      Ok

mask = test_df.reindex(df.index, method='ffill').isin(['True', 'Ok'])['Passed?']
print (mask)
foo  a  1     True
        2     True
     b  1    False
        2    False
qux  a  1    False
        2    False
     b  1     True
        2     True
Name: Passed?, dtype: bool

print (df[mask])
                X         Y
foo a 1 -0.580448 -0.168951
      2 -0.875165  1.304745
qux b 1 -0.147014 -0.787483
      2  0.188989 -1.159533

Edit:

To remove the duplicates, it is easier here to use:

  • reset_index
  • sort_values - column Passed? descending, the first and second index levels ascending
  • drop_duplicates - keep only the first value
  • set_index to get the MultiIndex back
  • rename_axis to remove the index names

test_df = (test_df.reset_index()
                  .sort_values(['level_0','level_1', 'Passed?'], ascending=[True, True, False])
                  .drop_duplicates(['level_0','level_1'])
                  .set_index(['level_0','level_1'])
                  .rename_axis([None, None]))
print (test_df)
      Passed?
foo a    True
    b   False
qux a   False
    b      Ok

Another solution is simpler: sort by Passed? first, then use groupby with first.
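The sort-then-groupby idea could be sketched roughly as follows. This is a minimal sketch under stated assumptions, not necessarily the answerer's exact code: it rebuilds the test_df from the question, and it relies on the strings 'True', 'Ok' and 'False' happening to sort in the desired priority order when sorted in descending order. The name best is hypothetical.

```python
import numpy as np
import pandas as pd

# Rebuild the test_df from the question (duplicated index entries included).
index = [np.array(['foo', 'foo', 'foo', 'foo', 'qux', 'qux', 'qux']),
         np.array(['a', 'a', 'b', 'b', 'a', 'b', 'b'])]
test_df = pd.DataFrame(
    {'Passed?': ['False', 'True', 'False', 'False', 'False', 'Ok', 'False']},
    index=index)

# Assumption: descending string order matches the priority True > Ok > False.
# After sorting, the first row of every (level 0, level 1) group is the
# most positive result for that test.
best = (test_df.sort_values('Passed?', ascending=False)
               .groupby(level=[0, 1])
               .first())
print(best)
```

Because the ordering here depends on how the three strings compare lexicographically, it would stop working if the labels changed (for example to 'Pass' and 'Fail').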

EDIT1:

Convert the values to an ordered Categorical.
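One way the ordered Categorical idea might look end to end is sketched below. It is an illustration only: the category order, the deterministic stand-in data for df, and the names order, ranked, best, passed and selected are assumptions, and the final step matches the first two index levels of df with droplevel and isin rather than with reindex.

```python
import numpy as np
import pandas as pd

# Rebuild the test_df from the question.
index = [np.array(['foo', 'foo', 'foo', 'foo', 'qux', 'qux', 'qux']),
         np.array(['a', 'a', 'b', 'b', 'a', 'b', 'b'])]
test_df = pd.DataFrame(
    {'Passed?': ['False', 'True', 'False', 'False', 'False', 'Ok', 'False']},
    index=index)

# Stand-in for df with fixed values instead of random ones.
index3 = [np.array(['foo'] * 4 + ['qux'] * 4),
          np.array(['a', 'a', 'b', 'b', 'a', 'a', 'b', 'b']),
          np.array(['1', '2', '1', '2', '1', '2', '1', '2'])]
df = pd.DataFrame(np.arange(16).reshape(8, 2), index=index3, columns=['X', 'Y'])

# Explicit priority, worst to best, instead of relying on string order.
order = ['False', 'Ok', 'True']
test_df['Passed?'] = pd.Categorical(test_df['Passed?'],
                                    categories=order, ordered=True)

# Best result first, then keep the first row for each index entry.
ranked = test_df.sort_values('Passed?', ascending=False)
best = ranked[~ranked.index.duplicated(keep='first')].sort_index()

# Ordered categoricals support comparisons against one of their categories.
passed = best[best['Passed?'] >= 'Ok']

# Match the first two index levels of df against the passed tests.
mask = df.index.droplevel(2).isin(passed.index)
selected = df[mask]
print(selected)
```

Making the order explicit keeps the precedence rule (True > Ok > False) independent of how the labels happen to be spelled.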