How to handle singularities for missing-data levels in R GLM output

Asked: 2018-10-19 12:44:39

Tags: r correlation glm

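I have a dataset with multiple observations, each of which has external data attached from different sources. Some of these observations (accounts) have data from one source but are missing data from another. I made an example dataset (the data frame df shown further below) to explain.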


In this example, I am missing credit score data for 2 customers (accounts A and C), so those accounts get the level "Missing".

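Each credit score is a categorical variable with 3 levels: "High", "Low", and "Missing". In the GLM output I get singularities for every "Missing" level except one. Suppose I have 1000 observations and 100 of them lack data from this source: every variable I attach to the dataset from that source will then have the value "Missing" for the same 100 observations.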
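A minimal sketch that reproduces the behaviour with the example data from this question. The Gaussian family (the glm default) and the formula are assumptions made only for illustration; the question does not say which family or formula was used.

```r
# Example data from the question (only the columns needed here)
df <- data.frame(
  CreditScore  = factor(c("Missing", "High", "Missing", "Low", "Low", "Low", "High", "High")),
  CreditScore2 = factor(c("Missing", "Low", "Missing", "High", "Low", "High", "High", "Low")),
  CreditScore3 = factor(c("Missing", "Low", "Missing", "High", "High", "High", "Low", "High")),
  Losses       = c(100000, 100, 2500, 100000, 25000, 0, 7500, 5200)
)

# Assumption: default Gaussian family, chosen only to show the singularity
fit <- glm(Losses ~ CreditScore + CreditScore2 + CreditScore3, data = df)

# The "Missing" dummy columns are identical across the three factors,
# because the same accounts (A and C) are missing in every variable
X <- model.matrix(fit)
same_as_first <- c(
  all(X[, "CreditScoreMissing"] == X[, "CreditScore2Missing"]),
  all(X[, "CreditScoreMissing"] == X[, "CreditScore3Missing"])
)
print(same_as_first)

# Coefficients that could not be estimated are reported as NA:
# here CreditScore2Missing and CreditScore3Missing
print(coef(fit))
```

Because the three "Missing" indicator columns are exact copies of each other, only the first one can be estimated; summary(fit) reports the rest as not defined because of singularities.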

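I do not want to throw away the observations with missing data, and I do not necessarily think imputing the mean is the best idea either, because there is surely a lot of variance across accounts of different sizes.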

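My main question is: if my GLM output gives singularities only for the "Missing" level of the categorical variables (factors), is that a bad thing? Can I still trust the estimates for the other, non-missing levels?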

df <- data.frame(Account = c("A","B","C","D","E","F","G","H"),
       Exposure = c(1,50,67,85,250,25,22,89),
       CreditScore = c("Missing","High","Missing","Low","Low","Low","High","High"),
       CreditScore2 = c("Missing","Low","Missing","High","Low","High","High","Low"),
       CreditScore3 = c("Missing","Low","Missing","High","High","High","Low","High"),
       Losses = c(100000,100,2500,100000,25000,0,7500,5200),
       LossPerUnit = c(100000,100,2500,100000,25000,0,7500,5200)/c(1,50,67,85,250,25,22,89))



> df
  Account Exposure CreditScore CreditScore2 CreditScore3 Losses  LossPerUnit
1       A        1     Missing      Missing      Missing 100000 100000.00000
2       B       50        High          Low          Low    100      2.00000
3       C       67     Missing      Missing      Missing   2500     37.31343
4       D       85         Low         High         High 100000   1176.47059
5       E      250         Low          Low         High  25000    100.00000
6       F       25         Low         High         High      0      0.00000
7       G       22        High         High          Low   7500    340.90909
8       H       89        High          Low         High   5200     58.42697
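On the main question, one thing that can be checked directly is whether the estimable coefficients change when the redundant "Missing" columns are removed by hand. The sketch below does that for the example data; the Gaussian family, the formula, and the names X_reduced and fit_reduced are assumptions for illustration only. It does not settle whether coding missing values as their own level is statistically appropriate.

```r
# Example data from the question (only the columns needed here)
df <- data.frame(
  CreditScore  = factor(c("Missing", "High", "Missing", "Low", "Low", "Low", "High", "High")),
  CreditScore2 = factor(c("Missing", "Low", "Missing", "High", "Low", "High", "High", "Low")),
  CreditScore3 = factor(c("Missing", "Low", "Missing", "High", "High", "High", "Low", "High")),
  Losses       = c(100000, 100, 2500, 100000, 25000, 0, 7500, 5200)
)

# Assumption: default Gaussian family, chosen only for illustration
fit <- glm(Losses ~ CreditScore + CreditScore2 + CreditScore3, data = df)

# Keep only the design-matrix columns whose coefficients were estimated
keep <- !is.na(coef(fit))
X_reduced <- model.matrix(fit)[, keep]

# Refit on the reduced design matrix; "- 1" avoids adding a second intercept
fit_reduced <- glm(Losses ~ X_reduced - 1, data = df)

# Compare estimable coefficients and fitted values between the two fits
same_coef   <- all.equal(unname(coef(fit)[keep]), unname(coef(fit_reduced)))
same_fitted <- all.equal(unname(fitted(fit)), unname(fitted(fit_reduced)))
print(c(same_coef = isTRUE(same_coef), same_fitted = isTRUE(same_fitted)))
```

If both comparisons agree, the NA coefficients are only a sign that glm dropped duplicated columns, and the single remaining "Missing" coefficient stands for all three variables being missing at once.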

Thanks for any insight you can provide.

0 answers:

No answers