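I have a dataset with many observations, each of which has external data attached from several different sources. Some of these observations (accounts) have data from one source but are missing data from another. I made a sample dataset (the R data frame `df` below) to illustrate: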
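In this example, credit score data is missing for 2 customers (accounts A and C), so all of the credit score variables are at the "Missing" level for those accounts.
Each credit score variable is categorical with 3 levels: "High", "Low" and "Missing". In the model output I get singularities for every "Missing" level except one. Suppose I have 1000 observations and 100 of them have no data from this source; then every variable I attach from that source will have the value "Missing" for those same 100 observations.
I don't want to drop the observations with missing data, and I don't think imputing the mean is the best idea either, because the accounts differ a lot in size and so there is bound to be a lot of variation.
My main question is: is it a problem if my GLM output reports singularities only for the "Missing" level of the categorical variables (factors)? Can I still trust the estimates for the other, non-missing levels?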
df <- data.frame(Account      = c("A", "B", "C", "D", "E", "F", "G", "H"),
                 Exposure     = c(1, 50, 67, 85, 250, 25, 22, 89),
                 CreditScore  = c("Missing", "High", "Missing", "Low", "Low", "Low", "High", "High"),
                 CreditScore2 = c("Missing", "Low", "Missing", "High", "Low", "High", "High", "Low"),
                 CreditScore3 = c("Missing", "Low", "Missing", "High", "High", "High", "Low", "High"),
                 Losses       = c(100000, 100, 2500, 100000, 25000, 0, 7500, 5200),
                 LossPerUnit  = c(100000, 100, 2500, 100000, 25000, 0, 7500, 5200) /
                                c(1, 50, 67, 85, 250, 25, 22, 89))
> df
  Account Exposure CreditScore CreditScore2 CreditScore3 Losses  LossPerUnit
1       A        1     Missing      Missing      Missing 100000 100000.00000
2       B       50        High          Low          Low    100      2.00000
3       C       67     Missing      Missing      Missing   2500     37.31343
4       D       85         Low         High         High 100000   1176.47059
5       E      250         Low          Low         High  25000    100.00000
6       F       25         Low         High         High      0      0.00000
7       G       22        High         High          Low   7500    340.90909
8       H       89        High          Low         High   5200     58.42697
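To make the singularity concrete, here is a minimal sketch that reproduces it on the sample data. The model formula and the Gaussian family are assumptions for illustration only (the real model may use a different response, family or offset), and `fit_reduced` is a hypothetical comparison model that simply leaves out the redundant "Missing" columns.

```r
# Same sample data as above; factors are built explicitly so the level order
# (High, Low, Missing) does not depend on the R version.
lv <- c("High", "Low", "Missing")
df <- data.frame(
  Account      = c("A", "B", "C", "D", "E", "F", "G", "H"),
  Exposure     = c(1, 50, 67, 85, 250, 25, 22, 89),
  CreditScore  = factor(c("Missing", "High", "Missing", "Low", "Low", "Low", "High", "High"), levels = lv),
  CreditScore2 = factor(c("Missing", "Low", "Missing", "High", "Low", "High", "High", "Low"), levels = lv),
  CreditScore3 = factor(c("Missing", "Low", "Missing", "High", "High", "High", "Low", "High"), levels = lv),
  Losses       = c(100000, 100, 2500, 100000, 25000, 0, 7500, 5200)
)
df$LossPerUnit <- df$Losses / df$Exposure

# ASSUMPTION: formula chosen for illustration, not taken from the real model.
form <- LossPerUnit ~ CreditScore + CreditScore2 + CreditScore3

# The dummy columns for the three "Missing" levels flag exactly the same rows,
# so the design matrix contains the same column three times.
X <- model.matrix(form, data = df)
missing_cols <- X[, grepl("Missing$", colnames(X))]
same_missing <- all(missing_cols[, 1] == missing_cols[, 2]) &&
  all(missing_cols[, 1] == missing_cols[, 3])
print(same_missing)

# glm() (default family = gaussian) keeps the first of the duplicated columns
# and reports NA for the rest; these NAs are the reported singularities.
fit <- glm(form, data = df)
aliased <- names(coef(fit))[is.na(coef(fit))]
print(aliased)

# Hypothetical comparison: refit with the redundant "Missing" columns left out
# and check whether the remaining estimates change.
fit_reduced <- glm(LossPerUnit ~ CreditScore +
                     I(CreditScore2 == "Low") + I(CreditScore3 == "Low"),
                   data = df)
kept <- coef(fit)[!is.na(coef(fit))]
same_estimates <- isTRUE(all.equal(unname(kept), unname(coef(fit_reduced))))
print(same_estimates)
```

On this toy data the aliased terms are the "Missing" levels of `CreditScore2` and `CreditScore3`, and the remaining estimates match the reduced fit. The single `CreditScoreMissing` coefficient then stands for "no data from this source" as a whole, rather than for any one of the three variables.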
Thanks for any insight you can provide.