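I assume this is a simple problem, subsetting data based on the values of another column (here, the patient), but I can't find a solution. I need the subset of my data for every patient who has been hospitalized at least 4 times. In other words, only patients who have stayed in the hospital at least 4 times should appear in the new df, together with their 4 visit rows. My table looks like this: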
<table class="tg">
<tr>
<th class="tg-yw4l">Patient</th>
<th class="tg-yw4l"># Hospital Visits</th>
<th class="tg-yw4l">Duration</th>
</tr>
<tr>
<td class="tg-yw4l">Monica</td>
<td class="tg-yw4l">1</td>
<td class="tg-yw4l">10D</td>
</tr>
<tr>
<td class="tg-yw4l">Jack</td>
<td class="tg-yw4l">1</td>
<td class="tg-yw4l">5D</td>
</tr>
<tr>
<td class="tg-yw4l">Monica</td>
<td class="tg-yw4l">2</td>
<td class="tg-yw4l">3D</td>
</tr>
<tr>
<td class="tg-yw4l">Eric</td>
<td class="tg-yw4l">1</td>
<td class="tg-yw4l">2D</td>
</tr>
<tr>
<td class="tg-yw4l">Eric</td>
<td class="tg-yw4l">2</td>
<td class="tg-yw4l">3D</td>
</tr>
<tr>
<td class="tg-yw4l">Monica</td>
<td class="tg-yw4l">3</td>
<td class="tg-yw4l">4D</td>
</tr>
<tr>
<td class="tg-yw4l">Jack</td>
<td class="tg-yw4l">2</td>
<td class="tg-yw4l">4D</td>
</tr>
<tr>
<td class="tg-yw4l">Eric</td>
<td class="tg-yw4l">3</td>
<td class="tg-yw4l">8D</td>
</tr>
<tr>
<td class="tg-yw4l">Eric</td>
<td class="tg-yw4l">4</td>
<td class="tg-yw4l">9D</td>
</tr>
</table>
Thanks a lot!
Answer 0: (score: 1)
df1 <- readHTMLTable(doc)[[1]]
colnames( df1 ) <- gsub("# ", '', colnames( df1 ))
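# Convert via as.character() so a factor column yields the printed numbers, not its level codes
df1$`Hospital Visits` <- as.numeric( as.character( df1$`Hospital Visits` ) )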
df1
# Patient Hospital Visits Duration
# 1 Monica 1 10D
# 2 Jack 1 5D
# 3 Monica 2 3D
# 4 Eric 1 2D
# 5 Eric 2 3D
# 6 Monica 3 4D
# 7 Jack 2 4D
# 8 Eric 3 8D
# 9 Eric 4 9D
To get only the events at which a patient had reached at least 4 hospital visits:
with( df1, df1[ `Hospital Visits` >= 4, ] )
# Patient Hospital Visits Duration
# 9 Eric 4 9D
To get all events for the patients who visited the hospital at least 4 times:
do.call( 'rbind', lapply( split( df1, df1$Patient ),
function( x ) if( any(x$'Hospital Visits' >= 4 ) ) { x }) )
# Patient Hospital Visits Duration
# Eric.4 Eric 1 2D
# Eric.5 Eric 2 3D
# Eric.8 Eric 3 8D
# Eric.9 Eric 4 9D
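The same "all events" selection can also be sketched without `split`/`rbind`, using base R's `ave` to attach each patient's maximum visit number to every one of their rows. This is only an illustrative variant, not part of the original answer; the data frame below is rebuilt by hand from the question's table, and the name `max_visits` is an assumption made for this sketch.

```r
# Rebuild the question's table by hand (column names with the "# " prefix already removed)
df1 <- data.frame(
  Patient = c("Monica", "Jack", "Monica", "Eric", "Eric", "Monica", "Jack", "Eric", "Eric"),
  `Hospital Visits` = c(1, 1, 2, 1, 2, 3, 2, 3, 4),
  Duration = c("10D", "5D", "3D", "2D", "3D", "4D", "4D", "8D", "9D"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# For every row, the highest visit number recorded for that row's patient
max_visits <- ave(df1$`Hospital Visits`, df1$Patient, FUN = max)

# Keep all rows of patients whose highest visit number is at least 4
frequent <- df1[max_visits >= 4, ]
frequent
```

Unlike the `do.call`/`lapply` version, this keeps the original row names (4, 5, 8, 9) rather than prefixing them with the patient name.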
Data:
library(XML)
doc <- htmlParse('<table class="tg">
<tr>
<th class="tg-yw4l">Patient</th>
<th class="tg-yw4l"># Hospital Visits</th>
<th class="tg-yw4l">Duration</th>
</tr>
<tr>
<td class="tg-yw4l">Monica</td>
<td class="tg-yw4l">1</td>
<td class="tg-yw4l">10D</td>
</tr>
<tr>
<td class="tg-yw4l">Jack</td>
<td class="tg-yw4l">1</td>
<td class="tg-yw4l">5D</td>
</tr>
<tr>
<td class="tg-yw4l">Monica</td>
<td class="tg-yw4l">2</td>
<td class="tg-yw4l">3D</td>
</tr>
<tr>
<td class="tg-yw4l">Eric</td>
<td class="tg-yw4l">1</td>
<td class="tg-yw4l">2D</td>
</tr>
<tr>
<td class="tg-yw4l">Eric</td>
<td class="tg-yw4l">2</td>
<td class="tg-yw4l">3D</td>
</tr>
<tr>
<td class="tg-yw4l">Monica</td>
<td class="tg-yw4l">3</td>
<td class="tg-yw4l">4D</td>
</tr>
<tr>
<td class="tg-yw4l">Jack</td>
<td class="tg-yw4l">2</td>
<td class="tg-yw4l">4D</td>
</tr>
<tr>
<td class="tg-yw4l">Eric</td>
<td class="tg-yw4l">3</td>
<td class="tg-yw4l">8D</td>
</tr>
<tr>
<td class="tg-yw4l">Eric</td>
<td class="tg-yw4l">4</td>
<td class="tg-yw4l">9D</td>
</tr>
</table>')
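Answer 1: (score: 0)
There are countless ways to do this; here is a simple one, though not the most efficient...
Assuming you have this in a data frame, you can filter for the IDs (in this case, names) that have 4 or more visits, and then show all records for those names. I'm calling your original data frame my_df.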
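library(dplyr)
# Names of patients that have a row with 4 or more visits
# (the column names `name` and `hospital_visits` are assumed here)
who_to_include <- unique(subset(my_df, hospital_visits >= 4, select = name))
# Keep every record belonging to those patients
four_or_more <- inner_join(who_to_include, my_df, by = "name")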
Sorry, there was no example data to work from here, so I just wrote the code directly; it may not be 100% correct, or it might be.
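Since this answer had no data to run against, here is a minimal, self-contained sketch of the same idea using a grouped `filter` from dplyr. The data frame is retyped from the question's table, and the column names `name`, `hospital_visits` and `duration` are assumptions chosen to match this answer's wording, not the question's actual headers.

```r
library(dplyr)

# The question's table, retyped with the column names assumed in this answer
my_df <- data.frame(
  name = c("Monica", "Jack", "Monica", "Eric", "Eric", "Monica", "Jack", "Eric", "Eric"),
  hospital_visits = c(1, 1, 2, 1, 2, 3, 2, 3, 4),
  duration = c("10D", "5D", "3D", "2D", "3D", "4D", "4D", "8D", "9D"),
  stringsAsFactors = FALSE
)

# Within each patient, keep all rows if that patient's highest visit number is >= 4
four_or_more <- my_df %>%
  group_by(name) %>%
  filter(max(hospital_visits) >= 4) %>%
  ungroup()

four_or_more
```

On this data only Eric reaches a fourth visit, so the result should contain his four rows. The grouped filter avoids the intermediate lookup table and the join.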
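Answer 2: (score: 0)
Assuming you have this in a data frame and the column "Patient" uniquely identifies each patient (i.e. there are not several Erics), you can also subset it using base R only: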
# Logical vector marking the rows whose visit number is >= 4
frequentPatientRows <- patientsDf[, "# Hospital Visits"] >= 4
# Extract the patient names from those rows
frequentPatientNames <- patientsDf[frequentPatientRows, "Patient"]
# Select all entries for patients with those names
selectedPatients <- patientsDf[patientsDf[, "Patient"] %in% frequentPatientNames, ]
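The approach above relies on the running visit counter column. If that column were unavailable or unreliable, a possible base R variation is to count the rows per patient with `table` instead. The sketch below is an illustration under that assumption; `patientsDf` is rebuilt by hand from the question's table, and `visitCounts` is a name introduced only for this example.

```r
# The question's table, retyped by hand with its original headers
patientsDf <- data.frame(
  Patient = c("Monica", "Jack", "Monica", "Eric", "Eric", "Monica", "Jack", "Eric", "Eric"),
  `# Hospital Visits` = c(1, 1, 2, 1, 2, 3, 2, 3, 4),
  Duration = c("10D", "5D", "3D", "2D", "3D", "4D", "4D", "8D", "9D"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Number of rows (i.e. recorded stays) per patient
visitCounts <- table(patientsDf$Patient)

# Names of patients with at least 4 recorded stays
frequentPatientNames <- names(visitCounts)[as.vector(visitCounts) >= 4]

# All entries for those patients
selectedPatients <- patientsDf[patientsDf$Patient %in% frequentPatientNames, ]
selectedPatients
```

Both variants still assume that a name identifies exactly one patient; with several Erics a separate patient ID column would be needed.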