R中的一组间隔的子集数据

时间:2016-04-14 11:47:14

标签: r subset

我想根据一组间隔从vector中排除值。

示例数据:

<#
    .SYNOPSIS
    Test script for the ParameterizedPropertyAccessor object.
#>



<#
    .SYNOPSIS
    Create a new PSCustomObject which will contain a NoteProperty called Item accessed like a ParameterizedProperty.
#>
Function New-TestPPA
{
    # Instantiate our test object.
    $testPPA = New-Object -TypeName PSCustomObject;

    # Create a new instance of our PPA object, added to our test object, providing it Get and Set script blocks.
    # Note that currently the scripts are set up for Powershell 4 and above. If you are using a version of Powershell
    # previous to this, comment out the current Param() values, and uncomment the alternate Param() values.
    $ppa = New-ParameterizedPropertyAccessor -Parent $testPPA -Name Item -Get `
    {
        Param(
            <#
            [Parameter(Mandatory = $true, Position = 0)]
            [PSObject] $this,
            [Parameter(Mandatory = $true, Position = 1)]
            [string] $Key
            #>
            [Parameter(Mandatory = $true, Position = 0)]
            [string] $Key
        )
        $this._ht[$Key];
    } -Set {
        Param(
            <#
            [Parameter(Mandatory = $true, Position = 0)]
            [PSObject] $this,
            [Parameter(Mandatory = $true, Position = 1)]
            [string] $Key,
            [Parameter(Mandatory = $true, Position = 2)]
            [string] $Value
            #>
            [Parameter(Mandatory = $true, Position = 0)]
            [string] $Key,
            [Parameter(Mandatory = $true, Position = 1)]
            [string] $Value
        )
        $this._ht[$Key] = $Value;
    };

    # Add a HashTable <_ht> used as our backing store. Note that this could be any keyed collection type object.
    $testPPA | Add-Member -MemberType NoteProperty -Name _ht -Value @{} -PassThru;
}


[string] $scriptDir = Split-Path -Path $MyInvocation.MyCommand.Definition -Parent;
Import-Module $scriptDir\PSObjectWrappers.psm1;

# Create test object.
$testPPA = New-TestPPA;

# Note that "Item" property is actually a NoteProperty of type ParameterizedPropertyAccessor.
Write-Host "Type '`$testPPA | gm' to see Item NoteProperty.";

# Note that it is the ParameterizedPropertyAccessor object retrieved that has a ParameterizedProperty.
# Also note that Powershell has named this property "Item".
Write-Host "Type '`$testPPA.Item | gm' to see Item ParameterizedProperty";

# Step through what happens when we "set" the "parameterized" Item property.
# Note that this is actually retrieving the Item NoteProperty, and then setting its default accessor, which calls
# the 'Set' ScriptBlock.

Write-Host "";
Write-Host "Setting Name value";
Write-Host "... to 'Mark'."
$testPPA.Item["Name"] = "Mark";

# Step through what happens when we "get" the "parameterized" Item property.
# Note that this is actually retrieving the Item NoteProperty, and then retrieving its default accessor, which calls
# the 'Get' ScriptBlock.

Write-Host "";
Write-Host "Retrieving Name value:";
$temp = $testPPA.Item["Name"];
Write-Host $temp;

解决方案1:使用简单的子集() - 不合适 - mIntervals的长度可能非常大

解决方案2:使用嵌套for循环:

mydata <-  sort(runif(100,0,300))
mIntervals <- data.frame(start = c(2,50,97,159) , end = c(5,75, 120, 160))

此解决方案在R中耗时太长。

解决方案3:函数findIntervals

valid <- vector(length(mydata))
valid <- TRUE
for(i in 1:length(mydata){
 for(j in 1:length(mIntervals){
  if(mydata[i] > mIntervals[j,]$start & mydata[i] < mIntervals[j,]$end){
   valid[i] <- FALSE
  }
 }
} 
mydata[valid]

解决方案4:以某种方式使用包'间隔',但也没有合适的功能(可能是interval_overlap())

已经讨论了相似(但不完全相同)的问题here。但是有整数向量的解决方案,而不是连续变量。

我没有更多的想法。解决方案号3似乎是最好的,但我不喜欢它 - 它不健壮 - 你必须检查重叠间隔等。

对于这个非常简单的问题,有没有更好的解决方案? THX

真实数据:我有时会测量光强度(日期时间,强度)。我还有测量设备维护(开始,结束)的日期时间间隔。 现在我想清理数据=排除在维护期间测量的值(有效!)。

5 个答案:

答案 0 :(得分:8)

使用data.table的{​​{3}},我们可以尝试%anywhere%

library(data.table)
# %anywhere% returns TRUE if mydata is within any mIntervals, else FALSE
ans <- mydata[!mydata %anywhere% mIntervals] 

这将包括端点,但incbounds = TRUE是默认设置。如果需要排除端点,可以使用以下语法:

mydata[!anywhere(mydata, mIntervals[, 1], mIntervals[, 2], incbounds = FALSE)]

答案 1 :(得分:6)

如果重新安排间隔,可以使用cut功能,然后只取出奇数间隔:

NEWinterval <- c(2,5,50,75,97,120,159,160)
mydata[cut(mydata, NEWinterval,labels = F) %% 2 != 0]

答案 2 :(得分:3)

这是一个Rcpp实现:

library(Rcpp);
set.seed(12L);
mydata <- sort(runif(100L,0,300));
mIntervals <- data.frame(start=c(2,50,97,159),end=c(5,75,120,160));
cppFunction('
    LogicalVector inIntervals(DoubleVector v, DoubleVector starts, DoubleVector ends ) {
        if (starts.size()!=ends.size())
            throw new std::invalid_argument("starts and ends must be same length.");
        LogicalVector res(v.size(),false);
        for (int i = 0; i < v.size(); ++i) {
            double val = v[i];
            for (int j = 0; j < starts.size(); ++j)
                if (val>starts[j] && val<ends[j]) {
                    res(i) = true;
                    break;
                }
        }
        return res;
    }
');
mydata[!inIntervals(mydata,mIntervals$start,mIntervals$end)];
##  [1]   6.863323  10.168687  13.765236  16.585860  20.808275  28.508376  29.355912
##  [8]  30.534403  33.809681  37.152610  42.659676  45.787152  46.319152  47.274177
## [15]  47.877135  49.281417  78.640425  79.475513  80.383078  80.814563  88.273175
## [22]  93.344382  94.136411  94.736104  96.603457 126.327013 130.399146 131.800295
## [29] 131.828798 137.282145 148.542361 151.430386 162.212264 162.541752 165.648249
## [36] 166.758025 167.388096 172.243474 172.603380 176.544549 182.477693 189.979382
## [43] 192.404449 192.499610 199.703949 200.945789 202.035664 208.173427 210.533571
## [50] 212.949140 214.431451 215.524016 224.951507 225.608016 229.180120 230.324658
## [57] 232.415456 236.278594 236.350904 244.164168 244.218976 244.669498 245.332560
## [64] 247.184695 253.110672 253.267796 263.339092 263.352697 264.826916 267.979469
## [71] 282.326263 282.786520 285.996158 291.379637 293.290767 294.260683

答案 3 :(得分:2)

我不知道这会有多高效,但是......

vbetween <- Vectorize(dplyr::between, vectorize.args = c("left", "right"), SIMPLIFY=F)
mydata[!Reduce("|", vbetween(mydata, mIntervals$start, mIntervals$end))]

答案 4 :(得分:0)

我想展示另一种使用rolljoin的data.table包的方法。

首先,您融合并订购间隔数据框:

mIntervals.dt <- data.table(mIntervals)
Intervals.melt <- melt(mIntervals.dt, measure.vars = c("start", "end"))

订购数据并使用滚动加入:

mydata.dt <- data.table(mydata)
setkey(Intervals.melt, value)
setkey(mydata.dt) 

final.dt <- Intervals.melt[mydata.dt, roll = -Inf]

仅使用带有“结束”值的数据,因为您使用了-Inf(与mIntervals中的下一个最接近的值合并)。

final.dt[variable == "end"]

非常快速且灵活。