Heaviside Signal Detection Part 1: Informed non-parametric testing

Steps may be frequently found in geophysical datasets, specifically timeseries (e.g. GPS).  A common approach to estimating the size of the offset is to assume (or estimate) what the statistical structure of the noise is and estimate the size and uncertainty of the step.  In a series of posts I’m hoping to address a simple question which has no simple answer: Without any information about the location of a step, and the structure of the noise in a given dataset (containing only one step), what are some novel ways to estimate the size, uncertainty, and even location of the step?

Let’s first start with the control cases: A single offset of known size, located halfway through the series, in the presence of normally distributed noise of known standard deviation.  A simple function to create this is here:

doseq <- function(heavi, n=2000, seqsd=1, seq.frac=0.5){
 # noise
 tmpseq <- rnorm(n, sd=seqsd)
 # + signal (scaled by stdev of noise)
 tmpseq[ceiling(seq.frac*n):n] <- tmpseq[ceiling(seq.frac*n):n] + heavi*seqsd
 return(tmpseq)
}

To come to some understanding about what may and may not be detected in the idealized case, we can use non-parametric sample-distance test on two samples: one for pre-step information, and the other for post-step.  A function to do that is here:

 doit <- function(n=0, plotit=FALSE){
 xseq <- doseq(n)
 lx2 <- (length(xseq) %/% 2)
 # rank the series
 xseq.r <- rank(xseq[1:(2*lx2)])
 # normalize
 xseq.rn <- xseq.r/lx2
 df <- data.frame(rnk=xseq.rn[1:(2*lx2)])
 # set pre/post factors [means we know where H(t) is]
 df$loc <- "pre"
 df$loc[(lx2+1):(2*lx2)] <- "post"
 df$loc <- as.factor(df$loc)
 # plot rank by index
 if (plotit){plot(df$rnk, pch=as.numeric(df$loc))}
 # Wilcoxon rank test of pre-heaviside vs post-heaviside
 rnktest <- wilcox.test(rnk ~ loc, data=df, alternative = "two.sided", exact = FALSE, correct = FALSE, conf.int=TRUE, conf.level=.99)
 # coin has the same functions, but can do Monte Carlo conf.ints
 # require(coin)
 # coin::wilcox_test(rnk ~ loc, data=df, distribution="approximate",
 # conf.int=TRUE, conf.level=.99)
 # The expected W-statistic for the rank-sum and length of sample 1 (could also do samp. 2)
 myW <- sum(subset(df,loc="pre")$rnk*lx2) - lx2*(lx2+1)/2
 return(invisible(list(data=df, ranktest=rnktest, n=n, Wexpected=myW)))
}

So let’s run some tests, and make some figures, shall we?  First, let’s look at how the ranked series depends on the signal-to-noise ratio; this gives us at least a visual sense of what’s being tested, namely that the there is no-difference between the two sets.

par(mar = rep(0, 4))
layout(matrix(1:10, 5, 2, byrow = F))
X<-seq(0,4.9,by=.5) # coarse resolution, for figure
sapply(X=X,FUN=function(x) doit(x,plotit=TRUE),simplify=TRUE)

Which gives the following figure:

Indexed ranked series.  From top-to-bottom, left-to-right, the signal-to-noise ratio is increased by increments of (standard deviation)/2, beginning with zero signal.

and makes clear a few things:

  1. Very small signals are imperceptible to human vision (well, at least to mine).
  2. As the signal grows, the ranked series forms two distinct sets, so a non-parametric test should yield sufficiently low values for the p statistic.
  3. And, as expected, the ranking doesn’t inform about the magnitude of the offset; so we can detect that it exists, but can’t say how large it is.

But how well does a rank-sum test identify the signal?  In other words, what are the test results as a function of signal-to-noise (SNR)?

   X<-seq(0,5,by=.1) # a finer resolution
   tmpld<<-sapply(X=X,FUN=function(x) doit(x,plotit=FALSE),simplify=T)
   # I wish I knew a more elegant solution, but setup a dummy function to extract the statistics
   tmpf <- function(nc){
      tmpd <- tmpld[2,nc]$ranktest
      W <- unlist(tmpld[4,nc]$Wexpected)
      w <- tmpd$statistic
      if (is.null(w)){w<-0}
      p<-tmpd$p.value
      if (is.null(p)){p<-1}
      return(data.frame(w=(as.numeric(w)/W), p=(as.numeric(p)/0.01)))
   }
   tmpd <- (t(sapply(X=1:length(X),FUN=tmpf,simplify=T)))
   ##
   library(reshape2)
   tmpldstat <- melt(data.frame(n=X, w=unlist(tmpd[,1]),p=unlist(tmpd[,2])), id.vars="n")
   # plot the results
   library(ggplot2)
   g <- ggplot(tmpldstat,aes(x=n,y=value, group=variable))
   g+ geom_step()+ geom_point(symbol="+")+ scale_x_continuous("Signal to noise ratio")+ scale_y_continuous("Normalized statistic")+ facet_grid(variable~., scales="free_y")+ theme_bw()

Which gives this figure:

Results of non-parametric rank-sum tests for differences between pre and post Heaviside signal onset, as a function of signal-to-noise ratio. Top: W statistic (also known as the U statistic) normalized by the expected value. Bottom: P-value normalized by the confidence limit tolerance (0.01).

and reveals a few other points:

  1. If the signal is greater than 1/10 the size of the noise, the two sets can be categorized as statistically different 95 times out of 100. In other words, the null hypothesis that they are not different may be rejected.
  2. The W-statistic is half the expected maximum value when the signal is equal in size to the noise, and stabilizes to ~0.66 by SNR=4; I’m not too familiar with the meaning of this statistic, but it might be indicating the strength of separation between the two sets which is visible in e Figure 1.

I’m quite amazed at how sensitive this type of test even for such small SNRs as I’ve given it, and this is an encouraging first step.  No pun intended.

About these ads

3 thoughts on “Heaviside Signal Detection Part 1: Informed non-parametric testing

  1. Had some trouble getting the ggplot2 part of the code to plot. Running R 2.15.0 on WinXP. Doesn’t crash. Just never displays. Haven’t taken the time to debug; not that familiar with ggplot2. Is there an alternative version that doesn’t use it?

    • The ggplot code is pretty straightforward, but you might try:

      p <- g+ geom_step()+ geom_point(symbol="+")+ scale_x_continuous("Signal to noise ratio")+ scale_y_continuous("Normalized statistic")+ facet_grid(variable~., scales="free_y")+ theme_bw()
      print(p)

      Otherwise you could use this:

      par(mar = c(5.1, 4.1, 4.1, 2.1)) # just sets figure margins back to default values
      layout(matrix(1:2, 2, 1, byrow = F))
      plot(X,unlist(tmpd[,1]),type="s",ylab="W")
      points(X,unlist(tmpd[,1]))
      plot(X,unlist(tmpd[,2]),type="s",ylab="P")
      points(X,unlist(tmpd[,2]))

      which, in hindsight, seems less complicated, and would eliminate the reshape2::melt step.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

Create a free website or blog at WordPress.com.
The Esquire Theme.

Follow

Get every new post delivered to your Inbox.

Join 37 other followers

%d bloggers like this: