R and C: arrays

Let’s start this with some disclosure: I have recently grown to love R (despite it’ drawbacks), and I first ‘learned’ C about 10 years ago in an undergraduate class (wow, that’s weird to think about).  Apparently, though, I was staring at girls more than my textbook or the PuTTY terminal, and now I’m paying for it. But who could blame me? I was a mechanical engineer, fresh out of high school, and I thought a computer existed primarily for using Napster and Microshit Word.  (And  Napster was free, cool, and awesome.) So now, as a graduate student, I realize I should have cared about that class on C.

Yes, FORTRAN is still extensively used in the geosciences, but reading it is about as enjoyable as looking at a troff document. Fortunately Stack Overflow (SO) exists, and is well ranked in most Google queries, so I can quickly refresh my C skills.  It’s unfortunate that public forums often engender snarky responses, especially to half-baked questions, but SO has managed to keep that effect to a minimum;  I think it’s mostly due to the ingenious point-awarding system used by them.

OK, why am I writing all this?  I recently decided that using METEOR, a FIR filter-design code, would be fruitful.  It compiles fine with gcc and there’s an example to test it’s usage, but that’s not good enough these days:  I want to access it directly inside R.  There is a function in the twenty-something year old code that uses STDIO’s getchar(), but that poses an input problem to a naive user (me).  So, I don’t know if there’s a solution besides this…

Let’s say I have a C-code rdarr.c:

#include <stdio.h>
#include <stdlib.h>
int rdarr(char *myarr[], int narr[])
{
        int n = narr[0];
        int i;
        printf(">>>> %i\n",n);
        if (n < 1E4) {
                for(i = 0; i < n; i++){
                        printf("\t%s\n", myarr[i]);
                }
        }
        printf("<<<<\n");
        return (EXIT_SUCCESS); /* stdlib */
}

which is compiled using R CMD SHLIB rdarr.c.

If there is a function in R which accesses that C-code, like this:

dyn.load("rdarr.so")
rdarrC <- function(arr, narr){
unlist(.C("rdarr", arr, narr))
}

we can say, for example, rdarrC(c("a","b"), 2) and get:

>>>> 2
        a
        b
<<<<

So the trick will be to use something like this as a string-block reader. But, because R apparently uses arrays for any inputs to .C(), this only works when the zero-index value of the array is referenced (thus the int n = narr[0]; statement in the C-code).

It is a bit annoying to me, but I don’t know another way to do it, and the same situation might be a major gotcha-type issue for others. While I do still consider myself new to all this, there are many exciting directions integrating C with R can go.

Suggestions for improvement welcomed; but, cheers – to the learning curve!

XI Brewing logo: All about choice.

Samer is having a go at designing the XBC site.  These are my logo suggestions:

or…

An alternate logo. Thanks to Angry Boys for inspiration. OYE, NATHAN!

 

What to do about Robert and Fernando?

Where in the world is Breitbart?!?!  Hey why don’t you publish data in HDF5?  Oh yeah, cool!  OH WAIT, you also need libraries to read that format, and/or a non-totally-open-source package like PyTables, even though it’s so `widely used’ and ‘everyone uses it’, and each dataset is 500 GB.  DUH, it’s so obvious.  And why don’t you parallelize that HDF5 random-access read so it can run on the EC2 cluster?  DUH!!!!

To address those insightful and poignant questions, I wrote a short python script to test the transition from ASCII (or the ‘proprietary’ R format) to HDF5, aiming at all you non-HDF5 users out there:

from hdf5lib import eat, my, ass
class BreitbartDickfaceFartmouth(object):
"""A new class for assholes, like Andrew Breitbart"""
  def __init__(self, Robert, Fernando):
    self.asshole_dict = {'ass':'robert', 'hole':'fernando'}
    try:
      print self.asshole_dict[Robert]
      print self.asshole_dict[Fernando]
    except:
      print 10*"Oh well, I guess either they are both assholes or only one is, but I don't care which."

Easy, right!?

Note: Both Robert and Nano are ‘Pra comer’ in my book.

What to do about publishing data?

There seems to be a scale that’s tipping at the moment: Data or code that should be published* is often not.  This should seem like an odd statement for anyone in science, but it should be easy to show that most publications (at least in the geophysics community) publish data in the form of a map, or a scatter plot, or in some other non-accessible way (besides visualization).  Where is the spirit of reproducibility with such a practice?  

Publications that do supplement their paper with data usually provide it in the form of a table in an ASCII flat file.  And yes, this is what I’m arguing for, but I have come across some hideously formatted tables (the worst is when the table is in HTML or in a PDF) that nearly make me abandon all hope for the work.

So why am I writing this?  Well, I’m preparing a paper for submission right now and thinking about how I want to publish the data (besides typeset tables or graphs), and I think I’ve come up with a solution for smaller datasets: An R package of datasets in CRAN.

What is CRAN?  The Comprehensive R Archive Network.  It’s a place for users of the R language to publish their code and/or data in a way that’s usable to the R community.  Before the package (or an updated version of it) is accepted, it is checked for consistency; this means the only worry should be whether or not the code and/or data in the package is a pile of useless crap.

For the purposes of data publishing, we would need to consider a few things:

  1. Identification.  How will the user know what to access?
    1. The package would need to be identified somehow, either by a topic, or a journal, or an author/working-group.  I’d choose author since it assigns responsibility to him/her.
    2. The dataset in the package will need to be associated with a published work – perhaps a Digital Object Identifier handle?
  2. Format.
    1. Obviously, once the data are in the package they will may be accessed in the R-language.
    2. Databases such as flat files, for example, are not necessarily optimally normalized.  So I propose publishing the data with as high a normalization as the author can stand.  This will allow for robust subsetting of the data with, for example, sqldf (if SQL commands are your fancy).
  3. Size: What happens when the dataset to be archived is very large?  I propose the package author find a repository that will host the file, and then write a function which accesses the remote file.
  4. Versions.  I’m certainly no expert in version control, but there should be strict rules with this.  0.1-0 might be the first place to start which would mean: ‘zeroth version’.'one dataset’-'no minor updates/fixes’  
  5. Functions. Internal functions should probably be avoided unless they are necessary, used to reproduce a calculation (and hence a dataset), or are needed for accessing large datasets [see (3)].
  6. Methods and Classes. 
    1. The class should be something specific either to the package or dataset – I would argue for dataset (i.e. publication).
    2. There should be at least the print, summary, and print.summary methods for the new class.  If, as I propose in (A), the class is specific to the dataset, the methods could be easily customizable.

So I suppose this post has gone on long enough, but the argument seems sound given some fundamental considerations.  Data should be published, and the R/CRAN package versioning system may provide just the right opportunity for doing so easily, reproducibly, and in a way which benefits many more than just the scientific community which has journal access.  It would also force more people to learn R!

 

*I apologize if you’re not granted access this piece.

Music. Focus.

I have no problem generating focus, but there are some albums I listen to while I’m working that just seem to carry me along, weightless, rhythmic, and hyper-focused.  According to my iTunes play counter, they are:

  1. All Hour Cymbals, Yeasayer
  2. Oracular Spectacular, MGMT
  3. De-Loused in the Comatorium, the Mars Volta
  4. Takk…, Sigur Rós
  5. Security, Peter Gabriel
  6. Two Conversations, the Appleseed Cast
  7. Tao of the Dead, …And You Will Know Us By the Trail of Dead
  8. Feels, Animal Collective
  9. A Weekend in the City, Bloc Party
  10. The Lamb Lies Down on Broadway, Genesis

I’m fortunate to work in an environment that doesn’t restrict things like music, but that just means I end up on Facebook or Google Reader more often than I should.

Apache root re-direct

I setup my work server to re-direct here, since I was sick of having a boring front page.  So, if you type http://kook.ucsd.edu you’ll end up on this blog.  Simplistically, you only need to add a RewriteRule to /etc/apache2/htppd.conf  and restart the server.

<IfModule alias_module>
RewriteEngine on
RewriteRule ^/?$ http://geokook.wordpress.com [R,L]
</IfModule>

which re-writes only the root directive, and nothing below.  Nice and easy.

The server restart is easy in OSX: System Preferences > Sharing > (toggle off/on Web Sharing).

Collapse pasting in R

How do you take a vector in R, and paste the individual elements together?  Easy, a loop.  No, bad! **slap on the wrist**  Here’s an example why not.  

So let’s start by concatenating a few character strings and showing some examples using base::paste:

a <- c("something", "to", "paste")
[1] "something" "to" "paste" 
> paste(a, sep="_") 
[1] "something" "to" "paste"
> paste(a, collapse="_") 
[1] "something_to_paste"
> paste(a, sep="_", collapse="-")
[1] "something-to-paste"

And there we have it, a simple example demonstrating the difference between a separator expression and a collapse expression without using a hideous loop.  

So finally the moral: read the help pages!

SO makes me happy.

If you write any code, and there is anything you need to find out, you should be trolling Stack Overflow (SO).  It’s astounding what you can find, and how willing people are to help (unless you write a terrible question, then you’re chastised).

Just yesterday I realized my string formatting skills were not advanced enough to parse the horridly formatted NEIC moment tensor solutions (here’s an example).  Hi ho, hi ho, it’s off to SO I go. I wrote a question, and within an hour I had two useful (and different) solutions to work with – this saved me probably a day or two worth of work.

Blog at WordPress.com.
Theme: Esquire by Matthew Buchanan.

Follow

Get every new post delivered to your Inbox.