Brian ONeill's Random Thoughts: July 2009

Tuesday, July 28, 2009

Global Subversion Ignore Settings (Used for Eclipse Project Files)

I hate having to run svn propset svn:ignore in every working copy. It is much easier to open ~/.subversion/config and set global-ignores in the [miscellany] section:


global-ignores = .project .target .classpath .settings *.o *.lo *.la #*# .*.rej *.rej .*~ *~ .#* .DS_Store

Notice that I added the Eclipse files (.project, .classpath, .settings) to the ignore list. Once you save that file, that should be it. The next time you run svn status (or any other command), it will take those ignore patterns into account.

Java Wordnet Library (JWNL) Jar file Repo Entry

Here is the Maven POM entry for the Java Wordnet Library (JWNL):


        <dependency>
            <groupId>net.didion</groupId>
            <artifactId>jwnl</artifactId>
            <version>1.4</version>
        </dependency>

Monday, July 27, 2009

CSV (Comma Separated Values) Processing in Ruby

FasterCSV rocks. You can find it here:
http://fastercsv.rubyforge.org/

Start by installing the gem.


sudo gem install fastercsv

After that, you are all set. Just make sure you require rubygems first.


require 'rubygems'
require 'fastercsv'

i=0
FasterCSV.foreach('blogs.csv') do |row|
   i=i+1
   puts("#{row[2]}")
end

As you can see from above, the row is an array that contains the values from the CSV file.
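
If the file has a header row, the values can also be looked up by column name instead of by position. The following is only a minimal sketch, assuming Ruby 1.9 or later, where FasterCSV was adopted as the standard library's CSV. The column names and sample rows are made up for illustration and stand in for blogs.csv.

```ruby
require 'csv'

# Made-up sample data standing in for blogs.csv (hypothetical columns).
data = <<~ROWS
  id,author,title
  1,brian,Global Subversion Ignore Settings
  2,brian,CSV Processing in Ruby
ROWS

# With headers: true, each row can be indexed by column name.
titles = []
CSV.parse(data, headers: true) do |row|
  titles << row['title']
end

puts titles
```

For a file on disk, CSV.foreach('blogs.csv', headers: true) takes the same option.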

ActiveRecord outside of Rails (even with ODBC)

There are three quick lines that you need in order to use ActiveRecord outside of Rails. First, you need to load gems, then you can load ActiveRecord. Then, you can pick and choose which of your models to use.


require 'rubygems'
require 'activerecord'
require @@RAILS_APP_HOME + '/app/models/foo.rb'

Then, you'll need this little snippet to establish the connection to the database for ActiveRecord:


require 'rubygems'
require 'activerecord'
require 'yaml'

@@DATABASE_CONFIGURATION = YAML::load(File.open(File.dirname(__FILE__) + '/config/databases.yml'))

def establish_connection(database)
  dbconfig = @@DATABASE_CONFIGURATION
  ActiveRecord::Base.establish_connection(dbconfig[database])
#  ActiveRecord::Base.logger = Logger.new(STDERR)
  if (dbconfig['mode'] == 'odbc')
    puts("Connecting to [#{database}]: ODBC,"+
        " DSN=#{dbconfig[database]['dsn']}/#{dbconfig[database]['adapter']}"+
        " [user=#{dbconfig[database]['username']}]")
  else
    puts("Connecting to [#{database}]: #{dbconfig[database]['adapter']}, "+
        "#{dbconfig[database]['database']}@#{dbconfig[database]['host']}"+
        " [user=#{dbconfig[database]['username']}]")
  end
end

def remove_connection
  ActiveRecord::Base.remove_connection
end

I put the above snippet in a central Ruby file, then require that file anywhere I need to use the ActiveRecord objects. After a call to establish_connection, you can start using any model you've imported. Note that the ODBC branch prints a slightly different connection description (DSN and adapter instead of database and host).
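
The snippet expects config/databases.yml to hold one entry per database, keyed by name, with the usual ActiveRecord connection settings underneath. The following is only a sketch of what such a file might contain and how the lookups resolve. Every name and value in the YAML is hypothetical; only the keys (mode, adapter, database, host, username) come from the code above.

```ruby
require 'yaml'

# Hypothetical contents of config/databases.yml; all values are made up.
config = YAML.safe_load(<<~YML)
  mode: direct
  development:
    adapter: mysql
    database: blog_development
    host: localhost
    username: blog
YML

# The same lookups establish_connection performs for a non-ODBC database.
db = config['development']
summary = "#{db['adapter']}, #{db['database']}@#{db['host']} [user=#{db['username']}]"
puts summary
```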

Wednesday, July 22, 2009

Hadoop: java.io.IOException: Type mismatch in key from map

We've been working with Hadoop for a while now, and inevitably newcomers run into this error the first time they create their own Hadoop job. If you are running into this error, it is most likely a mismatch between your Map and/or Reduce implementation and the job configuration.

Your Map implementation probably looks something like this:


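   public static class MapClass extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {
        private Text word = new Text();
        // Map output key/value types must match the JobConf settings below.
        public void map(LongWritable key, Text value,
               OutputCollector<Text, IntWritable> output,
               Reporter reporter) throws IOException {
               ...
        }
   }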

Your map and reduce phases can have different output types, and that is what usually causes the problem. Unless told otherwise, Hadoop assumes the map output types are the same as the job output types. If your phases produce different types, declare both sets of output classes in the JobConf when configuring your job:


        // Set the outputs for the Map
        conf.setMapOutputKeyClass(Text.class);
        conf.setMapOutputValueClass(IntWritable.class);

        // Set the outputs for the Job
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(ArrayWritable.class);

Hope that saves people some time.

Finding a java class in a Jar file (or a set of files)

Classpath problems are often hard to diagnose. Sometimes you pick up an errant class on the classpath that conflicts with the version of the class that you need. (Very evil people sometimes rip apart jars and package all of their dependent classes together in a single jar.)

Anyway, however it happens, it is sometimes necessary to get a list of all classes everywhere, in all jar files. Then you can search that list for duplicate instances of a class.

I can't tell you how many times I've used this trick, especially with the sometimes unclear world of what is packaged into the JDK, application server, and what is in the actual application.

Use this:


find . -name '*.jar' -exec unzip -l {} \; > all_classes.txt

That line finds all the jar files recursively from the current working directory, lists the contents of each archive, and redirects that output to a text file that can be searched.
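
Searching the listing by hand works, but the duplicates can also be picked out mechanically. The following is a rough sketch, assuming the usual unzip -l layout in which each archive starts with an "Archive:" line and each entry line ends with the file name. The exact layout can vary between unzip versions, and the sample listing, jar names, and the duplicate_classes helper are made up for illustration.

```ruby
# Made-up sample in the shape of `unzip -l` output for two jars.
listing = <<~TXT
  Archive:  ./lib/a.jar
    Length      Date    Time    Name
  ---------  ---------- -----   ----
       1234  2009-07-22 10:15   com/example/Foo.class
        567  2009-07-22 10:15   com/example/Bar.class
  Archive:  ./lib/b.jar
    Length      Date    Time    Name
  ---------  ---------- -----   ----
       1300  2009-07-20 09:00   com/example/Foo.class
TXT

# Returns a hash of class file name => jars, for classes found in more than one jar.
def duplicate_classes(listing)
  jars_by_class = Hash.new { |hash, key| hash[key] = [] }
  current_jar = nil
  listing.each_line do |line|
    if line =~ /^\s*Archive:\s+(.+)$/
      current_jar = $1.strip           # remember which jar the next entries belong to
    elsif line =~ /(\S+\.class)\s*$/
      jars_by_class[$1] << current_jar
    end
  end
  jars_by_class.select { |_, jars| jars.uniq.size > 1 }
end

duplicates = duplicate_classes(listing)
duplicates.each { |name, jars| puts "#{name}: #{jars.join(', ')}" }
```

Against the real file, the call would be duplicate_classes(File.read('all_classes.txt')).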

Handy voodoo.

Sunday, July 19, 2009

UnsatisfiedLinkError with Surefire (on Mac OS X)

Wow, this was a needle in a haystack. I recently needed to use LinkGrammar on Mac OS X, and I wanted to use it via the Java Native Interface (JNI). I had been using LinkGrammar on my Linux (Ubuntu) boxes for some time, so I was no stranger to compiling it with Java support. I even tested the compilation and install with:


nm /usr/local/lib/liblink-grammar-java.a | grep Java

However, when I was running maven, I received an UnsatisfiedLinkError. After MUCH googling I found:
http://www.nabble.com/Trouble-with-Java-Native-Libraries-td20293666.html

Simply setting the system property in maven for surefire is not sufficient, because maven changes the system property at runtime, which is too late for the VM to link to the library. Thus you need to use the following in your build section of your pom:


          <plugin>
              <groupId>org.apache.maven.plugins</groupId>
              <artifactId>maven-surefire-plugin</artifactId>
              <configuration>
                  <forkMode>once</forkMode>
                  <workingDirectory>target</workingDirectory>
                  <argLine>-Djava.library.path=/usr/local/lib</argLine>
              </configuration>
          </plugin>

After that, the VM will link to your libraries and surefire and all dependent tests should be able to see the java interfaces and access the necessary libraries.

Thursday, July 16, 2009

Rails Error: no such file to load -- application

I recently deployed an application I was building to production and received the following error:


no such file to load -- application

It turns out that Rails changed the file name of the application controller between versions: 2.3 expects application_controller.rb, while earlier versions expect application.rb. My production environment was 2.2.2 and my development environment was 2.3.2. Renaming the application_controller file back to the old name fixed everything.


mv app/controllers/application_controller.rb app/controllers/application.rb

Rake Production (Specifying an Environment to Rake)

I always forget how to specify an environment for rake commands. So I thought I would capture it here because, as it turns out, it is a hard thing to google.


rake RAILS_ENV=production db:migrate

Exec format error (in cron)

If you are like me, you sometimes get lazy and forget to include the shebang line at the top of your scripts. This usually isn't a problem, but in certain cases (where the shell isn't set, or doesn't exist yet) it will cause problems. So, even if your script is set to be executable (chmod +x), you'll receive an error like:


Exec format error

Just such a case arises when running a script through cron (and run-parts). To remedy this problem, put the following on the first line of the script:


#!/bin/sh

That way, the system will know what to use when executing the script.

Wednesday, July 15, 2009

An Elegant Matching Algorithm In Ruby

Recently, I wanted to sit down and learn Ruby (independent of Rails), so I grabbed a fairly standard hacking problem and went to town on it. I now love Ruby more than ever.

The Problem:
Given a set of people and a set of jobs, where each person may take a different amount of time to do each job, optimally match people to jobs (1:1) to minimize the amount of time it will take to complete all jobs.

Many people will recognize this as a standard matching problem for which the Hungarian Algorithm is a known solution. But those who have implemented the Hungarian Algorithm (or seen implementations of it) know well enough to steer clear. It is error-prone, and a very specific algorithm. So, I sought to implement a more elegant (and generally applicable) solution using graphs.

I found this article over on topcoder describing max-flow algorithms and the beauty of such. I fell in love and decided that I needed to solve this with max-flow.

The formulation:
Let's convert our problem to a bipartite graph. Let one set of nodes be the people, and the other set of nodes be the jobs. Create an edge from each person to each job, with a weight (NOT capacity!) equal to the time it will take that person to do that job.

In our situation, the capacity for each edge is one, since only one person can do a job. Flow is of course initialized to zero for each edge (no one is doing any of the jobs). Lastly, we connect every job to a SINK node in the graph.

The philosophy and algorithm: (the important part)
Essentially, we'll be finding shortest paths in the graph, from each person to the SINK (via jobs) iterating through each of the people. On each iteration, we augment the graph with the path.

To recap the topcoder article, augmenting the graph consists of incrementing the flow for each edge in the augmenting path and adjusting edges to represent the new flow/capacity. There is an edge that represents "residual capacity" with capacity == capacity - flow, and there is an edge in the reverse direction that represents "upstream flow" with capacity == flow.

REMEMBER, in our case:
Capacity is ALWAYS 1.
Flow is either 1 or 0.
This is ENTIRELY independent of the cost/value/weight of an edge.

What does that mean to us, you say? Well, in our case there are only two situations: a person is assigned to a job (flow == 1), or a person is not assigned to a job (flow == 0). In the first case, where a person is assigned a job, there is an edge from the job to the person. During the algorithm, this edge essentially represents the path to UNDO the assignment. In the second case, the edge simply represents making that assignment.
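
To make the idea concrete, here is a minimal sketch of the approach described above. It is not the original implementation. It assumes a square matrix named times, where times[p][j] is the time person p needs for job j, and the name assign_jobs is made up. The SINK is implicit: any job without a person counts as connected to it. Because undo edges carry negative weight, the shortest paths are found with Bellman-Ford-style relaxation rather than Dijkstra.

```ruby
# Returns job_of, where job_of[p] is the job assigned to person p,
# chosen so that the total time is minimal.
def assign_jobs(times)
  n = times.size
  job_of = Array.new(n)      # job_of[p]: job currently assigned to person p
  person_of = Array.new(n)   # person_of[j]: person currently assigned to job j

  n.times do |start|
    # Shortest paths from the unassigned person `start` over the residual graph.
    dist_person = Array.new(n, Float::INFINITY)
    dist_job = Array.new(n, Float::INFINITY)
    prev_person = Array.new(n)   # person from which job j was reached
    dist_person[start] = 0

    (2 * n).times do
      changed = false
      n.times do |p|
        next if dist_person[p] == Float::INFINITY
        n.times do |j|
          next if job_of[p] == j              # edge already carries flow
          d = dist_person[p] + times[p][j]
          if d < dist_job[j]
            dist_job[j] = d
            prev_person[j] = p
            changed = true
          end
        end
      end
      n.times do |j|
        p = person_of[j]
        next if p.nil? || dist_job[j] == Float::INFINITY
        d = dist_job[j] - times[p][j]         # reverse edge: undo an assignment
        if d < dist_person[p]
          dist_person[p] = d
          changed = true
        end
      end
      break unless changed
    end

    # The cheapest unassigned job ends the shortest path to the sink.
    target = (0...n).select { |j| person_of[j].nil? }.min_by { |j| dist_job[j] }

    # Augment: walk the path backwards, flipping assignments along the way.
    j = target
    loop do
      p = prev_person[j]
      previous_job = job_of[p]
      job_of[p] = j
      person_of[j] = p
      break if p == start
      j = previous_job
    end
  end
  job_of
end

times = [[4, 1, 3],
         [2, 0, 5],
         [3, 2, 2]]
assignment = assign_jobs(times)
total = assignment.each_with_index.sum { |job, person| times[person][job] }
puts "assignment: #{assignment.inspect}, total time: #{total}"
```

Starting each search from a single new person keeps the partial assignment optimal for the people handled so far, which is why the negative-weight undo edges never form a negative cycle.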

String Array Initialization in Java

Just for all those out there who also find string array initialization in Java counterintuitive (especially if you are also a Ruby enthusiast)...

To initialize an array in Java, the syntax is

String[] foo = {"bar", "serf", "doodle"};

The key is the curlies. =)

initializationError0 in JUnit

If you end up with an initializationError0 error coming out of JUnit, it is because the JUnit engine can't invoke your test method. Most likely this is because the method you annotated as a test takes a parameter. Simply remove the parameter, and you should be all set.

Monday, July 13, 2009

RSync for Backup over SSH using Different Port Number and Bandwidth Limit

Over the years, I've come to love rsync for offsite backups. It is incredibly flexible and can run over SSH. Here is the most flexible one-line backup you'll ever see:


rsync --bwlimit=100 --partial --progress --size-only -av "/Volumes/Shed/stuff/" --rsh='ssh -p 2828' "foo@offsitebackup.com:/home/stuff/"

This backs up my local stuff (in /Volumes/Shed/stuff) and puts it in /home/stuff on offsitebackup.com. It also keeps partially transferred files (--partial) and shows progress (--progress). When comparing two files, it only considers the sizes (--size-only). I do this because dates could be different. Furthermore, it transfers using ssh, but over a different port (2828 in this case). Finally, I limit the bandwidth that rsync consumes (--bwlimit) to 100 KB/s.

Very handy.

Wednesday, July 1, 2009

JDBC to MySQL Datetime (Time truncated)

Be careful when accessing DATETIME fields in a MySQL database through JDBC. You might think that java.sql.Date (via getDate()) would work, but that actually truncates the value back to midnight. In order to access the actual time, use getTimestamp() instead. Here is the code:


        ResultSet rs = ...
        while (rs.next()) {
          return rs.getTimestamp(1).getTime();
        }

Sorting list of Files in Java (returned from listFiles)

Here is some handy code to get files in order...


        File[] files = dir.listFiles();
        Arrays.sort(files, new Comparator<File>() {
            public int compare(File f1, File f2) {
                return f1.getName().compareTo(f2.getName());
            }
        });