Rube Goldberg was born in San Fransisco in 1883. When he was a child, he wanted get an education in creative arts, but his parents wanted him to be an engineer(sounds familiar anyone?). He eventually got an engineering degree from Berkeley.

Once he got his degree, he started doing the stuff he liked, and eventually became very famous for his cartoons. His main theme was drawing pictures of machines (Goldberg Machines) that did a very simple job in a very complex way. He is rumored to spend up to 30 hours for minutiae in his pictures. Below you will see a “self operating napkin machine” he devised.

Recently I was looking at data clustering techniques in computer science, and wanted to write a post about distance metrics that are used in clustering algorithms. For Mahalanobis distance i wanted to give the Java code, but when I compared my results with online solvers, I noticed that my calculations of the covariance matrix was not correct.

So, I peeked into Apache Math package to see which part of my code was wrong. It turns out they were using the bias. My code used was unbiased. Note that using bias or not is just a choice, you can use either way.

But when I saw this, i also noticed something unusual in the Java code in the package. An equation was written in a complex way.

Instead of


double temp=0;

double length=xArray.length;

double xMean=mean(xArray);

double yMean=mean(yArray);

for (int eachRow = 0; eachRow < length; eachRow++) {

			temp+=(xArray[eachRow]-xMean)*(yArray[eachRow]-yMean);

}

result=temp/length;

the Math Package has this implementation:


double temp=0;

double length=xArray.length;

double xMean=mean(xArray);

double yMean=mean(yArray);

for (int i = 0; i < length; i++) {
			double xDev = xArray[i] - xMean;
			double yDev = yArray[i] - yMean;
			result += (xDev * yDev - result) / (i + 1);
}

The first code has 2 subtractions, 2 multiplications for each vector dimension.
The second code has three subtractions, two multiplications, 2 additions and 1 division for each dimension.
Basically both these codes do the same thing, so I wondered (and still am) how it was better to write it this way. There are two potential reasons. First, the Math code always works on smaller “result” variables, so stack overflow is avoided. But anyways it is using double, and double holds 308+ digits. Is overflow really a worry here? Second reason might be that the big values are more difficult to operate with than small ones. But is it really worth adding multiple operations to a simple calculation just because addition(result+=) will be a little bit more simple?

To see how these code fragments compare in runtime efficiency, I created 2000 vector pairs of 2000 length each, and set each pair value to random double values. For each new pair of vectors, random double values are increased by a multitude of 10 000 until 2000 vector pairs are all used.

Overall, Math code takes two times more than the simple code to compute. So is this Math code a Goldberg machine, or am i missing anything?

Advertisements