Curse of dimensionality is a widely heard of, largely misunderstood concept in machine learning. There is one single explanation of it circulating, but there is more to it. I will explain what is the curse, and how it complicates everything.

The term curse of dimensionality was coined by Richard Bellman. Dimension refers to the characteristics (also called attributes) of a data object. For example, an athlete can be a data object, and dimensions can be (1) age, (2) height, (3) gender, and (4) nationality. In this case, for every data object we have 4 dimensions. In the chart below, you can see a table from Hockey International.  We have 10 players(data objects, also called observations) and 4 dimensions(vertical jump, sprint, top speed and acceleration). For now, we only see the values for Vertical Jump, the other dimensions are hidden in the red rectangle. Assume that we want to group players. In computer science, groups are called clusters, and putting each player to a group is called clustering. When we only have vertical jump, clustering is very easy. In total, we have 5 clusters.

One dimensional data

When more dimensions need to be considered, we say that dimension increases. For example, if we want to consider Sprint in addition to vertical Jump, we will now have two dimensions.  In real life, there can be hundreds of dimensions. This increasing dimension number makes it very difficult to cluster data objects.

Two dimensional data

On web, the problem of high dimensionality is explained by just one reason, but there are many reasons. I will explain the four most significant  ones.

Advertisements