Post-processing OPTICSxi Clustering

In a previous post, I expressed my concerns regarding the results of OPTICSxi clustering. Namely, I mentioned an “annoying” spike effect that turns up massively in almost every simulation (so massively that it is almost a “feature”).

A post on StackOverflow originated a useful exchange of ideas with the author of the ELKI framework. He pointed out a “weakness” of the algorithm: it bases its partitioning on the reachability distance, which is not always synonymous with “spatial closeness”. In other words, outliers standing in the middle of clusters can be erroneously interpreted as belonging to one cluster or another.

Since this is a problem with the partitioning algorithm, the solution could involve improving the partitioning algorithm, using a different partitioning algorithm, or using a different clustering algorithm altogether. As a “quick fix”, I opted for some cluster “post-processing”, in order to remove the outliers.

So my research question was: how to identify a point that is a spatial outlier?

I tried a couple of approaches that I will discuss now, as I think they may be useful for someone or generate a useful discussion.

[Image: point cloud with the outlier points highlighted in colour]

In the image above, the coloured points are outliers, according to our definition. One simple approach would be to calculate the average path length, i.e. the average distance of one point to all other points in the point cloud. We could then test each point and, if its average distance is greater than a certain value (let us say 5), consider it an outlier and remove it from the dataset. I found that this approach actually yields good results and was able to reduce the “spikes” that we see in this figure:

[Image: clustering results showing the “spike” effect]

To this result:

[Image: clustering results after removing the outliers]

The code that calculates the average distance for each point is below:

// Computes the average distance from 'coord' to all the points of the cluster
// whose DBIDs are in 'ids' and whose coordinates are stored in 'dataMap'.
public static double getAverageDistance(Coordinate coord, ModifiableDBIDs ids,
        HashMap<String, Coordinate> dataMap) {
    double sum = 0.0;
    try {
        // iterate over all DBIDs of the cluster and accumulate the distances
        for (DBIDIter iter = ids.iter(); iter.valid(); iter.advance()) {
            sum += coord.distance(dataMap.get(DBIDUtil.toString(iter)));
        }
    } catch (Exception exp) {
        System.out.println("Unexpected exception: " + exp.getMessage());
    }
    return sum / ids.size();
}
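For illustration, the function above could be used along these lines to drop the outliers of a cluster; the method name, the threshold parameter and the temporary DBID set are assumptions of this sketch, not the original code.

// Illustrative sketch: remove from 'ids' every point whose average distance
// to the rest of the cluster exceeds a fixed threshold.
public static void removeOutliers(ModifiableDBIDs ids,
        HashMap<String, Coordinate> dataMap, double threshold) {
    ModifiableDBIDs toRemove = DBIDUtil.newHashSet();
    for (DBIDIter iter = ids.iter(); iter.valid(); iter.advance()) {
        Coordinate coord = dataMap.get(DBIDUtil.toString(iter));
        if (getAverageDistance(coord, ids, dataMap) > threshold) {
            toRemove.add(iter);
        }
    }
    // remove only after the iteration, to avoid modifying the set while iterating
    ids.removeDBIDs(toRemove);
}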

From the point of view of “correctness”, this algorithm suffers from the drawback of simply removing the outlier from the cluster (and thus converting it into “noise”). To avoid this, it would be necessary to test whether the point actually belongs to another cluster.

Apart from that, in terms of computation the algorithm is extremely “costly”, the most costly part being the computation of the distances from all points to all points: literally an NxN matrix, which can easily grow to huge numbers with a large cluster.

To avoid that, I tried a few different approaches. One was to calculate these distances for only a part of the dataset. We know that, by the way the algorithm is written, the outliers tend to “appear” either at the beginning or at the end of the cluster. With this in mind, I calculated the average based on the “middle” 60% of the values (the points between the first 20% and the last 20%) and tested the condition only for the first 20% and the last 20%, roughly as sketched below (this refinement was actually not needed, as the “costly” part of the algorithm is the distance matrix and not the condition testing). I was not able to reach any reasonable results with this approach, either because the points were not ordered (which defeats the whole purpose of my “slicing”) or because outliers were appearing outside these “classes” (i.e. in the middle of the dataset).
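A rough sketch of that slicing idea, assuming the cluster coordinates are available as a list in cluster (OPTICS) order; the names and the way the reference average is built are assumptions of the sketch, not the code I actually used.

// Sketch of the "slicing" idea: compute a reference average over the middle
// 60% of the ordered cluster points and test only the first and last 20%.
public static void testBorderPoints(List<Coordinate> points, ModifiableDBIDs ids,
        HashMap<String, Coordinate> dataMap) {
    int n = points.size();
    int lower = (int) (0.2 * n);
    int upper = (int) (0.8 * n);
    // reference average distance, computed only over the "middle" 60%
    double reference = 0.0;
    for (int i = lower; i < upper; i++) {
        reference += getAverageDistance(points.get(i), ids, dataMap);
    }
    reference /= (upper - lower);
    for (int i = 0; i < n; i++) {
        if (i >= lower && i < upper) {
            continue; // only the "border" 40% of the points are tested
        }
        boolean outlier = getAverageDistance(points.get(i), ids, dataMap) > reference;
        // ... flag points.get(i) for removal if 'outlier' is true
    }
}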

The other approach that I tried was to work with the “final” polygon (the convex hull of the cluster), rather than with the raw points. The polygon border has only a few points when compared to the point cloud used to generate it, so it is very easy and quick to identify the “outlier” on the polygon border. However, removing it will understandably result in a “strange” polygon, since the reality is: if that point had not been used to build the polygon, another point (which we do not have at this stage!) would have been used, and so the real geometry would not be this one. This is particularly noticeable with overlapping clusters (which no longer overlap). After this experience, it became clear that the processing would have to be done on the cluster point dataset, before building the polygon.
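For reference, building the cluster polygon itself is the easy part; a minimal sketch with JTS is shown below, assuming the cluster coordinates have already been collected into an array (older JTS releases use the com.vividsolutions namespace, newer ones use org.locationtech). The difficulty described above lies in editing the hull border afterwards, not in constructing it.

import com.vividsolutions.jts.algorithm.ConvexHull;
import com.vividsolutions.jts.geom.Coordinate;
import com.vividsolutions.jts.geom.Geometry;
import com.vividsolutions.jts.geom.GeometryFactory;

public class ClusterHull {
    // Returns the convex hull of the cluster points; dropping one of its
    // border vertices afterwards is what produces the "strange polygons"
    // described above.
    public static Geometry hullOf(Coordinate[] clusterCoords) {
        return new ConvexHull(clusterCoords, new GeometryFactory()).getConvexHull();
    }
}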

I ended up with an algorithm that successfully removes the “spike” effects from OPTICSxi, but it is rather costly in terms of time (more costly than OPTICSxi itself) and, unfortunately, this cost grows quadratically with the size of the dataset (it is dominated by the NxN distance matrix), which limits its applicability to large datasets.

UPDATE

The approach above was improved by testing the distances against the two neighbours of each point (the previous and the next point), rather than against the entire matrix. With this hack, the running time of the algorithm was reduced to reasonable values that do not grow so much with the size of the dataset.
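A minimal sketch of this neighbour-based test is below, assuming the cluster points are available as an ordered list; the method name, the threshold parameter and the decision to keep a point that is close to at least one of its two neighbours are assumptions of the sketch, not the actual implementation.

// Sketch: keep a point only if it is close enough to at least one of its
// two neighbours (previous and next) in the ordered list of cluster points.
public static List<Coordinate> filterByNeighbours(List<Coordinate> points,
        double threshold) {
    List<Coordinate> kept = new ArrayList<>();
    for (int i = 0; i < points.size(); i++) {
        Coordinate prev = (i > 0) ? points.get(i - 1) : null;
        Coordinate next = (i < points.size() - 1) ? points.get(i + 1) : null;
        boolean nearPrev = (prev != null) && points.get(i).distance(prev) <= threshold;
        boolean nearNext = (next != null) && points.get(i).distance(next) <= threshold;
        if (nearPrev || nearNext) {
            kept.add(points.get(i));
        }
    }
    return kept;
}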