An out-of-memory error in clustering

  1. toCharArray copies the String into a new char[] on every call, which is entirely unnecessary. Use charAt(i) to read individual characters instead.

  2. Don’t use String to store bits or numbers. If you have at most 32 bits, use an int. If you have at most 64 bits, use a long. If you have more, use a long[].

  3. Try some optimizations based on bit operations. For example, you can compute the Hamming distance with a single xor followed by a bit count (Integer.bitCount or Long.bitCount). You can also get a cheap lower bound from the number of set bits: if one value has 6 bits set and the other has 2, at least 4 bits have to differ.

  4. Avoid ArrayList<Integer> and ArrayList<Vertex>. These need roughly 20 bytes per integer rather than 4. That is 400% overhead. Use an int[] plus a size counter, and double the array's length when it is full (ArrayList grows in a similar way, but stores boxed integers).
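Points 2 to 4 can be sketched roughly as follows. This is only an illustration, not the asker's code: the names BitVectors, parseBits, hammingLowerBound and IntList are made up for this example, and it assumes each bit string consists of '0' and '1' characters and fits into 64 bits (longer vectors go into a long[]).

```java
import java.util.Arrays;

/** Bit vectors packed into long values instead of Strings of '0' and '1'. */
final class BitVectors {
    /** Parses up to 64 characters of '0'/'1' into one long (assumed input format). */
    static long parseBits(String bits) {
        return Long.parseUnsignedLong(bits, 2);
    }

    /** Hamming distance: xor marks the differing positions, bitCount counts them. */
    static int hamming(long a, long b) {
        return Long.bitCount(a ^ b);
    }

    /** Cheap lower bound on the Hamming distance, from the set-bit counts alone. */
    static int hammingLowerBound(long a, long b) {
        return Math.abs(Long.bitCount(a) - Long.bitCount(b));
    }

    /** Hamming distance for vectors longer than 64 bits, stored as equal-length long[]. */
    static int hamming(long[] a, long[] b) {
        int distance = 0;
        for (int i = 0; i < a.length; i++) {
            distance += Long.bitCount(a[i] ^ b[i]);
        }
        return distance;
    }
}

/** Growable list of primitive ints (hypothetical helper, not a JDK class). */
final class IntList {
    private int[] data = new int[16];
    private int size = 0;

    /** Appends a value, doubling the backing array when it is full. */
    void add(int value) {
        if (size == data.length) {
            data = Arrays.copyOf(data, size * 2);
        }
        data[size++] = value;
    }

    int get(int index) {
        if (index >= size) {
            throw new IndexOutOfBoundsException("index " + index + ", size " + size);
        }
        return data[index];
    }

    int size() {
        return size;
    }
}
```

For the example from point 3, parseBits("111111") and parseBits("000011") have 6 and 2 bits set, so hammingLowerBound returns 4, and hamming returns 4 as well. If you compare each vector against many others, it is probably worth storing its bit count once so the lower bound costs only a subtraction.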

Use a profiler such as VisualVM to see where you waste memory.

My guess is that String[][] dist2 = new String[200000][276]; is to blame. That is 200000*276 = 55.2 million strings; at an estimated 50 bytes each, that comes to about 2.76 GB, which is probably enough to eat all your memory. Get rid of useless strings!
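To make the arithmetic concrete, here is a small back-of-the-envelope calculation. The 50 bytes per String is the rough guess from above, not a measured value, and the 8-byte figure assumes every entry could be packed into a single long, which may not hold for your data. The class name MemoryEstimate is made up for this example.

```java
/** Rough memory estimate for a 200000 x 276 table (illustrative numbers only). */
final class MemoryEstimate {
    static final long ROWS = 200_000L;
    static final long COLS = 276L;

    /** Total payload bytes for the table at a given cost per entry. */
    static long bytesFor(long bytesPerEntry) {
        return ROWS * COLS * bytesPerEntry;
    }

    public static void main(String[] args) {
        // Guessed 50 bytes per String entry: 2,760,000,000 bytes.
        System.out.println("String entries: " + bytesFor(50) + " bytes");
        // 8 bytes per long entry, if each value fits in 64 bits: 441,600,000 bytes.
        System.out.println("long entries:   " + bytesFor(Long.BYTES) + " bytes");
    }
}
```

Even under these rough assumptions, the String version needs several times the memory of the packed version, before counting the array of references itself.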
