Algorithm to find the maximum sum in a sequence of overlapping intervals

This is a weighted variation of interval scheduling; it’s solvable in O(N log N) with dynamic programming.

Let an interval be g(start, stop, score), and let them be sorted by stop. For simplicity, let’s assume for now that all stop is unique.

Let best[i] be the best score we can get when we’re allowed to use g[1], ..., g[i]. We don’t have to use them all, of course, and generally we can’t because the subset of intervals we use must be non-overlapping.

  • Clearly best[0] = 0. That is, since we can’t use any interval, the best score we can get is 0.
  • For any 1 <= k <= N, we have:
    • best[k] = max( best[k-1], best[j] + g[k].score ), where
      • j is the largest index such that g[j].stop < g[k].start (j may be zero)

That is, given that we’re allowed to use g[1], ... g[k], the best we can do is the better scoring of these two options:

  • We do not include g[k]. Thus, the score of this option is best[k-1].
    • … because that’s the best we can do with g[1], ... g[k-1]
  • We include g[k], and to its left we do the best we can with all the genes that don’t overlap with g[k], i.e. all g[1], ..., g[j], where g[j].stop < g[k].start and j is as large as possible. Thus, the score of this option is best[j] + g[k].score.

(Note the optimal substructure and overlapping subproblems components of dynamic programming embodied in the above equation).

The overall answer to the question is best[N], i.e. the best score we can get when we’re allowed to use all the genes. Oops, did I say genes? I mean intervals.

This is O(N log N) because:

  • Sorting all the intervals takes O(N log N)
  • Finding j for each k is O(log N) using binary search

If several genes can have the same stop values, then nothing changed: you still have to search for the rightmost j. In e.g. Python this is easy with bisect_right. In Java where the standard library binary search doesn’t guarantee which index is returned in case of ties, you can (among many options) follow it with a linear search (for O(N) worst-case performance), or another series of binary searches to find the right most index.

Oops did I say genes again? I mean intervals.

Related questions

Leave a Comment