How do you query object collections in Java (Criteria/SQL-like)?

Filtering is one way to do this, as discussed in other answers.

Filtering is not scalable though. On the surface time complexity would appear to be O(n) (i.e. already not scalable if the number of objects in the collection will grow), but actually because one or more tests need to be applied to each object depending on the query, time complexity more accurately is O(n t) where t is the number of tests to apply to each object.

So performance will degrade as additional objects are added to the collection, and/or as the number of tests in the query increases.

There is another way to do this, using indexing and set theory.

One approach is to build indexes on the fields within the objects stored in your collection and which you will subsequently test in your query.

Say you have a collection of Car objects and every Car object has a field color. Say your query is the equivalent of “SELECT * FROM cars WHERE Car.color="blue"“. You could build an index on Car.color, which would basically look like this:

'blue' -> {Car{name=blue_car_1, color="blue"}, Car{name=blue_car_2, color="blue"}}
'red'  -> {Car{name=red_car_1, color="red"}, Car{name=red_car_2, color="red"}}

Then given a query WHERE Car.color="blue", the set of blue cars could be retrieved in O(1) time complexity. If there were additional tests in your query, you could then test each car in that candidate set to check if it matched the remaining tests in your query. Since the candidate set is likely to be significantly smaller than the entire collection, time complexity is less than O(n) (in the engineering sense, see comments below). Performance does not degrade as much, when additional objects are added to the collection. But this is still not perfect, read on.

Another approach, is what I would refer to as a standing query index. To explain: with conventional iteration and filtering, the collection is iterated and every object is tested to see if it matches the query. So filtering is like running a query over a collection. A standing query index would be the other way around, where the collection is instead run over the query, but only once for each object in the collection, even though the collection could be queried any number of times.

A standing query index would be like registering a query with some sort of intelligent collection, such that as objects are added to and removed from the collection, the collection would automatically test each object against all of the standing queries which have been registered with it. If an object matches a standing query then the collection could add/remove it to/from a set dedicated to storing objects matching that query. Subsequently, objects matching any of the registered queries could be retrieved in O(1) time complexity.

The information above is taken from CQEngine (Collection Query Engine). This basically is a NoSQL query engine for retrieving objects from Java collections using SQL-like queries, without the overhead of iterating through the collection. It is built around the ideas above, plus some more. Disclaimer: I am the author. It’s open source and in maven central. If you find it helpful please upvote this answer!

Leave a Comment