MongoDB : Aggregation framework : Get last dated document per grouping ID

To directly answer your question, yes it is the most efficient way. But I do think we need to clarify why this is so.

As was suggested in alternatives, the one thing people are looking at is “sorting” your results before passing to a $group stage and what they are looking at is the “timestamp” value, so you would want to make sure that everything is in “timestamp” order, so hence the form:

db.temperature.aggregate([
    { "$sort": { "station": 1, "dt": -1 } },
    { "$group": {
        "_id": "$station", 
        "result": { "$first":"$dt"}, "t": {"$first":"$t"} 
    }}
])

And as stated you will of course want an index to reflect that in order to make the sort efficient:

However, and this is the real point. What seems have been overlooked by others ( if not so for yourself ) is that all of this data is likely being inserted already in time order, in that each reading is recorded as added.

So the beauty of this is the the _id field ( with a default ObjectId ) is already in “timestamp” order, as it does itself actually contain a time value and this makes the statement possible:

db.temperature.aggregate([
    { "$group": {
        "_id": "$station", 
        "result": { "$last":"$dt"}, "t": {"$last":"$t"} 
    }}
])

And it is faster. Why? Well you don’t need to select an index ( additional code to invoke) you also don’t need to “load” the index in addition to the document.

We already know the documents are in order ( by _id ) so the $last boundaries are perfectly valid. You are scanning everything anyway, and you could also “range” query on the _id values as equally valid for between two dates.

The only real thing to say here, is that in “real world” usage, it might just be more practical for you to $match between ranges of dates when doing this sort of accumulation as opposed to getting the “first” and “last” _id values to define a “range” or something similar in your actual usage.

So where is the proof of this? Well it is fairly easy to reproduce, so I just did so by generating some sample data:

var stations = [ 
    "AL", "AK", "AZ", "AR", "CA", "CO", "CT", "DE", "FL",
    "GA", "HI", "ID", "IL", "IN", "IA", "KS", "KY", "LA",
    "ME", "MD", "MA", "MI", "MN", "MS", "MO", "MT", "NE",
    "NV", "NH", "NJ", "NM", "NY", "NC", "ND", "OH", "OK",
    "OR", "PA", "RI", "SC", "SD", "TN", "TX", "UT", "VT",
    "VA", "WA", "WV", "WI", "WY"
];


for ( i=0; i<200000; i++ ) {

    var station = stations[Math.floor(Math.random()*stations.length)];
    var t = Math.floor(Math.random() * ( 96 - 50 + 1 )) +50;
    dt = new Date();

    db.temperatures.insert({
        station: station,
        t: t,
        dt: dt
    });

}

On my hardware (8GB laptop with spinny disk, which is not stellar, but certainly adequate) running each form of the statement clearly shows a notable pause with the version using an index and a sort ( same keys on index as the sort statement). It is only a minor pause, but the difference is significant enough to notice.

Even looking at the explain output ( version 2.6 and up, or actually is there in 2.4.9 though not documented ) you can see the difference in that, though the $sort is optimized out due to the presence of an index, the time taken appears to be with index selection and then loading the indexed entries. Including all fields for a “covered” index query makes no difference.

Also for the record, purely indexing the date and only sorting on the date values gives the same result. Possibly slightly faster, but still slower than the natural index form without the sort.

So as long as you can happily “range” on the first and last _id values, then it is true that using the natural index on the insertion order is actually the most efficient way to do this. Your real world mileage may vary on whether this is practical for you or not and it might simply end up being more convenient to implement the index and sorting on the date.

But if you were happy with using _id ranges or greater than the “last” _id in your query, then perhaps one tweak in order to get the values along with your results so you can in fact store and use that information in successive queries:

db.temperature.aggregate([
    // Get documents "greater than" the "highest" _id value found last time
    { "$match": {
        "_id": { "$gt":  ObjectId("536076603e70a99790b7845d") }
    }},

    // Do the grouping with addition of the returned field
    { "$group": {
        "_id": "$station", 
        "result": { "$last":"$dt"},
        "t": {"$last":"$t"},
        "lastDoc": { "$last": "$_id" } 
    }}
])

And if you were actually “following on” the results like that then you can determine the maximum value of ObjectId from your results and use it in the next query.

Anyhow, have fun playing with that, but again Yes, in this case that query is the fastest way.

Leave a Comment