How to query DynamoDB by date (range key), with no obvious hash key?

Although a Global Secondary Index seems to fit your requirements, any attempt to include timestamp-related information in your Hash Key will most likely create what is known as a “hot partition”, which is extremely undesirable.

The uneven access occurs because the most recent items will be retrieved far more frequently than the older ones. This not only hurts your performance but also makes your solution less cost-effective.

See some details from the documentation:

For example, if a table has a very small number of heavily accessed
partition key values, possibly even a single very heavily used
partition key value, request traffic is concentrated on a small number
of partitions – potentially only one partition. If the workload is
heavily unbalanced, meaning that it is disproportionately focused on
one or a few partitions, the requests will not achieve the overall
provisioned throughput level. To get the most out of DynamoDB
throughput, create tables where the partition key has a large number
of distinct values, and values are requested fairly uniformly, as
randomly as possible.

Based on the above, the id does seem to be a good choice for your Hash Key (a.k.a. Partition Key). I wouldn’t change that, because GSI keys are partitioned in the same way. As a separate note, reads are most efficient when you retrieve your data by providing the entire Primary Key, so we should try to find a solution that allows that whenever possible.

I would suggest creating separate tables to store the primary keys based on how recently they were updated. You can segment the data into tables based on the granularity that best fits your use cases. For example, say that you want to segment the updates by day:

a. Your daily updates could be stored in tables with the following naming convention: updates_DDMM

b. The updates_DDMM tables would contain only the ids (the hash keys of the main table)
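As a rough illustration of the naming convention, here is a minimal Python sketch that builds the daily table names. The function names and the example dates are my own assumptions, not part of any DynamoDB API.

```python
from datetime import date, timedelta
from typing import List


def updates_table_name(day: date) -> str:
    """Return the per-day table name in the updates_DDMM convention."""
    return f"updates_{day:%d%m}"


def tables_since(last_refresh: date, today: date) -> List[str]:
    """Names of the daily tables for each day after last_refresh, up to and including today."""
    days = (today - last_refresh).days
    return [
        updates_table_name(last_refresh + timedelta(days=n))
        for n in range(1, days + 1)
    ]


# Hypothetical dates, chosen only for illustration.
print(tables_since(date(2016, 4, 4), date(2016, 4, 6)))
# → ['updates_0504', 'updates_0604']
```

If the app can refresh in the middle of a day, you would probably want to include the refresh day's table as well, so that updates written later that same day are not missed. Also note that DDMM carries no year, so the same names come around again every year unless old tables are deleted.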

Now say that the latest app refresh was 2 days ago (04/07/16) and you need to get the recent records. You would then need to:

i. Scan the tables updates_0504 and updates_0604 to get all the hash keys.

ii. Finally obtain the records from the main table (containing lat/lng, name, etc) by submitting a BatchGetItem with all the obtained hash keys.

BatchGetItem is very fast and well suited to this job: a single request can retrieve up to 100 items (16 MB in total) by primary key.
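The two steps above could be sketched roughly like this in Python. It assumes `client` is a low-level boto3 DynamoDB client (for example `boto3.client("dynamodb")`), that `id` is a string attribute, and that the main table name is passed in by the caller. The function names are hypothetical.

```python
def scan_ids(client, table_name):
    """Return the set of ids stored in one updates_DDMM table."""
    ids = set()
    kwargs = {
        "TableName": table_name,
        # Fetch only the key attribute; "#id" is just an alias for "id".
        "ProjectionExpression": "#id",
        "ExpressionAttributeNames": {"#id": "id"},
    }
    while True:
        page = client.scan(**kwargs)
        # Assumption: id is a string attribute, so values look like {"S": "..."}.
        ids.update(item["id"]["S"] for item in page.get("Items", []))
        if "LastEvaluatedKey" not in page:
            return ids
        # Scan is paginated; continue from where the last page ended.
        kwargs["ExclusiveStartKey"] = page["LastEvaluatedKey"]


def fetch_recent(client, main_table, update_tables, chunk_size=100):
    """Scan the daily tables for ids, then load the full records with BatchGetItem."""
    ids = set()
    for name in update_tables:
        # Set union drops ids that were updated on more than one day;
        # BatchGetItem rejects duplicate keys within a single request.
        ids |= scan_ids(client, name)

    keys = [{"id": {"S": value}} for value in sorted(ids)]
    items = []
    # BatchGetItem accepts at most 100 keys per request.
    for start in range(0, len(keys), chunk_size):
        pending = {main_table: {"Keys": keys[start:start + chunk_size]}}
        while pending:
            response = client.batch_get_item(RequestItems=pending)
            items.extend(response.get("Responses", {}).get(main_table, []))
            # Throttled keys come back as UnprocessedKeys and are retried.
            pending = response.get("UnprocessedKeys") or {}
    return items
```

A call would look something like `fetch_recent(client, "places", ["updates_0504", "updates_0604"])`, where "places" is a made-up name for the main table. A production version would add a backoff delay before retrying unprocessed keys.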

One can argue that creating additional tables will add cost to your overall solution. Well, with a GSI you are essentially duplicating your table (if you project all fields) and paying that additional cost for all ~2k records, whether they were recently updated or not.

It seems counterintuitive to create tables like this, but it is actually a best practice when dealing with time series data (from the AWS DynamoDB documentation):

[…] the applications might show uneven access pattern across all the items
in the table where the latest customer data is more relevant and your
application might access the latest items more frequently and as time
passes these items are less accessed, eventually the older items are
rarely accessed. If this is a known access pattern, you could take it
into consideration when designing your table schema. Instead of
storing all items in a single table, you could use multiple tables to
store these items. For example, you could create tables to store
monthly or weekly data. For the table storing data from the latest
month or week, where data access rate is high, request higher
throughput and for tables storing older data, you could dial down the
throughput and save on resources.

You can save on resources by storing “hot” items in one table with
higher throughput settings, and “cold” items in another table with
lower throughput settings. You can remove old items by simply deleting
the tables. You can optionally backup these tables to other storage
options such as Amazon Simple Storage Service (Amazon S3). Deleting an
entire table is significantly more efficient than removing items
one-by-one, which essentially doubles the write throughput as you do
as many delete operations as put operations.

Source:
http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/GuidelinesForTables.html
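To make the hot/cold idea a bit more concrete, here is a small Python sketch of a housekeeping job for provisioned-throughput tables. It assumes `client` is a low-level boto3 DynamoDB client and that the caller supplies the table names ordered newest first. The function name, the window sizes and the capacity numbers are all hypothetical.

```python
def age_out(client, tables_newest_first, hot=1, keep=7, cold_rcu=1, cold_wcu=1):
    """Dial down tables past the hot window and delete those past the retention window."""
    cooled, deleted = [], []
    for position, name in enumerate(tables_newest_first):
        if position < hot:
            # Still "hot": leave the higher throughput in place.
            continue
        if position < keep:
            # "Cold" but still retained: lower the provisioned throughput.
            client.update_table(
                TableName=name,
                ProvisionedThroughput={
                    "ReadCapacityUnits": cold_rcu,
                    "WriteCapacityUnits": cold_wcu,
                },
            )
            cooled.append(name)
        else:
            # Past retention: dropping the table avoids item-by-item deletes.
            client.delete_table(TableName=name)
            deleted.append(name)
    return cooled, deleted
```

A real job would likely check each table's current settings with `describe_table` first, so that it only requests a change for tables that have not already been dialed down.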

I hope that helps. Regards.
