Whole database synchronization is the process of synchronizing a client device with data from a server by sending a pre-generated database file. This typically takes the form of an SQLite database file, owing to SQLite's ubiquity in mobile and embedded applications, though the pattern can technically be implemented with any type of database file, including pre-generated JSON- or XML-encoded serialized datasets. It is not a perfect pattern, but it can be useful, and it serves as a good foundation from which we can better understand and appreciate the more advanced patterns that we will discuss throughout this book.
Source Code Samples
As a quick side note, samples and accompanying files for this chapter can be found at the following locations:
Before we get ahead of ourselves though, let’s discuss some of the criteria that are recommended for this pattern to be a good fit:
The synchronization is one-way, server to client
There are patterns for addressing two-way synchronization, or even just client-to-server synchronization, but this is not one of them. We will address those in a later chapter; for now, this pattern is focused only on use cases involving sending data in one direction: from the server to client devices.
The data changes infrequently
This is an inherently subjective qualification, but an important one nevertheless. As we will see later in this chapter, updates that occur too frequently will pose big problems for this pattern and ultimately negate any benefit gained from the ease of implementation. In the end, you'll need to determine how frequent is too frequent, but don't expect this pattern to scale to daily or possibly even weekly updates for all but the smallest datasets.
All users have access to generally all the same data
This isn't so much a technical requirement as it is a requirement of practicality. Technically you can pre-generate a unique database file per user of your system, but in practice, you will find that this doesn't scale. Instead, this pattern is ideally suited to situations where all of your users have access to the same data, and thus you only have to generate one database file.
The dataset is relatively small
Very much like the requirement that your data change infrequently, this too is subjective. The two are also related: smaller datasets can afford to be updated more frequently without putting too large a demand on your users, while larger datasets necessitate a slower update cadence.
This last bullet and its relationship to update frequency can be best understood if you take into consideration the bandwidth limitations or data caps that your client devices operate under. For example, it is not uncommon for mobile devices to have a metered amount of data per month, so in that context, a 200-megabyte dataset that is updated monthly puts roughly the same demand on your users as a 50-megabyte dataset that is updated weekly. There are other dynamics to consider as well, but we'll reserve those for discussion later in this chapter – for now, let's look at why we might not want to use this pattern.
The situations where you might not want to use this pattern are generally the inverse of the requirements. Stated plainly, you'll not want to use this pattern if any of the following are true for your use case:
- Two-way synchronization is necessary
- The data updates frequently
- Users often only see a subset of data tied to their profile by permissions, memberships, etc.
- The dataset itself is larger than some arbitrary size
Now that we have a better understanding of when this pattern is useful, and when an application might need a more advanced synchronization pattern, let’s dive into the details of how this algorithm works.
The algorithm for replacing the whole dataset generally has five stages:
- A batch process reads all the data that is relevant to the mobile application, reformatting, condensing, and redacting as necessary
- An SQLite database is generated with the data from step 1 and stored for later retrieval
- A process on the mobile device makes an API call to check for updates
- The API, knowing that an updated SQLite database was generated in step 2, serves up this content to the mobile device.
- The mobile device replaces its existing database with the newer one it just received.
As is seen in the above overview, this pattern involves two actors outside of the mobile application – a batch process to generate SQLite databases, and an API to serve them up to the mobile device. The general idea behind this separation of concerns comes down to processing time. The generation of the SQLite database is assumed to take a non-trivial amount of time and to produce the same output for every mobile client. For these reasons, it makes more sense not to burden the API with generating this file. Instead, it is generated once and served upon request from the mobile application.
The batch process represents the bulk of the logic in this pattern. It must know the desired database schema of the mobile device, understand the database schema of the primary database, and know how to translate between them. Once completed it must then write the generated SQLite file to a centralized location like Amazon S3 so that it can be accessed by the API, presumably running on a different server from the batch process.
The synchronization logic in the mobile application is simplified in this pattern thanks in part to the work done by the batch process. By pre-generating the entire database file in the exact format the mobile client will need, the mobile client is then freed of most responsibilities other than the downloading and swapping out of files – though as we’ll discuss later, those responsibilities are not without their own pitfalls.
To better conceptualize this pattern, let's return to the campground finder app I briefly mentioned earlier in this book. Such an application would generally provide users with a read-only view of its data, with all users having the same access to that data, and with updates occurring only once a quarter. In other words, a perfect fit for this pattern!
For our example let’s start with something simple in terms of a data model – a listing of campgrounds, their ratings, reviews, photos, and other details.
The above schema diagram illustrates what this data model looks like in Postgres. Those same tables can be created with the following DDL:
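Such a schema might look like the following in Postgres. Note that the table and column names here are illustrative assumptions based on the data model described above, not the book's original listing:

```sql
-- Hypothetical Postgres schema for the campground example.
CREATE TABLE campground (
    id          SERIAL PRIMARY KEY,
    name        TEXT NOT NULL,
    description TEXT,
    latitude    DOUBLE PRECISION,
    longitude   DOUBLE PRECISION
);

CREATE TABLE review (
    id            SERIAL PRIMARY KEY,
    campground_id INTEGER NOT NULL REFERENCES campground (id),
    rating        INTEGER NOT NULL CHECK (rating BETWEEN 1 AND 5),
    body          TEXT,
    created_at    TIMESTAMPTZ NOT NULL DEFAULT now()
);

CREATE TABLE photo (
    id            SERIAL PRIMARY KEY,
    campground_id INTEGER NOT NULL REFERENCES campground (id),
    url           TEXT NOT NULL,
    caption       TEXT
);
```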
Luckily this example maps fairly easily to SQLite, as we’ve yet to use any of the advanced features of Postgres (like JSON storage for example). In a later chapter we’ll explore techniques for handling disparate system capabilities, but for now, our mapping is straightforward.
The schema for the server side of the application (listed above), can be generally mapped to SQLite as follows:
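A plausible SQLite rendering of such a schema follows, with comments noting where the Postgres types change (names again illustrative, not the book's originals):

```sql
-- SQLite versions of the same three tables.
-- SERIAL maps to INTEGER PRIMARY KEY (SQLite's rowid alias),
-- DOUBLE PRECISION maps to REAL, and TIMESTAMPTZ is stored as
-- ISO-8601 TEXT.
CREATE TABLE campground (
    id          INTEGER PRIMARY KEY,
    name        TEXT NOT NULL,
    description TEXT,
    latitude    REAL,
    longitude   REAL
);

CREATE TABLE review (
    id            INTEGER PRIMARY KEY,
    campground_id INTEGER NOT NULL REFERENCES campground (id),
    rating        INTEGER NOT NULL,
    body          TEXT,
    created_at    TEXT
);

CREATE TABLE photo (
    id            INTEGER PRIMARY KEY,
    campground_id INTEGER NOT NULL REFERENCES campground (id),
    url           TEXT NOT NULL,
    caption       TEXT
);
```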
To make this section easier to follow along with, here is some sample data. You’re welcome to skip this and use your own data or make up fake records as you see fit, but having something in these tables will make development and testing of this and future patterns all the easier.
Sample Data Download: sample_data.sql
With the schema in place and data inserted into those tables, we can focus on creating the database extraction logic. Knowing the schemas of both systems, and how they map to each other (as outlined in the previous table), is key for this step. Let's look at what an implementation of this process might involve.
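As a sketch of the extraction step, here is a minimal Python version. It assumes the rows have already been read from the primary Postgres database (for example via psycopg2) and focuses on generating the SQLite file; all table and column names are illustrative assumptions:

```python
import os
import sqlite3

# Illustrative client-side schema; names are assumptions.
SQLITE_SCHEMA = """
CREATE TABLE campground (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL
);
CREATE TABLE review (
    id            INTEGER PRIMARY KEY,
    campground_id INTEGER NOT NULL,
    rating        INTEGER NOT NULL,
    body          TEXT
);
CREATE TABLE photo (
    id            INTEGER PRIMARY KEY,
    campground_id INTEGER NOT NULL,
    url           TEXT NOT NULL
);
"""

def build_sqlite_db(path, campgrounds, reviews, photos):
    """Write a fresh SQLite file containing the supplied rows.

    In the real batch process the three row sequences would come from
    SELECTs against Postgres, reformatted and redacted as necessary.
    """
    if os.path.exists(path):
        os.remove(path)  # always start from a blank file

    conn = sqlite3.connect(path)
    try:
        # A blank file is a blank SQLite database, so creating the
        # schema is simply a matter of running the CREATE statements.
        conn.executescript(SQLITE_SCHEMA)
        conn.executemany("INSERT INTO campground VALUES (?, ?)", campgrounds)
        conn.executemany("INSERT INTO review VALUES (?, ?, ?, ?)", reviews)
        conn.executemany("INSERT INTO photo VALUES (?, ?, ?)", photos)
        conn.commit()
    finally:
        conn.close()
```

The resulting file can then be uploaded to a shared location (such as Amazon S3, as discussed above) for the API to serve.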
The above code might look obtuse if you're not familiar with the frameworks involved, but it's actually pretty straightforward once you become acquainted with them. Let's break it down into its various sections for further clarity.
First, we generate the three tables we’re working with inside a fresh SQLite database. The reason this works is that SQLite, unlike most other databases, requires no initial setup – a blank file is a blank SQLite database. So, by opening an SQLite connection to a file that does not exist, and running create statements, we can begin to populate the database schema.
For database schemas that are more complex than this, it might be worthwhile to take a slightly different approach. In those situations creating the entire schema in code might clutter the application, and distract developers from being able to more easily maintain the application. Instead, an SQLite database can be created manually, outside of the application, and already contain all the necessary schema, just without any data. In place of the above lines of code, the application would instead copy this database “template” to a new filename before opening an SQLite connection and forgo any DDL logic in code.
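A sketch of that alternative follows; the file names are assumptions:

```python
import shutil
import sqlite3

def open_db_from_template(template_path, output_path):
    """Copy a pre-built, empty SQLite "template" database into place
    and open a connection to the copy.

    The template already contains the full schema, so no DDL needs to
    live in application code."""
    shutil.copyfile(template_path, output_path)
    return sqlite3.connect(output_path)
```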
For the purposes of this example, however, it was easier to proceed this route due to the need to present this alternative and this example’s small schema.
With the schema in place, our next goal is to populate the data. This is accomplished by first extracting all the models from Postgres, then iterating over them and inserting them into SQLite. Since this involves loading the entire contents of the table into memory, it should only be performed on relatively small tables. For larger tables, some form of pagination is necessary so as to avoid keeping all the records in memory at once. Considering the limited scope of this pattern, it is perhaps best to avoid such complexities here and err on the side of simplicity.
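That said, for reference, such pagination can be as simple as batching rows through a database cursor, as this small sketch shows:

```python
def iter_in_batches(cursor, batch_size=500):
    """Yield rows from an already-executed cursor in fixed-size
    batches, so that only batch_size rows are held in memory at a
    time. Works with any DB-API cursor (psycopg2, sqlite3, etc.)."""
    while True:
        rows = cursor.fetchmany(batch_size)
        if not rows:
            break
        yield from rows
```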
Such a process is relatively unintelligent, as it regenerates everything from scratch without consideration for whether a record has actually changed. In the next chapter we'll start seeing techniques for how to perform differential synchronization, but for now, we will just have to be mindful to balance how frequently we run this process. Some use cases might necessitate updates as infrequently as yearly or quarterly, while others might prefer a monthly or weekly refresh of data. Great care must be taken with this pattern to balance the desire to keep content fresh with consideration for infrequent application users.
Updating too infrequently has obvious drawbacks, as changes will be delayed up to the number of days in the refresh cycle. In a scenario where the mobile dataset is updated once every two weeks, the worst-case scenario is that mobile users will have to wait two weeks before seeing an update. This can be mitigated by increasing the refresh frequency from once every two weeks to once every week, or even more frequently. At some point you run the risk of forcing your users to download the entire dataset on some interval when only a fraction of the data has even changed – or worse, nothing has changed.
Before we get too far ahead of ourselves with this consideration, let's first review how the API and mobile components of this process work.
The API in this pattern is fairly basic, as most of the heavy lifting has been offloaded to the previously discussed SQLite generation process.
The API takes a database file checksum that it will use to determine if the caller needs an updated database, or if they already have the latest version. Checking the database version with a checksum verification serves a dual purpose.
- To allow for easy database versioning – no additional information outside of an SQLite database file is necessary.
- If a database becomes corrupt, for whatever reason, an update will be triggered and hopefully resolve the situation.
When the API detects that the checksum the caller provided matches the current database file's checksum, it simply returns an HTTP 304 status code – short for "Not Modified" – and nothing else. The database file is not transferred to the caller, avoiding unnecessary bandwidth usage. The caller – the mobile application in our example – will need to check the HTTP status codes to determine what to do when this API call returns.
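The core of that server-side decision can be sketched as a plain function, independent of any web framework. The use of SHA-256 here is an assumption; any stable hash works so long as client and server agree on it:

```python
import hashlib

def file_checksum(path):
    """Compute a hex checksum of a file, reading in chunks so that
    large database files need not be loaded into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()

def handle_update_request(db_path, client_checksum):
    """Return an (http_status, body) pair for an update-check request.

    200 with the database bytes when the client is out of date (or
    supplied no checksum at all); 304 with no body when the client
    already has the current file."""
    if client_checksum == file_checksum(db_path):
        return 304, None
    with open(db_path, "rb") as f:
        return 200, f.read()
```

A real deployment would wrap this in a web framework route and likely stream the file rather than read it fully, but the checksum comparison is the essential logic.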
The mobile side of this pattern consists of a process to check if updates are available, and to apply them if they are. Since this pattern generates a whole SQLite database on the server, the mobile application only has to download and save the updated SQLite database in order to apply these changes. Let's look at that logic now:
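An abridged Objective-C sketch of such a Sync class follows. The file names (database.db, updated.db), the endpoint URL, and the helper internals are illustrative assumptions:

```objc
#import <Foundation/Foundation.h>

@interface Sync : NSObject
+ (instancetype)sharedInstance;
- (void)downloadUpdates;
- (void)applyOutstandingUpdates;
- (BOOL)isUpdatePending;
- (BOOL)hasPerformedInitialSync;
@end

@implementation Sync

// dispatch_once guarantees thread-safe, one-time initialization.
+ (instancetype)sharedInstance {
    static Sync *instance = nil;
    static dispatch_once_t onceToken;
    dispatch_once(&onceToken, ^{ instance = [[Sync alloc] init]; });
    return instance;
}

- (NSString *)databasePath {
    NSString *docs = [NSSearchPathForDirectoriesInDomains(
        NSDocumentDirectory, NSUserDomainMask, YES) firstObject];
    return [docs stringByAppendingPathComponent:@"database.db"];
}

- (NSString *)updatePath {
    NSString *docs = [NSSearchPathForDirectoriesInDomains(
        NSDocumentDirectory, NSUserDomainMask, YES) firstObject];
    return [docs stringByAppendingPathComponent:@"updated.db"];
}

- (BOOL)hasPerformedInitialSync {
    return [[NSFileManager defaultManager] fileExistsAtPath:[self databasePath]];
}

// A pending update is simply a downloaded database stored under an
// alternate file name, waiting to be swapped into place.
- (BOOL)isUpdatePending {
    return [[NSFileManager defaultManager] fileExistsAtPath:[self updatePath]];
}

- (void)applyOutstandingUpdates {
    dispatch_async(dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0), ^{
        if (![self isUpdatePending]) { return; }
        // Move updated.db over database.db, finalizing the update.
        // NOTE: assumes no connections to database.db are open.
        NSFileManager *fm = [NSFileManager defaultManager];
        [fm removeItemAtPath:[self databasePath] error:nil];
        [fm moveItemAtPath:[self updatePath]
                    toPath:[self databasePath]
                     error:nil];
    });
}

- (void)downloadUpdates {
    dispatch_async(dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0), ^{
        BOOL initialSync = ![self hasPerformedInitialSync];

        // Supply the checksum of the pending update if one exists,
        // else of the current database, else none (initial sync).
        NSString *target = [self isUpdatePending] ? [self updatePath]
                         : (initialSync ? nil : [self databasePath]);
        NSString *checksum = target ? [self checksumOfFileAtPath:target] : @"";

        NSString *urlString = [NSString stringWithFormat:
            @"https://api.example.com/database?checksum=%@", checksum];
        // A fuller implementation would use NSURLSession and inspect
        // the HTTP status: 200 carries a new database, while 304
        // (Not Modified) is intentionally ignored. Here a nil body
        // stands in for both 304 and failure.
        NSData *body = [NSData dataWithContentsOfURL:[NSURL URLWithString:urlString]];
        if (body == nil) { return; }

        // Initial sync writes straight to database.db; otherwise the
        // file is staged as updated.db so that an in-use database is
        // never overwritten.
        NSString *destination = initialSync ? [self databasePath] : [self updatePath];
        [body writeToFile:destination atomically:YES];
    });
}

- (NSString *)checksumOfFileAtPath:(NSString *)path {
    /* compute a stable hash of the file, e.g. via CommonCrypto */
    return @"";
}

@end
```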
In this example, we can see the two primary methods we said we needed – one for downloading the updates and one for applying them. There is also some boilerplate logic for ensuring that our sync helper is a singleton, a pattern that I have found works well for sync workers. We will not spend too much time discussing this, except to say that this pattern allows us to have a single instance of this class serve the whole application without the mess and clutter that static classes bring with them. In this example, so early in this book, it might be hard to see the value this provides, but as our sync worker becomes more complex later on, this is a pattern that will serve us well. Considering everything going on in the above code sample, it is worth analyzing its various components before moving on.
Starting at the bottom we can see two methods, isUpdatePending and hasPerformedInitialSync. These are two helper methods that assist in determining the state of the application sync. In this pattern the sync process can be in a number of states:
Initial sync necessary
This is the default state of the application. The application has recently been installed and has no local database. An initial sync must be performed before the application is usable.
No update pending
In this state, the application has completed its initial sync (see above), but has no pending update that is ready to be applied. This could be because the application already has the latest version of the database, is in the process of downloading an update, or possibly has not yet checked for updates. Regardless of the underlying scenario, there are currently no updates that are pending installation.
Update pending
The final state signifies that an updated database file has been successfully downloaded and is ready to be applied at the application's discretion. After the update is eventually applied, the sync process transitions back to the no update pending state.
The above two helper methods allow the application to determine which of those three states the sync process is in. The observant will probably notice the possibility of a fourth state, due to the usage of two boolean values to represent our three states. This is true in theory, but in practice it does not occur. This is illustrated in the below table:

hasPerformedInitialSync | isUpdatePending | Sync state
false | false | Initial sync necessary
true | false | No update pending
true | true | Update pending
false | true | (should never occur)
The fourth state – false for hasPerformedInitialSync but true for isUpdatePending – would represent that an update is available before the initial sync has been performed. Short of a bug in the application itself, this is a situation that should never actually occur. This is ensured by virtue of how the downloadUpdates method performs the initial sync, which we will look at soon. But first, let's look at the applyOutstandingUpdates method.
This method is responsible for, as its name would imply, applying any outstanding database updates. This function does not check whether the existing database is currently in use, only whether a newer database file exists. If a newer file exists, it moves it on top of the previous database, thus finalizing the update. One detail you can see in both this method and the previous isUpdatePending method is that pending updates are simply database files that have been given an alternate name. The downloadUpdates method is responsible for creating those files.
The first part of the downloadUpdates method pertains to computing the checksum that will be supplied to the API. As we discussed earlier in this chapter, the API compares the checksum supplied by the caller against the current database to determine whether to reply with the database in full or to short-circuit and reply with an HTTP 304 (Not Modified) status code. Because of this, the above code must supply either no checksum at all, the checksum of the current database, or the checksum of the most recently downloaded, but not yet applied, database. The rationale behind this last scenario – supplying the checksum of the most recently downloaded database – has to do with the three sync states discussed earlier, and the timing of database updates. For example, a database may already have been downloaded, but not yet applied, when the application goes to check for updates again. In this scenario the desired effect is to not download the same file again; however, what if a newer database does exist, or the previous download is corrupt? In those scenarios we would want the application to download the database again, which is why the above logic is structured as it is.
In the last section of the downloadUpdates method, we can see the logic that is actually responsible for making the API call and processing the result. The processing of the result is particularly noteworthy. In it we see that the application expects either an HTTP 200 (OK) or an HTTP 304 (Not Modified); anything else is ignored. Moreover, the logic that checks for HTTP 304 is procedurally unnecessary, existing simply for clarity and to document that the application is intentionally ignoring that return code.
Observant readers may have noticed the initialSync variable that was defined in the earlier snippet of the downloadUpdates method. In this second half of the method we can see how that variable is being used: when an initial sync occurs, the application must write the database out directly to the database.db file, thereby downloading and applying the update in a single step. In all other scenarios, the recently downloaded file must first be written to an intermediate file – updated.db – to avoid replacing the existing database while it is potentially in use.
In fact, the largest consideration in this entire process is ensuring that the existing database is not in use when attempting to replace it with a newer copy. We will have to be mindful when calling the above applyOutstandingUpdates function – calling it at the wrong time can cause corruption of the database itself or possibly application crashes – likely both. This is not something we want, so special considerations must be made.
There are generally two ways to ensure that a database can be safely replaced:
- Only replace it on application launch or shutdown – therefore ensuring users are not yet, or no longer, using the previous database
- Replace it while the application is running
The first option – to replace the database only when the application launches or stops – is the safest and easiest to implement, so we’ll look at that one first.
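Wiring this into the application lifecycle might look like the following sketch, which assumes a Sync singleton with the downloadUpdates and applyOutstandingUpdates methods described above (the delegate method names come from UIKit's UIApplicationDelegate):

```objc
#import <UIKit/UIKit.h>

@interface AppDelegate : UIResponder <UIApplicationDelegate>
@end

@implementation AppDelegate

- (BOOL)application:(UIApplication *)application
    didFinishLaunchingWithOptions:(NSDictionary *)launchOptions {
    // Kick off an update check at launch; any new database is staged
    // as a pending update rather than applied immediately.
    [[Sync sharedInstance] downloadUpdates];
    return YES;
}

- (void)applicationWillTerminate:(UIApplication *)application {
    // The UI is gone, so no database connections should remain open,
    // making this a safe moment to swap in a pending update.
    [[Sync sharedInstance] applyOutstandingUpdates];
}

@end
```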
In the above example, we see that the application checks for new updates during launch, and then applies them when it exits. Generally speaking, this pattern works well for ensuring that the database is not in use by the application while it's being updated. Unfortunately, as mobile applications have grown in their abilities, they have also come to run in the background for longer periods of time. This means that an application's exit event handler may not be called when the application goes into the background, and therefore the database update may not be applied for some indeterminate amount of time.
The solution to this is to allow the database to be updated while the application is running. The complexity of this approach varies greatly based on the dataset and the application itself. Generally, the following conditions greatly increase the complexity:
- Datasets where records can be deleted
- Datasets without primary keys
- Datasets with time-sensitive elements (e.g. effective dates, expiration dates on listings, etc)
In short, we are talking about situations where we can either not uniquely identify a row in the database from one version of the data to the next, or situations where records might go away (either logically or physically) between versions. Ensuring these conditions will not occur may be difficult or even impossible, and since we're replacing the whole database anyway, it is usually best to assume the worst and plan for those scenarios.
In this procedure, the most complex factor is ensuring that the database is not currently being used. Since the patterns discussed in this section revolve around replacing the entire SQLite database it is crucial to ensure that this is accomplished when there are no open connections to it.
There are many ways to ensure that no database connections are open, but perhaps the simplest way is to ask the user. By presenting the user with a notification that a database update is available, and asking them if they’d like to apply that now, you’re accomplishing two things. Firstly, you’re providing feedback to the user that new and updated data is available in the application, and secondly, you’re ensuring that the user is not currently on any application screens and therefore should not have any open connections to the database that cannot otherwise be closed.
Such a technique might not fit the type of application you're wanting to create, as you might prefer not to bother the user with a manual step. There are other techniques that could be employed to allow the database to be updated without intruding upon the user, such as:
- Locking the database to ensure atomic access, and therefore ensuring an opportune time to upgrade
- Tracking open connections, and upgrading when the connection count goes to zero
However, it is the author's opinion that such techniques are seldom worth the complexity. If your application needs anything more robust than replacing the database during startup/shutdown or via manual user request, perhaps your application itself is too complex for a pattern as simple as the one outlined in this chapter, and it would most likely benefit from one of the more robust syncing mechanisms discussed in later chapters.
With that summary, let's look at what changes are necessary to interactively prompt the user when updates are available, and to request their acceptance before applying an update. The process might look something like this:
- Check if an update is available, and download if so
- In a UI element display if an updated database is ready to be applied or not
- If updates are available, enable touch or click events on the UI element
- If updates are not available, disable touch or click events
- When a user touches or clicks to apply updates swap out the database file for the updated one.
The largest difference between the above outline and the process we have already seen is the aspect of user acceptance. In the last implementation the applyOutstandingUpdates method was called automatically on application exit; now we will tie that method call to a user interface event.
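A sketch of that wiring follows; the control and method names here are assumptions:

```objc
// Hypothetical view controller actions: the update is applied only
// when the user explicitly taps the "apply update" button, which is
// enabled only while an update is pending.
- (IBAction)applyUpdateTapped:(id)sender {
    [[Sync sharedInstance] applyOutstandingUpdates];
    self.applyUpdateButton.enabled = NO;
    self.updateLabel.text = @"Applying update…";
}

- (void)refreshUpdateBadge {
    BOOL pending = [[Sync sharedInstance] isUpdatePending];
    self.applyUpdateButton.enabled = pending;
    self.updateLabel.text = pending ? @"Update ready to install"
                                    : @"You're up to date";
}
```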
This can be accomplished by moving the call to apply the outstanding update to a UI event, as is seen in the above example. This small change has a huge impact on the usability and viability of this pattern for modern mobile applications, in that it allows users to receive critical feedback as updates are occurring. The only detail remaining is concurrency.
I almost did not include this section in this chapter, preferring instead to defer the topic to a later chapter, but eventually decided against that. You cannot avoid the discussion of concurrency in modern programming, and data synchronization is no exception. Even simple patterns like those discussed in this chapter are prone to breaking if concurrency is not taken into consideration.
Take for example the simple scenario outlined above, where a user clicks or taps a button to accept and apply an outstanding database update. What happens if a user were to tap the apply button twice? Or, how would the user know when the database has successfully been applied?
These are questions regarding concurrency, a theme that will recur many times throughout this book. Even though we have not discussed it, the above Objective-C code examples have already taken some steps to address concurrency – though not completely. The example code makes use of Grand Central Dispatch – GCD for short – a framework that makes task concurrency easier. GCD is an Apple technology, and makes sense for use in our Objective-C examples; however, for non-Apple platforms, an alternative solution will be required.1
The singleton instance of the Sync class is one example of where we have already addressed concurrency concerns by utilizing GCD. The use of a static method and variable, in combination with a dispatch_once GCD call, provides us with a thread-safe singleton and thus avoids any race conditions that might otherwise occur during its initialization. What has not been addressed is the concurrency of the two primary methods in the Sync class – downloadUpdates and applyOutstandingUpdates – specifically, scenarios such as those two methods running at the same time, or two calls to one of them running concurrently. Either scenario needs to be avoided; otherwise, we risk creating inefficiencies or even bugs in the application.
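A dispatch of work onto one of GCD's global concurrent queues looks like this:

```objc
dispatch_async(dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0), ^{
    // time-consuming download or file work happens here,
    // safely off the UI thread
});
```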
Attentive readers may have noticed the above line of code in either the applyOutstandingUpdates or the downloadUpdates method. With it, we are requesting that some unit of work be performed in an asynchronous fashion using GCD. This was done primarily out of best practice for mobile applications in general, as the bulk of the logic in either of those methods is time-consuming and certainly not something you'd want blocking the UI thread. Additionally, it was set up this way knowing that concurrency would need to be addressed and that GCD provides several mechanisms to assist in that endeavor – specifically, serial dispatch queues.
Grand Central Dispatch is a topic well beyond the scope of this book, but for our purposes, it is sufficient to understand that a dispatch queue is a primary resource for executing tasks, and by extension is our primary means of controlling concurrency. The above-referenced line of code dispatches an asynchronous task on a global queue, which happens to allow for parallel execution of tasks. In short, two calls to applyOutstandingUpdates, for example, could execute in parallel. Fortunately, GCD provides a second type of dispatch queue for serial execution of tasks. To make use of this we will first need to create a task queue of our own:
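Creating such a queue might look like the following (the reverse-DNS queue label is an arbitrary assumption):

```objc
- (instancetype)init {
    if ((self = [super init])) {
        // A serial queue executes its tasks one at a time, in FIFO
        // order – exactly the property we need for sync work.
        _syncQueue = dispatch_queue_create("com.example.sync",
                                           DISPATCH_QUEUE_SERIAL);
    }
    return self;
}
```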
In this excerpt, we are creating a serial GCD queue and keeping a reference to it as an instance variable in the Sync class. The final step is to modify downloadUpdates and applyOutstandingUpdates to utilize this queue, substituting the dispatch onto the global concurrent queue for a dispatch onto our new serial queue.
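Concretely, that substitution might look like this (block bodies elided; the queue variable name is an assumption):

```objc
// Before: a global concurrent queue – two sync tasks could overlap
dispatch_async(dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0), ^{
    /* … */
});

// After: our private serial queue – sync tasks run one at a time
dispatch_async(_syncQueue, ^{
    /* … */
});
```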
The application is now better suited to handle concurrency, now that the changes discussed in this section have been implemented. As we progress throughout this book concurrency is a topic that will be revisited, and in doing so we will expand upon the tools we use to manage it.
1 – While it is not technically true that GCD is unavailable on non-Apple platforms, it is unavailable as a practical matter. For those who are interested, libdispatch is the library that provides the underlying implementation of GCD, and it is available on non-Apple platforms such as FreeBSD.
In this chapter, we've seen how to implement the whole-database synchronization pattern, and even discussed how to determine whether this pattern is a good fit for a given situation. It is the author's opinion that this pattern is very important, if only for the purpose of serving as a foundation from which we can better understand and appreciate more advanced synchronization patterns.
It should be noted that this pattern violates the principle of separation of concerns by embedding knowledge of the mobile application's implementation and needs into the server-side process that generates the SQLite database. From an architectural standpoint, this is certainly not ideal. However, many applications are limited enough in scope that the benefit of a quicker implementation more than makes up for this lapse in sound architectural design.
Another potential downside to this pattern is that, unlike the patterns we'll see next, it does not directly lend itself to iterative improvement. That is to say, when an application has outgrown this pattern, it will generally have to re-implement all of its data synchronization logic to move to a different pattern, as there are no direct and logical segues to a more advanced one.