Introduction
Last week I had an opportunity to write a SQL script to update certain data and fix a data issue. The volume of the impacted data made it an interesting problem to approach. Working in a highly visible, zero-downtime environment, I did not want this fix to become yet another one-off DB update.
Instead, I set out to create a reusable, configurable, repeatable, test-intensive, high-performing package that can be referred to (if not directly re-used) for any such issue in the future.
Design
The approach can be summarized in the following steps:
1. Load the record identifiers of the impacted data set into a temporary table. For example, if you need to update an employee table, load the impacted employee_id values into the temporary table, assuming employee_id is the record identifier (primary key) of the employee table.
2. Create a SQL block (or stored procedure) that loops through the temporary table records via a cursor and processes the actual updates as required.
3. Execute the block or stored procedure (created in the above step) in parallel with multiple threads.
4. Commit the data periodically so that if an exception occurs in the middle, we don't end up rolling back everything.
5. Devise extensive pre- and post-validation scripts for testing.
The next few sections elaborate on these items.
Temporary Tables
By preloading the impacted data into temporary tables, we can effectively achieve the following:
1. We can work precisely with the exact set of data that will be impacted, with no need to keep running expensive queries against the original tables. For example, if we need to work on 1 million records in a 10 million row table, the temporary table will hold just the 1 million rows, which is substantially cheaper to query than the original table.
2. Add a status column to track the processing as well as to add re-run capability.
3. Add a THREAD_ID column, populated through MOD(rownum, THREAD_COUNT), to logically partition the data. This enables parallel execution, where each thread works only on the records matching its THREAD_ID (see the sketch after this list).
4. Distributed execution - once the temporary data is loaded, we can slice and dice the actual update based on the temp table data and redistribute the work across all available nodes.
5. Pre and post validations become much easier to execute since we capture the status column on the temp table. We can also store additional pre-processing (before update) and post-processing (after update) data on the temp table for further validations.
6. There is, however, one caveat with the staged load. If the data is very dynamic in nature (thousands of updates per minute or more), we may be working with stale data instead of the latest values. This risk can be mitigated by re-checking the impacted values of each record just before the actual update.
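A minimal sketch of such a staging load, assuming Oracle-style SQL and hypothetical names (EMPLOYEE, TMP_EMPLOYEE_FIX, a negative SALARY as the data issue, and a THREAD_COUNT of 8); adapt the names and the WHERE clause to your own fix:

CREATE TABLE TMP_EMPLOYEE_FIX AS
SELECT e.EMPLOYEE_ID,
       MOD(ROWNUM, 8)                  AS THREAD_ID,   -- logical partition: 8 = THREAD_COUNT
       CAST('PENDING' AS VARCHAR2(10)) AS STATUS,      -- PENDING / COMPLETED / FAILED, drives re-runs
       e.SALARY                        AS OLD_SALARY   -- pre-processing snapshot for validation
  FROM EMPLOYEE e
 WHERE e.SALARY < 0;                                   -- whatever identifies the impacted rows

An index on (THREAD_ID, STATUS) keeps the per-thread cursors cheap.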
Export / Import
Many prominent DBs allow exporting a select query to a CSV file as well as importing data from a CSV file into a table. This provides two main benefits:
1. The export and import processes are independent, thus reducing resource contention.
2. The export can happen on a READ ONLY replica - if the select query is complex and costly, this is very handy as it will not impact other systems.
We need to make sure the character sets (encoding) are the same between the export and the import; otherwise we can corrupt the textual data in transit.
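As an illustration, if the databases involved happen to be PostgreSQL, the COPY command covers both directions. This is only a hedged sketch with the hypothetical names used earlier and a server-side file path (which requires the appropriate file-access privilege), not a prescription for any particular DB:

-- On the read-only replica: export the impacted identifiers
COPY (SELECT EMPLOYEE_ID FROM EMPLOYEE WHERE SALARY < 0)
  TO '/tmp/impacted_ids.csv' WITH (FORMAT csv, HEADER, ENCODING 'UTF8');

-- On the target database: import into the staging table
COPY TMP_EMPLOYEE_FIX (EMPLOYEE_ID)
  FROM '/tmp/impacted_ids.csv' WITH (FORMAT csv, HEADER, ENCODING 'UTF8');

Pinning the ENCODING on both sides is what protects the textual data in transit.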
Parallel Processing (Multi Threading)
Single-threaded processing does not make sense for any update of reasonable volume. With multiple threads we not only take advantage of modern multi-core processors but also limit the impact on online DBs, where everyone is striving for 99.99% availability. To do a parallel update, we basically need to write a script that spins up multiple threads, each performing the steps below:
1. Open Connection
2. Select which data set to work on - with THREAD_ID isolation, each thread can work on an independent set of data.
3. Execute the actual update (either through an anonymous block or a stored procedure).
4. Update the status in the temporary/batch processing table.
5. Close Connection
Wait for all the threads to complete and then exit the process. Opening an individual connection for each thread is critical here.
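One way to get a separate session (and hence a separate connection) per thread without an external driver script is the database's job scheduler. The sketch below is Oracle-flavored and purely illustrative; DBMS_SCHEDULER is Oracle-specific, and FIX_EMPLOYEE_BATCH is a hypothetical worker procedure sketched after the next paragraph:

BEGIN
  FOR i IN 0 .. 7 LOOP                         -- one job per THREAD_ID, 8 = THREAD_COUNT
    DBMS_SCHEDULER.CREATE_JOB(
      job_name   => 'EMP_FIX_THREAD_' || i,
      job_type   => 'PLSQL_BLOCK',
      job_action => 'BEGIN FIX_EMPLOYEE_BATCH(' || i || '); END;',
      enabled    => TRUE,                       -- starts immediately in its own session
      auto_drop  => TRUE);
  END LOOP;
END;
/

Progress can then be watched through the STATUS column on the temp table (or the scheduler's own job views) while the jobs run.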
In some DBs (DB2, for example), passing input parameters to anonymous SQL blocks is not possible, so we need to use a stored procedure in those cases. A stored procedure also makes the code reusable. We can even write a standard skeleton stored procedure that takes the actual SQL command as an input parameter and runs the dynamic SQL through EXECUTE IMMEDIATE.
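Here is a minimal sketch of such a worker procedure, again Oracle-flavored PL/SQL against the hypothetical TMP_EMPLOYEE_FIX staging table from earlier; it combines the cursor loop, the status tracking, and the periodic commit described above:

CREATE OR REPLACE PROCEDURE FIX_EMPLOYEE_BATCH(p_thread_id IN NUMBER) AS
  v_done NUMBER := 0;
BEGIN
  FOR rec IN (SELECT EMPLOYEE_ID
                FROM TMP_EMPLOYEE_FIX
               WHERE THREAD_ID = p_thread_id
                 AND STATUS = 'PENDING') LOOP
    BEGIN
      -- the actual fix; re-check the impacted value to guard against stale staging data
      UPDATE EMPLOYEE
         SET SALARY = 0
       WHERE EMPLOYEE_ID = rec.EMPLOYEE_ID
         AND SALARY < 0;

      UPDATE TMP_EMPLOYEE_FIX
         SET STATUS = 'COMPLETED'
       WHERE EMPLOYEE_ID = rec.EMPLOYEE_ID;
    EXCEPTION
      WHEN OTHERS THEN
        UPDATE TMP_EMPLOYEE_FIX
           SET STATUS = 'FAILED'
         WHERE EMPLOYEE_ID = rec.EMPLOYEE_ID;
    END;

    v_done := v_done + 1;
    IF MOD(v_done, 1000) = 0 THEN
      COMMIT;                 -- periodic commit so a failure never rolls back everything
    END IF;
  END LOOP;
  COMMIT;                     -- final commit for the tail of the batch
END FIX_EMPLOYEE_BATCH;
/

The dynamic-SQL variant would accept the UPDATE statement as a parameter and run it per record with EXECUTE IMMEDIATE instead of the hard-coded UPDATE above.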
Testing: Pre and Post Validation Scripts
Last but not least, as with any other program, testing is all the more essential in batch updates. In particular, since batch updates may take long hours to complete, we should be able to monitor the progress, and to kill and re-run the job as required at any point in time. I would recommend writing these scripts in unit-test fashion, with assertions and expected outputs wherever possible.
For example, let's say we track the batch updates through a status column on the temporary table; we can then write SQLs to track the counts and decipher the progress from them.
SELECT
COUNT(*) AS TOTAL_COMPLETED,
COUNT(*) * 100 / (SELECT COUNT(*) FROM TEMP_TABLE_STATUS_TRACKER) AS PCT_COMPLETED
FROM
TEMP_TABLE_STATUS_TRACKER
WHERE
STATUS = 'COMPLETED'
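The post-validation can follow the same assertion style. As an illustrative sketch (reusing the hypothetical names from the earlier staging-table example), here is a check that should return zero once the fix is complete:

SELECT COUNT(*) AS STILL_BROKEN
  FROM TMP_EMPLOYEE_FIX t
  JOIN EMPLOYEE e ON e.EMPLOYEE_ID = t.EMPLOYEE_ID
 WHERE t.STATUS = 'COMPLETED'
   AND e.SALARY < 0;   -- the original data issue must no longer be present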
It is also essential to perform volume testing in a prod-like environment before implementation.
Implementation Plan
The success of any project execution depends highly on the availability of a detailed implementation plan. Each step should be written with the person who will execute it (DBA, DevOps, or Production Support) in mind. The steps should also cover how to execute the pre and post validations, how to monitor the progress if required, and all the necessary backout/rollback plans. It is critically important to have the steps peer reviewed and, ideally, executed as-is by the intended implementer in the lower environments.
Conclusion
Although it seemed to be "just another update" on a table, I learned quite a lot about many intricacies of the system, especially:
1. Handling huge-volume updates.
2. Faster processing with a multi-threaded program.
3. Coding for failures (Thorough Exception and Error Handling).
4. Well defined monitoring and validations.
5. Data Model and System Domain Knowledge.
6. Multiple rounds of reviews with peers, SMEs and DBAs. Be humble enough to learn.
7. What it takes to execute the plan in production in a highly regulated environment.
Above all, I hope this blog shows how to write up a design pattern in plain English without delving into too many technical specifics. Hahaha!!!