Challenges with very large databases

2019, Dec 12    

We as DBAs always like to work with very large databases (VLDBs) because they give us a lot of experience and make our resumes look great too. :) But in reality, managing VLDBs has tremendous challenges. Recently I was asked to provide some points on this topic.

As of this writing, the latest version of SQL Server released to RTM is SQL Server 2019.

These points are given for SQL Server, but in general they apply to any database system.

We need to scale a database when it accumulates a large amount of data in order to serve the applications efficiently. There are two ways to scale a database system:

  1. Horizontal scaling, also called scaling out
  2. Vertical scaling, also called scaling up

SQL Server natively supports vertical scaling only, meaning we have to add more CPU and memory and build more powerful machines; that is why large databases require powerful hardware to meet their performance expectations. However, there is a limit to vertical scaling. That is the reason modern databases are being built to support scale-out (horizontal scaling) methods. Even Azure SQL Database supports horizontal scaling, but SQL Server does not.

What are the challenges with large databases?

  • Increased recovery time. In the event of a disaster, it takes more time to recover the database from backups.
  • Database maintenance tasks, such as index rebuilds, index reorganization, and statistics updates, become more difficult (a sketch of these commands follows this list). These maintenance tasks are necessary to keep database performance at a high level.
  • Increased database backup times. As part of a Disaster Recovery (DR) plan, any database system needs to be backed up. SQL Server has three different backup types: full, differential, and transaction log backups (see the backup sketch after this list). As the database grows, full and differential backups take longer. Taking backups does not block database activity, but it consumes resources, especially CPU.
  • Index rebuilds become almost impossible for very large tables because of the time they take and the blocking they create. SQL Server has an online rebuild option, but that comes with a penalty in tempdb usage. When indexes cannot be maintained, we start to notice poor query performance.
  • Even updating statistics with a full scan takes hours for very large tables (a sketch follows this list). Updating statistics is a less resource-intensive operation, but it creates blocking and therefore impacts application performance. For this reason we have to find a suitable maintenance window for the operation. Maintenance windows are becoming very strict for business reasons, so we can't just take several hours of maintenance window for the DB stuff.
  • If the database in question participates in transactional replication, then in the event of a replication rebuild it takes ages to transfer the initial data set to the subscriber. We can use the backup/restore method at the subscriber as an alternative way to initialize it (a sketch follows this list); however, it can take days just to copy the backup to the destination server, and the restore itself takes hours. For this reason, there will be a high business impact during a replication rebuild.
  • Creating new indexes on very large tables takes time and is resource-intensive. There are certain situations where we need to introduce new indexes to improve query performance (see the CREATE INDEX example in the sketch after this list), but that requires careful planning and a suitable time window to minimize the impact on the applications.
  • Table-level schema changes, such as adding a new column or changing a data type, take a lot of time and need careful planning for very large tables. We also need to manage the transaction log during such an operation (see the batching sketch after this list), which requires a lot of effort and attention from DBAs.
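
Here is a minimal sketch of the index maintenance commands mentioned above, against a hypothetical dbo.Orders table (the table and index names are assumptions, not from any real system). Note that ONLINE = ON requires Enterprise Edition:

```sql
-- Online rebuild: the table stays available, but the sort work is pushed
-- into tempdb (the tempdb penalty mentioned above).
ALTER INDEX IX_Orders_OrderDate ON dbo.Orders
REBUILD WITH (ONLINE = ON, SORT_IN_TEMPDB = ON);

-- Reorganize: always online and lighter-weight, but it only compacts pages;
-- it does not rebuild the index from scratch or update its statistics.
ALTER INDEX IX_Orders_OrderDate ON dbo.Orders
REORGANIZE;

-- Creating a brand-new index online; still long-running and resource-intensive
-- on a very large table, so it needs a planned window.
CREATE NONCLUSTERED INDEX IX_Orders_CustomerId
ON dbo.Orders (CustomerId)
WITH (ONLINE = ON, SORT_IN_TEMPDB = ON);
```

On SQL Server 2017 and later, an online rebuild can also be made resumable (RESUMABLE = ON with a MAX_DURATION), which helps fit index maintenance into short maintenance windows.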
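And here is a sketch of the three backup types; the database name and file paths are hypothetical, and COMPRESSION is an assumption worth making on a VLDB because it shrinks both the backup size and the time the backup takes:

```sql
-- Full backup: the whole database; the longest-running type on a VLDB.
BACKUP DATABASE Sales
TO DISK = N'X:\Backup\Sales_full.bak'
WITH COMPRESSION, CHECKSUM, STATS = 5;  -- STATS prints progress every 5%

-- Differential backup: only the extents changed since the last full backup.
BACKUP DATABASE Sales
TO DISK = N'X:\Backup\Sales_diff.bak'
WITH DIFFERENTIAL, COMPRESSION, CHECKSUM, STATS = 5;

-- Transaction log backup: needed for point-in-time recovery.
BACKUP LOG Sales
TO DISK = N'X:\Backup\Sales_log.trn'
WITH COMPRESSION, CHECKSUM, STATS = 5;
```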
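For statistics, the trade-off between a full scan and a sample looks like this (same hypothetical table):

```sql
-- Full scan reads every row, so this can run for hours on a very large table.
UPDATE STATISTICS dbo.Orders WITH FULLSCAN;

-- A sampled update is far cheaper and is often good enough on huge tables.
UPDATE STATISTICS dbo.Orders WITH SAMPLE 10 PERCENT;
```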
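For the replication point, initializing a subscriber from a backup instead of a snapshot is specified when the subscription is created. A rough sketch follows; the publication, server, and path names are all hypothetical, and the publication itself must have been created with @allow_initialize_from_backup = N'true':

```sql
-- Run at the publisher, in the publication database, after the publication
-- database backup has been restored at the subscriber.
EXEC sp_addsubscription
    @publication      = N'SalesPub',
    @subscriber       = N'SUBSCRIBER01',
    @destination_db   = N'Sales',
    @sync_type        = N'initialize with backup',
    @backupdevicetype = N'disk',
    @backupdevicename = N'X:\Backup\Sales_full.bak';
```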
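Finally, for large schema and data changes, the standard way to keep the transaction log under control is to batch the work so the log can be backed up (or truncated) between batches instead of growing without bound. A minimal sketch, assuming a hypothetical backfill of a newly added column:

```sql
-- Adding a NULLable column is a metadata-only change and completes instantly;
-- populating it is the expensive part, so do that in small batches.
ALTER TABLE dbo.Orders ADD Region varchar(10) NULL;

DECLARE @rows int = 1;
WHILE @rows > 0
BEGIN
    -- Each batch commits on its own, so log space can be reused between
    -- batches (after a log backup in the FULL recovery model).
    UPDATE TOP (50000) dbo.Orders
    SET Region = 'NA'
    WHERE Region IS NULL;

    SET @rows = @@ROWCOUNT;
END;
```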

All the points stated above are the challenges that I see with very large databases.

Some people argue that we can use table partitioning to manage VLDBs. However, that does not make the challenges stated above go away.

Here is why table partitioning is not the solution: table partitioning works at the storage level only. The compute and memory used for a partitioned table stay the same; it still has to use the same machine's (host's) memory and CPU, plus other resources such as network, to process the partitioned data. Partitioning data can help improve query performance, because a query will then access only a subset of the data (assuming the partitioning is done accurately), but it does not solve the maintenance nightmare we face with data growth.

More to the point, table partitioning is not a scaling methodology. An actual scale-out solution partitions the data as well as the other resources, including CPU, memory, network, etc., by adding additional nodes. A very good example is a Cassandra cluster: we can keep adding nodes to the cluster as data grows, and each node of the cluster is assigned a portion of the data, so that data portion has its own compute, memory, and other resources. Hope that's clear.
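
To make the distinction concrete, here is a minimal sketch of SQL Server table partitioning (the function, scheme, and table names are hypothetical). Notice that everything still lives on the same host; only the storage layout changes:

```sql
-- Route rows to partitions by year, based on OrderDate.
CREATE PARTITION FUNCTION pfOrderDate (date)
AS RANGE RIGHT FOR VALUES ('2018-01-01', '2019-01-01', '2020-01-01');

-- Map every partition to a filegroup (all on this one server).
CREATE PARTITION SCHEME psOrderDate
AS PARTITION pfOrderDate ALL TO ([PRIMARY]);

CREATE TABLE dbo.Orders
(
    OrderId   bigint        NOT NULL,
    OrderDate date          NOT NULL,
    Amount    decimal(18,2) NOT NULL,
    CONSTRAINT PK_Orders PRIMARY KEY CLUSTERED (OrderId, OrderDate)
)
ON psOrderDate (OrderDate);
```

A query that filters on OrderDate touches only the relevant partitions (partition elimination), which is the query-performance benefit described above, but the CPU, memory, and network all still come from this single machine, unlike in a scale-out cluster such as Cassandra.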

SQL Server has another feature called Stretch Database. This feature helps greatly here because you can offload a subset of data (cold data) to the Azure cloud. As long as you're not frequently using the data offloaded to the cloud, you can deal with only the operational data set, which is great.
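
At a high level, enabling Stretch Database looks like the sketch below; the server, credential, and table names are hypothetical, and there are additional prerequisites (such as a database master key) covered in the documentation:

```sql
-- One-time instance-level setting.
EXEC sp_configure 'remote data archive', 1;
RECONFIGURE;

-- Enable Stretch on the database, pointing it at an Azure server.
ALTER DATABASE Sales
SET REMOTE_DATA_ARCHIVE = ON
    (SERVER = N'mystretchserver.database.windows.net',  -- hypothetical server
     CREDENTIAL = [StretchCredential]);                 -- hypothetical credential

-- Start migrating the rows of a cold table to Azure.
ALTER TABLE dbo.OrdersHistory
SET (REMOTE_DATA_ARCHIVE = ON (MIGRATION_STATE = OUTBOUND));
```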