Replication and Sharding

The database is so critical that a system's performance often depends on it. For example, if the database has low latency and high throughput, the resulting system can have low latency and high throughput too. Similarly, a database with high latency and low throughput tends to drag the whole system down to the same level. This is where replication and sharding come into the picture. Let's start with replication:

Replication

Let's say that we have a main database which has unfortunately gone down. How can we guard against this? We can make a replica of the main database, which acts as a standby for it. Even though all the reads and writes happen on the main database, every write operation is also applied to the replica in a synchronous way. The idea is that the replica takes over when the main database goes down. When the main database comes back up again, the replica can bring it up to date, the two can swap roles, and the replica can go back to being the standby.
The most important thing is that the replica is always updated synchronously whenever there is a write operation on the main database. If the write operation fails on the replica, it should not complete on the main database either. This improves the availability of the database in your system.
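
The rule above (a write completes only if both copies accept it) can be sketched roughly as follows. This is a toy in-memory model, not how any particular database implements it; the names `Database`, `SyncReplicatedStore`, and `NodeDown` are made up for illustration, and the rollback is a simplification of what real systems do with commit protocols.

```python
class NodeDown(Exception):
    """Raised when a node cannot accept a write (hypothetical error type)."""


class Database:
    """Toy in-memory key-value store standing in for one database node."""

    def __init__(self, name):
        self.name = name
        self.data = {}
        self.healthy = True

    def write(self, key, value):
        if not self.healthy:
            raise NodeDown(f"{self.name} is down")
        self.data[key] = value


class SyncReplicatedStore:
    """Acknowledges a write only when main and replica have both applied it."""

    def __init__(self, main, replica):
        self.main = main
        self.replica = replica

    def write(self, key, value):
        # Remember the previous state so the main can be rolled back.
        had_key = key in self.main.data
        old_value = self.main.data.get(key)
        self.main.write(key, value)
        try:
            self.replica.write(key, value)
        except NodeDown:
            # The replica rejected the write, so undo it on the main too.
            if had_key:
                self.main.data[key] = old_value
            else:
                del self.main.data[key]
            raise

    def failover(self):
        # Swap roles: the standby becomes the main and vice versa.
        self.main, self.replica = self.replica, self.main


store = SyncReplicatedStore(Database("main"), Database("replica"))
store.write("user:1", "alice")  # applied on both nodes
store.failover()                # the replica now serves as the main
```

Because every acknowledged write exists on both nodes, the standby can take over without losing data. The price is that each write waits for two nodes instead of one.
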
Another interesting use case of replication is when the system (say LinkedIn) has lots of users in two different countries, the US and India, which are in two different geographical zones. When a user publishes a post on LinkedIn, followers from both countries should be able to see it. Assuming the author is in the US region, users in the US region can view the post with very low latency, but users in India may have to wait longer because their requests travel to the US. To avoid this situation we can use database replication with two databases, one in the US (the main database) and a second one in India. The database in India is synced asynchronously (say every 5 or 10 minutes) with the main database in the US, so users in India read from a nearby database and see very low latency. This only works when an asynchronous sync is acceptable for that part of the system. For LinkedIn posts an asynchronous sync might work, but for, let's say, a stock broker application we may need synchronous updates.
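
The asynchronous setup described above can be sketched with a write log that the remote replica replays from time to time. This is only an illustration under simple assumptions: `Primary` and `AsyncReplica` are hypothetical names, everything lives in memory, and `sync()` is called by hand instead of by a timer running every few minutes.

```python
class Primary:
    """Main database (the US region in the example above)."""

    def __init__(self):
        self.data = {}
        self.log = []  # ordered record of every write: (key, value)

    def write(self, key, value):
        self.data[key] = value
        self.log.append((key, value))


class AsyncReplica:
    """Remote copy (India in the example) that catches up periodically."""

    def __init__(self, primary):
        self.primary = primary
        self.data = {}
        self.applied = 0  # how many log entries have been replayed so far

    def read(self, key):
        # Reads are served locally, so they are fast but may be stale.
        return self.data.get(key)

    def sync(self):
        # Replay only the writes that arrived since the last sync.
        pending = self.primary.log[self.applied:]
        for key, value in pending:
            self.data[key] = value
        self.applied += len(pending)
        return len(pending)


us = Primary()
india = AsyncReplica(us)

us.write("post:1", "Hello from the US")
stale = india.read("post:1")   # None: the replica has not synced yet
india.sync()
fresh = india.read("post:1")   # now the post is visible locally
```

The window between the write and the next `sync()` is exactly the staleness that is fine for posts but not for something like stock trades.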

Sharding

Imagine you have a system which has one main database. The main database is receiving thousands or millions of requests and is getting overloaded. It has become a bottleneck and the throughput is too low. You can always scale vertically, but there is only so much you can do with vertical scaling. Another option would be horizontal scaling, wherein you add more database servers (replicas of the main database). But when the system receives millions of requests and handles tons of data (e.g. Facebook), it is not optimal to replicate the whole main database under such a high workload. One solution is to split up the data, so that one part of the data is stored on one database server and another part on another. The main database is split into several smaller databases called shards or data partitions, and this increases your throughput. This raises the question of how to access the data, that is, which data to store on which shard.
- You can split based on geography: say, requests from the US go to shard1 and requests from India go to shard2.
- This can also produce hotspots, because some parts of the world may have many more users than others.
- Or you can shard based on customer names: customers whose names start with the letters A-F are stored in shard1, G-M in shard2, and so on.
- Such a sharding strategy is not recommended because it can create hotspots where certain shards get more data than others. For example, a shard holding the letters X-Z may end up with far less data than the shard holding A-F.
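
To make the hotspot problem with name-based sharding concrete, here is a small sketch. The A-F and G-M ranges come from the list above; the remaining two ranges and the customer names are invented for the example.

```python
from collections import Counter

# Letter ranges per shard. A-F and G-M come from the text above;
# the last two ranges are assumed here to cover the rest of the alphabet.
RANGES = [
    ("A", "F", "shard1"),
    ("G", "M", "shard2"),
    ("N", "S", "shard3"),
    ("T", "Z", "shard4"),
]


def shard_for_name(name):
    """Pick a shard from the first letter of the customer name."""
    first = name[0].upper()
    for low, high, shard in RANGES:
        if low <= first <= high:
            return shard
    raise ValueError(f"no shard covers names starting with {first!r}")


# Hypothetical customer names, skewed toward early letters on purpose.
customers = ["Alice", "Adam", "Bob", "Carol", "Dave",
             "Eve", "Frank", "Grace", "Mallory", "Zed"]

load = Counter(shard_for_name(name) for name in customers)
print(load)
```

With this sample, shard1 ends up with 7 of the 10 customers while shard3 gets none, which is the kind of imbalance the list above warns about.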

To solve the problem we can split up the data based on a hash function, one that promises uniformity, to determine which shard each piece of data is written to and read from. You cannot change the hash function afterwards, because it determines which shard the data lives on, and changing it would change that mapping as well. This is where consistent hashing helps: when adding another shard, it minimizes the number of pieces of data that we have to migrate from the other shards to the new shard. But if one of the existing database shards goes down, consistent hashing will not help. In that case we need a replica of the shard to get the database back up. In the end, it is up to the designer to find a suitable way to split up the data, so put some thought into it.
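
As a rough illustration of why consistent hashing moves less data than plain `hash % number_of_shards` when a shard is added, here is a minimal sketch. The ring uses MD5 only as a stable hash, and 100 virtual nodes per shard is an arbitrary choice; `ConsistentHashRing`, `modulo_shard`, and the key names are all made up for the example.

```python
import bisect
import hashlib


def stable_hash(value):
    """Deterministic hash; Python's built-in hash() for str varies per process."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class ConsistentHashRing:
    def __init__(self, shards, vnodes=100):
        self.vnodes = vnodes   # virtual nodes per shard smooth out the load
        self._points = []      # sorted positions on the ring
        self._owner = {}       # ring position -> shard name
        for shard in shards:
            self.add_shard(shard)

    def add_shard(self, shard):
        for i in range(self.vnodes):
            point = stable_hash(f"{shard}#{i}")
            bisect.insort(self._points, point)
            self._owner[point] = shard

    def shard_for(self, key):
        # Walk clockwise to the first ring position after the key's hash,
        # wrapping around to the start of the ring if necessary.
        idx = bisect.bisect_right(self._points, stable_hash(key))
        return self._owner[self._points[idx % len(self._points)]]


def modulo_shard(key, shards):
    """Naive alternative: hash the key and take it modulo the shard count."""
    return shards[stable_hash(key) % len(shards)]


keys = [f"user{i}" for i in range(10000)]
shards = ["shard1", "shard2", "shard3"]

ring = ConsistentHashRing(shards)
before = {k: ring.shard_for(k) for k in keys}
ring.add_shard("shard4")
after = {k: ring.shard_for(k) for k in keys}

moved_ring = sum(before[k] != after[k] for k in keys)
moved_modulo = sum(
    modulo_shard(k, shards) != modulo_shard(k, shards + ["shard4"])
    for k in keys
)
print(f"consistent hashing moved {moved_ring} keys, modulo moved {moved_modulo}")
```

With the ring, the only keys that move are the ones the new shard takes over, roughly a quarter of them when going from three shards to four. With the modulo approach, most keys land on a different shard.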

One example of how to set this up would be client -> server -> (reverse_proxy) -> shards. The logic to select the shard can live in the reverse_proxy server. The reverse proxy receives the request from the server, which in turn handles requests from the client. There could be some if-else logic in the reverse_proxy server which selects the appropriate shard database.
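
The shard selection inside the reverse proxy could look something like the sketch below. It assumes hash-based selection in place of hand-written if-else branches and uses plain dictionaries in place of real shard databases; `ShardRouter` and its methods are hypothetical names, not part of any real proxy.

```python
import hashlib


class ShardRouter:
    """Toy stand-in for the shard-selection logic in the reverse proxy."""

    def __init__(self, shards):
        self.shards = shards  # each shard is just a dict in this sketch

    def _pick(self, key):
        # Hash the key and map it onto one of the shards.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        index = int.from_bytes(digest[:8], "big") % len(self.shards)
        return self.shards[index]

    def put(self, key, value):
        self._pick(key)[key] = value

    def get(self, key):
        return self._pick(key).get(key)


router = ShardRouter([{}, {}, {}])
router.put("user:42", {"name": "Asha"})
profile = router.get("user:42")  # served by the same shard that stored it
```

Keeping this logic in one place means the server never needs to know how many shards exist or where a given key lives.
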
The key takeaway is to choose the sharding function carefully so that no hotspots are created.
