VaksinCovid.gov.my blunder – What happened and what should have been done.
I missed the previous AstraZeneca vaccination program and wanted to try my luck again this time around. So you can imagine how excited I was to get myself and my other half enrolled, hoping we could do our part together in making our surroundings safe for everyone. I landed on the vaccination page and, like every other good Samaritan, was presented with one horrendous experience after another.
We’re not here to play the blame game, but rather to share and learn together. My only frustration is that this was a web application meant to help end an outbreak, and while time was critical, funding was not. There are some really good developers in our country whom the government could have appointed to build and manage this system, yet that was not the case.
So what happened today?
The CORS nightmare. CORS stands for Cross-Origin Resource Sharing. I bet by now you have been seeing this term a lot more than ever before. In fact, it is probably trending right now, but CORS is nothing new. Let’s take a little stroll in the park through it.
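In simpler words, browsers enforce a same-origin policy: a script running on one origin cannot freely read responses from a different origin. CORS is the mechanism that lets a server opt in and declare which other origins are allowed. For example, if our page is on pokde.net and its script tries to request data from pokde.com, which is a totally different domain, the browser will present us with a CORS error unless pokde.com explicitly allows it. Bear in mind, this doesn’t apply to plain embedded resources like images, stylesheets, script sources, or videos. If you are pulling in a stylesheet from Google Fonts, for example, it will not trigger a CORS error. (Font files themselves are subject to CORS, but Google serves them with headers that allow any origin.)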
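To make “same origin” a little more concrete, here is a minimal sketch in Python of the comparison a browser makes: two URLs share an origin only when scheme, host, and port all match. The helper names (origin_of, is_same_origin) and the api.pokde.net subdomain are made up for illustration, and real browsers handle more edge cases than this does.

```python
from urllib.parse import urlsplit

# Default ports, so that https://pokde.net and https://pokde.net:443 compare equal.
DEFAULT_PORTS = {"http": 80, "https": 443}


def origin_of(url: str) -> tuple:
    """Return the (scheme, host, port) triple that defines a URL's origin."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    port = parts.port or DEFAULT_PORTS.get(scheme, -1)
    return (scheme, host, port)


def is_same_origin(page_url: str, request_url: str) -> bool:
    """True only when scheme, host, and port are all identical."""
    return origin_of(page_url) == origin_of(request_url)


if __name__ == "__main__":
    # Same site, different path: same origin.
    print(is_same_origin("https://pokde.net/news", "https://pokde.net/api/posts"))
    # Different domain: cross-origin.
    print(is_same_origin("https://pokde.net/", "https://pokde.com/"))
    # Hypothetical subdomain: still cross-origin.
    print(is_same_origin("https://pokde.net/", "https://api.pokde.net/"))
```

Note that a subdomain counts as a different origin, even though it belongs to the same site owner.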
Why is this important? Back in the IE days, XSS or Cross-Site Scripting was a major pain for developers. If somehow you managed to inject malicious cross-site code into a website, the results could be devastating. Essentially, a hacker could snoop on the actions taken on a website, or fire off a redirect at will. Restricting what a script can request across origins limits that kind of damage, and CORS policies were introduced as the controlled way to relax those restrictions. Over the years, browsers have enforced them more strictly, which is a good thing.
So what happened to the vaksincovid.gov.my website? Weren’t they running the system on their own? Short answer? Yes, they were. Long answer?
What the developers of vaksincovid.gov.my did was expose an API at the URL https://api.vaksincovid.gov.my/az/ followed by a series of parameters that you can query to get the relevant response. This is excellent! Having an isolated API means this data can later be extended to other applications like MySejahtera or a dedicated mobile app, or even to hospitals that would like to check an individual’s vaccination status in the future.
However, as you can see, while the main domain of the API is the same as the visited site, the subdomain makes the final origin totally different. Under the same-origin policy, vaksincovid.gov.my and api.vaksincovid.gov.my are two different origins, so the API has to send CORS headers that allow the main site. When configuring a webserver, we can allow every origin with a wildcard (*), but if a site sends credentials such as cookies with its requests, the wildcard is not allowed and each permitted origin has to be returned explicitly. This is probably where the blunder on vaksincovid.gov.my happened.
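As a rough illustration of the explicit-origin rule, here is a sketch in Python of how an API server might build its CORS response headers when cookies are involved. This is not the actual vaksincovid.gov.my configuration: the function name and the www entry in the allowlist are assumptions made for the example. Only the header names are the standard ones.

```python
# Hypothetical allowlist of front-end origins that may call the API.
ALLOWED_ORIGINS = {
    "https://vaksincovid.gov.my",
    "https://www.vaksincovid.gov.my",  # assumed, for illustration only
}


def cors_headers(request_origin, allow_credentials=True):
    """Build CORS response headers for a request's Origin header value.

    With credentials (cookies) in play, browsers reject a wildcard "*",
    so the server echoes back the exact origin, but only if it is allowlisted.
    """
    if request_origin not in ALLOWED_ORIGINS:
        # No CORS headers at all: the browser blocks the cross-origin read.
        return {}
    headers = {
        "Access-Control-Allow-Origin": request_origin,
        # Tell caches that the response differs per requesting origin.
        "Vary": "Origin",
    }
    if allow_credentials:
        headers["Access-Control-Allow-Credentials"] = "true"
    return headers


if __name__ == "__main__":
    print(cors_headers("https://vaksincovid.gov.my"))
    print(cors_headers("https://some-other-site.example"))
```

Echoing the origin from an allowlist is the usual way to support several front-end domains, since the Access-Control-Allow-Origin header itself carries only a single origin or the wildcard.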
When the API itself is not working, we can expect abnormal behaviors. I would assume the first call to the API requests the already booked dates, which then populates the date containers with the respective colors to indicate availability. I believe the logic treats the date blocks as “unavailable” by default, and until the API returns a response indicating availability, a date is considered unavailable. However, the default pre-defined CSS class is the green one instead. If everything works, it wouldn’t matter, but when the API is broken, it delivers a rather clunky experience. That is why a huge number of users complained that a date was indicated as available but, upon clicking, showed as unavailable.
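The safer default can be sketched in a few lines of Python. This is only my guess at the shape of the logic, not the site’s real code: the function name and the CSS class names (date-available, date-unavailable) are hypothetical. The point is that a date turns green only when the API has positively confirmed it.

```python
def date_css_class(date, availability):
    """Pick the CSS class for one date block.

    `availability` maps date strings to booleans, or is None when the
    API call failed (for example, blocked by a CORS error).
    """
    if availability is None:
        # API is down or blocked: never promise a slot we cannot confirm.
        return "date-unavailable"
    # A date missing from the response is also treated as unavailable.
    if availability.get(date) is True:
        return "date-available"
    return "date-unavailable"


if __name__ == "__main__":
    # Made-up dates, for illustration only.
    print(date_css_class("2021-06-07", None))
    print(date_css_class("2021-06-07", {"2021-06-07": True}))
```

With this default, a broken API shows a calendar full of unavailable dates, which is frustrating but at least consistent with what happens when the user clicks.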
Amazon has a pretty straightforward guide on handling CORS that can be found here.
Of course, later on, there was a slew of 5xx server errors, which points towards infrastructure availability problems. This enters DevOps territory. From my understanding, vaksincovid.gov.my is hosted on Amazon Web Services (AWS), one of the most robust and elaborate public cloud platforms available in the world today. Heck, I like to joke that AWS is so elaborate that when I expand the navigation menu that lists all its services, even my 4K display can’t fit all of it.
But if we’re only going to pick a Lightsail instance or a small EC2 stack, even the gods will not be able to save us. For something that potentially has a whole nation swarming a particular website, a good setup that I would run would be a couple of T-series EC2 instances for the web application, sitting behind an Elastic Load Balancer that decides which server a particular visitor should be routed to. Amazon even allows auto-scaling, where it monitors the load and spawns more instances where necessary, then kills them when they’re no longer needed. The multimedia assets can be dumped into an S3 bucket. The database can run on Amazon Aurora under RDS, one of the fastest database systems, built in-house by Amazon themselves, capable of running MySQL-compatible workloads like they were on steroids. Wrap this all in a VPC and you have a very powerful, scalable architecture ready to serve the nation. For added performance, add a CloudFront CDN that ties everything together and caches it, because there is no such thing as overkill when speed is a concern on the internet. All this would cost a bomb when running on Amazon, but trust me, based on the allocation that was provided, this cost would’ve been peanuts.
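To give a feel for what auto-scaling decides, here is a toy proportional rule in Python. It is a deliberate simplification and not AWS’s actual algorithm: the function name, the 50% CPU target, and the minimum and maximum instance counts are all assumptions picked for the example.

```python
import math


def desired_instances(current, avg_cpu_percent, target_cpu_percent=50.0,
                      minimum=2, maximum=20):
    """Toy scaling rule: size the fleet so average CPU lands near the target.

    All thresholds are illustrative assumptions, not AWS defaults.
    """
    if current < 1:
        raise ValueError("need at least one running instance to measure load")
    # If 2 instances run at 90% CPU and we want 50%, we need 2 * 90 / 50 = 3.6,
    # rounded up to 4 instances.
    wanted = math.ceil(current * avg_cpu_percent / target_cpu_percent)
    # Clamp to the configured fleet size limits.
    return max(minimum, min(maximum, wanted))


if __name__ == "__main__":
    print(desired_instances(2, 90.0))   # traffic spike: scale out
    print(desired_instances(4, 10.0))   # quiet period: scale in to the minimum
```

The upper limit matters as much as the lower one: it caps the bill during a surge, so it has to be set with the expected nationwide traffic in mind.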
As far as Cloudflare is concerned, many people are judging them for using the free version instead of one of the paid tiers. If you ask me, that is not really a problem. The free version of Cloudflare is more than sufficient to run a website comfortably. It is limited to 3 page rules, but if they’re extremely good at managing canonical URI structures, proper wildcards can make 3 rules sufficient. Otherwise, the DNS management, the caching, the SSL management, and all the other perks are already there for everyone to enjoy without having to upgrade. But of course, if I were given an RM70m budget, I would be dying to try Cloudflare’s Enterprise plan, even if it was just for one month.
So what should be done?
A rather interesting discussion unfolded in the Malaysian programming community (Programming Laman Web + UI/UX) on Facebook, and some very generous suggestions on improving the system were given there. One caught my attention, and I believe it solves quite a bulk of the problem. Instead of letting visitors pick a date, the system should assign the date to them. Prior to announcing the vaccination availability, announce the period within which the vaccinations will be conducted. Then, as users register, each is given a date and time slot, and it is now their responsibility to make themselves available at that date and time. It’s a pandemic vaccination we’re talking about here, not a dating appointment. It’s everyone’s responsibility to adhere to the instructions that are given. No task or job should get in our way when a pandemic vaccination is the concern. Employers will be forced to give an exemption to the person who’s due for their shot.
On top of that, the overhead in terms of API calls would be drastically reduced. Instead of going back and forth, hitting the database every time a user lands on the page just to retrieve a list of available dates, then cross-checking again when a date is selected, then writing the record once a date is confirmed, why not assign a date and time upon landing on the page, in sequence of availability, and tie it to the session until the form is submitted? This way, the slot is held until the form is submitted, and if the session is terminated, the system knows and can assign the slot to the next visitor instead. One single API call, less payload data to be carried, and one single final CREATE operation on the database. Whether they’re offloading the API requests to Cloudflare is another story. Using AWS, they didn’t really have to use Cloudflare for API caching anyway.
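Here is a minimal in-memory sketch in Python of that assign-and-hold idea. Everything in it is an assumption for illustration: the SlotAllocator class, the 10-minute hold, and the made-up slot timestamps. A real system would keep this state in a database or cache with proper locking, not in a Python object.

```python
import time
from dataclasses import dataclass, field


@dataclass
class SlotAllocator:
    free_slots: list                 # sortable slot labels, earliest first
    hold_seconds: float = 600.0      # assumed 10-minute hold per session
    holds: dict = field(default_factory=dict)   # session_id -> (slot, expires_at)
    booked: dict = field(default_factory=dict)  # session_id -> slot

    def _release_expired(self, now):
        """Return slots from abandoned sessions to the free pool."""
        for sid, (slot, expires_at) in list(self.holds.items()):
            if expires_at <= now:
                del self.holds[sid]
                self.free_slots.append(slot)
        self.free_slots.sort()  # keep earliest-first order

    def assign(self, session_id, now=None):
        """Give the session the next free slot, or its existing one."""
        now = time.time() if now is None else now
        self._release_expired(now)
        if session_id in self.booked:
            return self.booked[session_id]
        if session_id in self.holds:
            return self.holds[session_id][0]
        if not self.free_slots:
            return None  # fully booked
        slot = self.free_slots.pop(0)
        self.holds[session_id] = (slot, now + self.hold_seconds)
        return slot

    def confirm(self, session_id, now=None):
        """Form submitted: turn the hold into a booking (the single write)."""
        now = time.time() if now is None else now
        self._release_expired(now)
        held = self.holds.pop(session_id, None)
        if held is None:
            return None  # hold expired or never existed
        self.booked[session_id] = held[0]
        return held[0]


if __name__ == "__main__":
    alloc = SlotAllocator(["2021-06-07T09:00", "2021-06-07T09:15"])
    print(alloc.assign("session-a"))   # earliest slot is held for this visitor
    print(alloc.confirm("session-a"))  # booking is written once, on submit
```

The visitor never sees a list of dates, so there is nothing to cross-check: the only database write happens in confirm, and an abandoned session simply lets its slot fall back to the next person in line.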
Pokdepinion: Again, I’d like to emphasize that I’m not trying to be a smart arse here. I really love the web and I myself learn continuously every day. What’s done is done, and our purpose here is education. Still, it’s a major lesson for vaksincovid.gov.my, and this really should not happen again.