Why some DVLA digital services don't work at night
39 comments
· January 12, 2025
abigail95
ajnin
The batch jobs don't take 13 hours. They're just scheduled to run some time at night, when the old offices used to be closed and the jobs could be run with some expectations regarding data stability over the period. There are probably many jobs scheduled to run at 1AM, then 2AM, etc., each depending on the previous one having finished, so there is a large delay built in to ensure that a job does not start before the previous one is finished.
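To make the scheduling point concrete, here is a minimal sketch comparing fixed clock slots with jobs that start as soon as their predecessor finishes. The job names, durations and the one-hour slot are invented for illustration and are not taken from the DVLA; the sketch also assumes every job fits inside its slot.

```python
from datetime import datetime, timedelta

# Hypothetical jobs as (name, duration in minutes); purely illustrative.
JOBS = [("extract", 12), ("reconcile", 25), ("report", 8)]

def fixed_slot_finish(start, slot_minutes=60):
    """Each job starts at its own clock slot (1AM, 2AM, ...), regardless of
    when the previous job actually finished."""
    last_start = start + timedelta(minutes=slot_minutes * (len(JOBS) - 1))
    return last_start + timedelta(minutes=JOBS[-1][1])

def chained_finish(start):
    """Each job starts as soon as the previous one finishes."""
    return start + timedelta(minutes=sum(duration for _, duration in JOBS))

start = datetime(2025, 1, 12, 1, 0)
print(fixed_slot_finish(start) - start)  # 2:08:00
print(chained_finish(start) - start)     # 0:45:00
```

With these made-up numbers, the padding between slots accounts for most of the window rather than the work itself.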
As to your "not a complex system" remark, when a system is built for 60 years, piling up new rules to implement new legislation and needs over time, you tend to end up with a tangled mess of services all interdependent that are very difficult to replace piece-wise with a new shiny architecturally pure one. This is closer to a distributed monolith than a microservices architecture. In my experience you can't rebuild such a thing "in 3 months". People who believe that are those that don't realize the complexity and the extraordinary amount of specifics, special cases, that are baked into the system, and any attempt to just rebuild from scratch in a few months hits that wall and ends up taking years.
abigail95
The code will be spaghettified and hideous. The queries will be nonsense.
That doesn't change the fact that the ultimate goal of the system is to manage drivers licenses.
> In my experience you can't rebuild such a thing "in 3 months".
Me and my team rebuilt the core stack for the central bank of a developing country. In 3 months. The tech started in the 70s just like this. Think bigger.
shermantanktop
It’s funny to me that I would never ask those questions. I’ve specialized in legacy rehab projects (among other things) and there seems to be no upper bound on how bad things can be or how many annoying reasons there are for why we can’t “just fix it.” Those “just” questions—which I ask too—end up being hopelessly naive. The answers will crush your soul if you let them, so you can’t let them, and you should always assume things are worse than you think.
TFA is spot on - the way to make progress is to cut problems up and deliver value. The unfortunate consequence is that badness gets more and more concentrated into the systems that nobody can touch, sort of like the evolution of a star into an eventual black hole.
abigail95
I made a lot of money moving mid size enterprises from legacy ERP systems to custom in house ones.
The DVLA dataset and the computations that are run on it can be studied and replicated in 3 months by a competent team. From there it can be improved.
There is no way that this system requires 13 hours of downtime. If it required two hours - even if the code was generated through automation it can be reverse engineered and optimized.
It is absolute rubbish that this thing is still unavailable outside of 8am-7pm.
I maintain my position that it could be replaced in 3 months.
I got my start in this business when I was in university and they told us our online learning software was going offline for 3 days for an upgrade. Those are the gatekeepers and low achievers we fight against. Think bigger.
that_guy_iain
> Edit: Why does rebuilding take a decade or more? This is not a complex system. It doesn't need to solve any novel engineering challenges to operate efficiently. Article does not give much insight into why this particular task couldn't be fixed in 3 months.
You do know the UK government has been cutting all their budgets to the bone for about 10 years? That means everywhere is pretty much understaffed.
And how do you know it's not a complex system? I would think that a system like that would be somewhat complex. It's not just driving licenses but a whole bunch of other things that are handled by the DVLA.
abigail95
The system may or may not be complex but the data it has to store and transform is not. Because it handles drivers licenses. A function that has been done on pen and paper and filing cabinets.
Study the data, study the operations, reduce complexity.
Since you imply you know more about UK budgets than I do - how much is the DVLA budgeted for IT operations like this and how much more would you give them to expect this problem solved?
I can argue real numbers but vibes about bone dry budgets I cannot.
that_guy_iain
> The system may or may not be complex but the data it has to store and transform is not. Because it handles drivers licenses. A function that has been done on pen and paper and filing cabinets.
It handles more than just driving licenses... The DVLA do more than just driving licenses.
> Since you imply you know more about UK budgets than I do - how much is the DVLA budgeted for IT operations like this and how much more would you give them to expect this problem solved?
It's not budgeted anything for this as far as I know. I believe it's handled by Government Digital Services which handles lots of the digital services for various departments. The budget for all of GDS is 90 million. A rewrite of that size I would expect to cost about 50-60 million in total but take several years.
delta_p_delta_x
Some DVLA services don't work in the day, too. Case in point, the 'get a share code' service: https://www.viewdrivingrecord.service.gov.uk/driving-record/...
pestatije
DVLA - Driver & Vehicle Licensing Agency
Plus, since I'm already posting a comment: it's because there is no batch window to process transactions.
rozab
I've often run into this when using DVLA services and spluttered with indignation. But at the end of the day, these services are fantastically usable (during the daytime) and I appreciate Dafydd pushing to just get them out there!
I got my license in 2015 so never in my life have I had the apparently ubiquitous American experience of queuing at the DMV and filling in paper forms. (is this still real? or limited to stand-up comedy?)
snakeyjake
My US state, one of the ones NOT living in the past, does almost everything online.
The only times you have to come in are:
1. for your first license, either as a newly-licensed driver or an out-of-state driver who recently moved
2. if you were bad and broke the law or otherwise had your license cancelled/revoked/suspended
Even those people have to call or go online to make an appointment.
All other tasks, from getting/returning plates to requesting a duplicate title, can be done online, through drop boxes, or by mail.
I have been to the DMV three times since 1995: once to turn my out-of-state license into an in-state one, once to turn that drivers license into a realID-compliant one, and once to have my fingerprints taken for a concealed carry permit.
nsxwolf
The queues have been mostly replaced with "take a number" systems where you can sit down and wait... with your... papers... that you had to fill out first...
fn-mote
> The queues have been mostly replaced with "take a number" systems where you can sit down and wait...
My recent experience was: sign up online and get a 30 min window (9:00-9:30 say). Queue everyone for that 30 minute window outside the building. At exactly 9:30, enter and go through the usual queues inside. The advantage is that getting through those queues now takes 30 minutes or less because their length is limited. Presumably we/they traded volume of processing for certainty of time spent in the queue. A very familiar tradeoff for a computer scientist.
AlotOfReading
Queuing at the DMV and filling out paperwork is very much a real thing that still happens. It's a pretty different experience in every state though.
ChocolateGod
Can it not be done online like in the UK?
neckro23
Usually, but it depends on the state. Remember, America isn’t a country, it’s 50 countries in a trenchcoat.
It’s often a mishmash of services too. I was told in-person at the DMV that I couldn’t renew my registration since I’m not the registered owner of my car. So I just went to a DMV kiosk at the local grocery store and did it there without a hassle.
ForHackernews
Unpopular opinion, but I think many systems would benefit from a regular "downtime window". Not everything needs to be 24/7 high availability.
Maybe not every night, but if you get users accustomed to the idea that you're offline for 12 hours every Sunday morning, they will not be angry when you need to be offline for 12 hours on a Sunday morning to do maintenance.
The stock market closes, more things should close. We are paying too high of a price for 99.999% uptime when 99.9% is plenty for most applications.
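For a sense of scale, the availability figures can be turned into hours of downtime with a few lines of arithmetic. This is only a back-of-the-envelope sketch; the function names are made up and a 365-day year is assumed.

```python
HOURS_PER_YEAR = 365 * 24  # assumption: non-leap year

def allowed_downtime_hours(availability_pct):
    """Hours per year a service may be down at the given availability."""
    return (100 - availability_pct) / 100 * HOURS_PER_YEAR

def availability_with_weekly_window(window_hours):
    """Availability (percent) if the only downtime is a fixed weekly window."""
    hours_per_week = 7 * 24
    return 100 * (1 - window_hours / hours_per_week)

for pct in (99.9, 99.999):
    print(f"{pct}% -> {allowed_downtime_hours(pct):.2f} h/year")
print(f"12 h weekly window -> {availability_with_weekly_window(12):.1f}% available")
```

Under these assumptions 99.9% allows roughly 8.76 hours of downtime a year and 99.999% roughly 5 minutes, while a 12-hour window every week works out to about 92.9% availability, so a weekly window of that length is a much looser target than either figure.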
crazygringo
> Not everything needs to be 24/7 high availability.
If it makes you more money to be available 24/7 then why wouldn't you?
> Maybe not every night, but if you get users accustomed to the idea that you're offline for 12 hours every Sunday morning
Then I would use a competitor that was online, period.
Imagine Sunday morning is the only time you have to complete a certain school assignment, but Wikipedia is offline? Or you need to send messages to a few folks that they need to see by the evening, but the platform won't come online until 3pm, which means you'll need to interrupt your afternoon family time instead?
Maybe things closing works fine for your needs and your schedule. But it sure won't for everyone else. Having services that are reliable is one of the things that distinguishes developed countries from developing ones.
jmwilson
Who works Sunday morning then?
The maintenance window will morph into a do-big-risky-changes window, which means everybody in engineering will have to be on-call. Many years ago, when I newly joined a FAANG, I asked, "shouldn't I run this migration after hours when load is low?" and the response was firm, "No, you'll run it when people are around to fix things". It may not always be the answer, but in general, I want to do maintenance when people are present and willing to respond, not nights and weekends when they're somewhere else and can't be found.
OJFord
It only really works where the audience is already limited in country/timezone though. Sure, a global service could just stagger the downtime around the world, but (unless you've already partitioned the infrastructure equivalently) then you're just running 24/7 with arbitrary geofenced downtime on top.
kragen
Basically this happens because the DVLA and the stock market don't have any competition. Customers in a competitive market won't be angry when you need to be offline for 12 hours every Sunday morning; they'll just switch to your competitor some Sunday, because the competitor is providing them something they value that you don't provide.
ajnin
The stock markets definitely have competition. For instance Frankfurt, London, Paris or Amsterdam very much compete with each other to offer desirable conditions for investors, and companies will move their trading from one to another if it is their interest. I think the fact they close at night is a self-preservation mechanism, traders would become insane if they had to worry about their positions 24/7.
kragen
There's a very strong network effect, and most stocks are only listed on a single stock exchange, so in most contexts the competition is very minimal.
ForHackernews
Maybe they should regulate Sunday trading hours, or unionized sysadmins should negotiate the end of on-call hours.
The red queen's race that you describe for ever-greater scale, ever-greater availability is an example of the tragedy of the commons. Think how much money and how many human minds have been wasted trying to squeeze out that last .0001% of "zero downtime" when they could have been creating something new.
"Keep doing the same thing, but more of it, harder" is a recipe for a barren world of monoculture.
kragen
Something like that might plausibly be correct, though you've exaggerated it to a level where it's clearly false.
If we steelman it to its most defensible essence, I think what you're saying is that the cost of the human effort needed to provide these higher uptimes exceeds the consumer benefit (the value of being able to buy a camera on Saturday), say. You could imagine, for example, that each incremental improvement in uptime wins over a proportion of the customer base providing a value that vastly exceeds its cost — but only until your competitors improve their own offering to match, so all the surplus from all this uptime improvement ultimately goes to the consumers, not the producers.
There are two related holes in this idea.
The first is that producing consumer surplus is what the economy is for, in a moral sense. The reason producing goods and services is a good thing to do is so that someone will benefit from using them! So if all the effort that sysadmins make goes into making services better for users, that's a good thing, not a bad thing.
The second is that nothing is stopping a new entrant from offering a new, low-cost service that isn't as reliable. If the cost of providing all that extra reliability (bundled into the incumbents' pricing scheme) is higher than the actual benefit to users, the users will switch to the lower-cost, less-reliable service. This has happened many times, in fact: less-reliable minicomputers stole business from mainframes, less-reliable VoIP stole business from ATM and SONET and SDH, all kinds of less-reliable plastic goods have stolen business from all-metal versions, and now solar panels are stealing business from coal power plants even though solar panel "uptime" is like 30%.
So the particular market dynamics we're talking about actually sensitively optimize the amount of effort given to uptime to the economic optimum. There do exist lots of market failures, but the particular dynamic we're discussing is the opposite extreme from something like a dollar auction.
lifeoflejf
Bergen County, NJ has blue laws that make it so non-grocery stores must be closed on Sundays. Maybe there’s some value in structuring a time where everybody is off?
Just like at work, the only time I really get off is when all of my customers are off. It’s nice when the industry sorta shuts off for a week or so around Christmas.
abigail95
Who is trying to achieve zero downtime? Facebook has degraded service regularly; it's just close enough to 99.9 that nobody cares.
If loading my messages times out I just move onto something else and go back a few minutes later.
Surely they have metrics measuring that and don't think it's worth the engineering effort to improve it.
mike_hearn
tl;dr same reason other services go offline at night: concurrency is hard and many computations aren't thread safe, so need to run serially against stable snapshots of the data. If you don't have a database that can provide that efficiently you have no choice but to stop the flow of inbound transactions entirely.
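A toy illustration of the snapshot idea, assuming an in-memory store with invented record ids and values. A real system would lean on database snapshot isolation rather than a deep copy, and without either it would have to block writes for the duration of the job.

```python
import copy

# Hypothetical live store: record id -> some numeric value.
live = {"A1": 100, "B2": 250}

def take_snapshot(store):
    """Freeze a consistent view of the data for the batch job to read."""
    return copy.deepcopy(store)

def nightly_total(snapshot):
    """A serial computation that must see one consistent state."""
    return sum(snapshot.values())

snap = take_snapshot(live)

# Inbound transactions keep arriving while the batch job is running.
live["C3"] = 40
live["A1"] = 0

print(nightly_total(snap))   # 350, the state at snapshot time
print(sum(live.values()))    # 290, the state after the new writes
```

The batch result reflects exactly one point in time no matter how many writes land during the run, which is the property the overnight shutdown buys when no cheap snapshot is available.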
Sounds like Dafydd did the right thing in pushing them to deliver some value now and not try to rebuild everything right away. A common mistake I've seen some people make is assuming that overnight batch jobs that have to shut down the service are some side effect of using mainframes, and any new system that uses newer tech won't have that problem.
In reality getting rid of those kinds of batch jobs is often a hard engineering project that requires a redesign of the algorithms or changes to business processes. A classic example is in banking where the ordering of these jobs can change real world outcomes (e.g. are interest payments made first and then cheques processed, or vice-versa?).
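The ordering effect is easy to show with made-up numbers. The sketch below assumes a single account, a flat 1% interest step and one cheque, with amounts held in integer pence; none of these figures come from a real bank.

```python
def apply_interest(balance_pence, rate_pct=1):
    """Add interest, rounding down to whole pence (illustrative rule)."""
    return balance_pence * (100 + rate_pct) // 100

def process_cheque(balance_pence, cheque_pence):
    """Deduct a cheque from the balance."""
    return balance_pence - cheque_pence

opening = 100_000   # 1,000.00 in pence
cheque = 50_000     # 500.00 in pence

interest_first = process_cheque(apply_interest(opening), cheque)
cheque_first = apply_interest(process_cheque(opening, cheque))

print(interest_first)  # 51000
print(cheque_first)    # 50500
```

Same inputs, two different closing balances, purely from the order in which the two jobs run.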
In other cases it's often easier for users to understand a system that shuts down overnight. If the rule is "things submitted by 9pm will be processed by the next day" then it's easy to explain. If the rule is "you can submit at any time and it might be processed by the next day", depending on whether or not it happens to intersect the snapshot taken at the start of that particular batch job, then that can be more frustrating than helpful.
Sometimes the jobs are batch just because of mainframe limitations and not for any other reason, those can be made incremental more easily if you can get off the mainframe platform to begin with. But that requires rewriting huge amounts of code, hence the popularity of emulators and code transpilers.
abigail95
Do you know why the downtime window hasn't been decreasing over time as it gets deployed onto faster hardware over the years?
Nobody would care or notice if this thing had 99.5% availability and went read only for a few minutes per day.
pjc50
Maybe it isn't running on faster hardware? These systems are often horrifyingly outdated.
mike_hearn
It doesn't get deployed onto faster hardware. Mainframes haven't really got faster.
abigail95
It must be. Maintaining the original hardware would be more expensive than upgrading to compatible but faster systems.
ndriscoll
Mainframes have absolutely gotten faster. They're basically small supercomputers.
ndriscoll
Getting rid of batch jobs shouldn't be a goal; batch processing is generally more efficient as things get amortized, caches get better hit ratios, etc.
What software engineers should understand is there's no reason a batch can't take 3 ms to process and run every 20 ms. "Batch" and "real-time" aren't antonyms. In a language/framework with promises and thread-safe queues it's easy to turn a real time API into a batch one, possibly giving an order of magnitude increase in throughput.
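A minimal sketch of that idea, using Python's asyncio futures in place of promises and a plain list in place of a thread-safe queue (everything runs on one event loop). `MicroBatcher` and `double_all` are hypothetical names invented for this example, and error handling is left out.

```python
import asyncio

class MicroBatcher:
    """Collects per-item calls and processes them together every `interval` seconds."""

    def __init__(self, process_batch, interval=0.02):
        self._process_batch = process_batch  # function: list of items -> list of results
        self._interval = interval
        self._pending = []                   # (item, future) pairs awaiting the next batch
        self._task = None                    # the scheduled flush, if any

    async def submit(self, item):
        """Looks like a per-item ("real-time") call from the caller's side."""
        future = asyncio.get_running_loop().create_future()
        self._pending.append((item, future))
        if self._task is None:
            self._task = asyncio.create_task(self._flush_later())
        return await future

    async def _flush_later(self):
        await asyncio.sleep(self._interval)
        batch, self._pending = self._pending, []
        self._task = None
        results = self._process_batch([item for item, _ in batch])
        for (_, future), result in zip(batch, results):
            future.set_result(result)

calls = []  # records the size of each batch that was processed

def double_all(items):
    calls.append(len(items))
    return [x * 2 for x in items]

async def main():
    batcher = MicroBatcher(double_all)
    return await asyncio.gather(*(batcher.submit(i) for i in range(5)))

results = asyncio.run(main())
print(results, calls)  # [0, 2, 4, 6, 8] [5]
```

Five separate calls are served by a single batch invocation; the caller never sees the batching, only a delay of up to the 20 ms interval.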
mike_hearn
Batch size is usually fixed by the business problem in these scenarios, I doubt you can process them in 3msec if the job requires reading in every driving license in the country and doing some work on them for instance.
ndriscoll
This particular thing might be difficult to change because it's 50 year old COBOL or whatever, but my point was more that I've encountered pushes from architects to "eliminate batches" and it makes no sense. It just means that now I have to re-batch things in my code. The correct way to think about it is that you want smaller, more frequent batches.
Do they really need to do work on all records every night? Probably not. Most people aren't changing their license or vehicle info most days. So the problem is that somewhere they're (conceptually) doing a table scan instead of using an index. That might still be hard to fix, but at least identify the correct problem. Otherwise as you say moving to different tech won't fix it.
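As a rough sketch of "index instead of table scan", the nightly job could select only the rows touched since the last run. The table, column names and dates below are invented for illustration and say nothing about how the DVLA actually stores its data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE records (id INTEGER PRIMARY KEY, holder TEXT, changed_on TEXT)"
)
# The index lets the database locate recent changes without reading every row.
conn.execute("CREATE INDEX idx_changed_on ON records (changed_on)")
conn.executemany(
    "INSERT INTO records (id, holder, changed_on) VALUES (?, ?, ?)",
    [
        (1, "A", "2024-11-02"),
        (2, "B", "2025-01-11"),
        (3, "C", "2023-06-30"),
        (4, "D", "2025-01-11"),
    ],
)

def changed_since(conn, day):
    """Return ids of records modified on or after `day` (ISO date string)."""
    rows = conn.execute(
        "SELECT id FROM records WHERE changed_on >= ? ORDER BY id", (day,)
    )
    return [row[0] for row in rows]

print(changed_since(conn, "2025-01-11"))  # [2, 4]
```

The batch then does work proportional to the day's changes rather than to the size of the whole table, assuming the job's logic really only depends on changed records.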
Something is missing here, why do batch jobs take 13 hours? If this thing was started on an old mainframe why isn't the downtime just 5 minutes at 3:39 AM?
Exactly how much data is getting processed?
Edit: Why does rebuilding take a decade or more? This is not a complex system. It doesn't need to solve any novel engineering challenges to operate efficiently. Article does not give much insight into why this particular task couldn't be fixed in 3 months.