We rebuilt Cloud Life's infrastructure delivery with System Initiative
31 comments
· August 27, 2025
SteveNuts
>this blog left me with more questions than answers
Probably because it's a thinly veiled ad. I agree the post is severely lacking in details.
ryanryke
Thanks for the feedback. I'm new to the platform, and certainly appreciate the interaction.
I think I described SI a bit better in another reply, and you can certainly check their website for a better description than I can give here.
I'll try to give a high-level view of our particular issues so you get a sense of why this is important to us.
Traditionally, we've managed our customers via TF. I made a big push years back to try and standardize how we delivered infrastructure to our customers. We started pushing module libraries, abstracting variables via YAML, and leveraging Terragrunt to be as DRY as possible. We followed best practices to try and minimize state files for a reduced blast radius, etc.
What became apparent was that despite how much we tried to standardize, there was always something that didn't fit between customers. So each customer quickly became a snowflake. It would have its own special version of some module or some specialized logic to match their workflow. Then over time, as the modules evolved, the questions started to come up:
- Do we go back and update every customer with the new version of the module?
- Does the new module have different provider/submodule/TF version requirements?
- Did the customer make some other changes to infra that aren't captured?
Making minor changes could end up taking way longer than necessary. Making large changes could be a nightmare.
In working with SI the mindset has shifted. Rather than manage the hypothetical (i.e. what's written in TF), we manage the actual. Instead of trying to reconcile in code why a container has 2 CPUs instead of 4, we find the issue and fix it. If we want to upgrade something, we find it and upgrade it.
I can go into greater depth if you care or have questions, but this at a high level explains this post a bit more.
AOE9
This blog feels like a poor ad. I was hoping for technical details, but it seems like this tool just swept in and 'saved the day'. I have no idea how
* Provisioning time dropped from hours to minutes.
* Debugging speed improved because we could fix it in real time.
happened.
Seems like the problem of a long feedback loop would have been solved by pull request preview environments and/or enabling developers to have their own deployed instances for testing, etc.
holoway
Ryan can give you more details about his own experience. (I'm the CEO of System Initiative) But a lot of it comes from switching to a model where you work with an AI agent alongside digital twins of the infrastructure.
In particular, debugging speed improves because you can ask the agent questions like:
`I have a website running on ec2 that is not working. Make a plan to discover all the infrastructure components that could have an impact on why I can't reach it from a web browser, then troubleshoot the issue.`
And it will discover infrastructure, evaluate the configuration, and see if it can find the issue. Then it can make the fix in a simulation, humans can review it, and you're done. It handles all the audit trails, review, state, etc for you under the hood - so the actual closing of the troubleshooting loop happens much faster as well.
AOE9
When you say 'digital twins of the infrastructure' you mean another deployed instance? So if they'd just made a preview environment created upon a pull request they'd have just got the same speed up.
> It handles all the audit trails, review, state, etc for you under the hood.
So there is no more IaC? SI now manages everything?
holoway
Nope - I mean we make a 1:1 model of the real resource, and then let you propose changes to that data model. Rather than thinking of it like code in a file, think of it like having a live database that does bi-directional sync. The speedup in validating the change happens because we can run it on the data model, rather than on 'real' infrastructure.
Then we track the changes you make to that hypothetical model, and when you like it, apply the specific actions needed to make the real infrastructure conform. All the policy checking, pipeline processing, state file management, etc. is all streamlined.
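To make the "live data model with bi-directional sync" idea concrete, here is a minimal sketch in TypeScript (the names and shapes are my own illustration, not SI's actual API): a component mirrors a real resource's properties, you edit the proposed copy, and the diff between the two becomes the set of actions needed to make reality conform.

```typescript
// Hypothetical sketch of a "digital twin" component. Names are illustrative only.
type Props = Record<string, string | number>;

interface Component {
  resourceId: string;
  real: Props;      // last-synced state of the real resource
  proposed: Props;  // the hypothetical model you edit and review
}

// Diff proposed vs. real to get the concrete changes to apply.
function planActions(c: Component): { key: string; from: unknown; to: unknown }[] {
  return Object.keys(c.proposed)
    .filter((k) => c.proposed[k] !== c.real[k])
    .map((k) => ({ key: k, from: c.real[k], to: c.proposed[k] }));
}

const ec2: Component = {
  resourceId: "i-0abc123",
  real: { InstanceType: "t3.medium", Cpus: 2 },
  proposed: { InstanceType: "t3.medium", Cpus: 4 },
};

console.log(planActions(ec2)); // one action: Cpus 2 -> 4
```

The point of the sketch is that validation, policy checks, and review can all run against the proposed copy, with no real infrastructure touched until the diff is applied.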
holoway
And yes, there is no more IaC under the hood.
However! Folks with big IaC deployments can still use all the discovery and troubleshooting goodness, and then make the change however they want. System Initiative is fine either way.
esseph
No, not another deployed instance.
ryanryke
Thanks for the feedback. My plan is to spend a little more time to dive into the details on a follow up post.
I'll try to explain our experience here in a little better detail though.
In a traditional IaC tool (TF, for example), the flow would go something like this (YMMV):
Update TF -> Plan -> PR -> Review (auto or peer) -> Merge -> TF reviews state file -> TF makes changes -> Updates state.
Some issues we could run into:
- We support multiple customers, each with their own teams that may or may not have updated infra, so drift is always present.
- We support customers over time, so modules and versions age, and we aren't always given the time to go make sure that past TF is updated. So version pins need to be updated, among other dependencies.
Each of those could take a bit of time to resolve so that the TF plan runs clean and our updates are applied. Of course there are tools such as HCP Terraform, Spacelift, Terrateam, etc., but in my experience they shift a lot of the same problems to different parts of the workflow.
The workflow with SI is closer to the following:
Ask AI for a change -> AI builds a changeset (PR) -> Review -> Apply
The secret sauce is SI's "digital twin". We aren't just using AI to update code; we're actually using it to initiate changes to AWS via SI. While I would never want a team making changes directly to AWS without a peer review or something similar, SI sits much closer to what the actual infrastructure is, even as changes happen to the infrastructure naturally.
This has allowed us to move quite a bit faster in updating and maintaining our customers' infrastructure, while still sticking as close as possible to best practices.
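The SI flow described above (propose -> review -> apply) can be sketched as a small change-set lifecycle. This is my own illustration of the shape of the workflow, not SI's implementation: edits accumulate in a change set, and nothing can be applied until a reviewer approves it.

```typescript
// Illustrative change-set lifecycle, assuming the propose/review/apply flow above.
type Edit = { resourceId: string; key: string; to: string | number };

class ChangeSet {
  private edits: Edit[] = [];
  private approved = false;

  // An AI agent (or a human) proposes edits against the model.
  propose(edit: Edit): void {
    this.edits.push(edit);
  }

  // A human reviews the full set of proposed edits.
  review(approve: boolean): void {
    this.approved = approve;
  }

  // "Apply" here just returns what would be sent to the cloud API;
  // unreviewed change sets cannot be applied.
  apply(): Edit[] {
    if (!this.approved) throw new Error("change set not reviewed");
    return this.edits;
  }
}

const cs = new ChangeSet();
cs.propose({ resourceId: "i-0abc123", key: "Cpus", to: 4 });
cs.review(true);
console.log(cs.apply().length); // 1
```

The gate in `apply()` is the whole point: the agent can move fast on the model, but nothing reaches real infrastructure without the review step.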
stackskipton
So basically the product is "custom IaC with an AI agent". Sounds like a great business model if you can convince companies to go for it.
However, as an SRE, pass. I'd rather keep IaC in one of our pre-existing tools with much wider support and less lock-in. Also, since I'm in Azure/GCP, this tool won't work for me anyway, since it's AWS-focused, and when you go multi-cloud, the difficulty ramps up pretty quickly.
holoway
It's absolutely AWS focused today - but one upside of the approach is that building the models is straightforward, because we can build a pipeline that goes from the upstream's specification, augments it with documentation and validation, etc. We'll certainly be expanding coverage.
ryanryke
Essentially. I'm not sure you could call it IaC specifically, but the same ideas apply.
Regarding lock-in: I don't necessarily think there is anything here stopping you from writing TF and importing objects. Conversely, SI is great at importing resources into its model.
The objects are essentially modeled in TypeScript on the back end, so support for other vendors is possible; it's just a question of whether they've been created yet. I'll let the SI folks dive into details there.
AOE9
I think that, as a professional services company, a certain workflow is imposed on you. For regular software engineering you'd just make the IaC/code deployable from the developer's machine, and/or on a pull request take the branch's code, deploy it, and post back a link to the PR.
ryanryke
We're really excited about what the future holds with SI. Feel free to ask any questions.
tietjens
I have been on a small journey to try to understand what SI is. I’ve read your blog posts, listened to the Changelog show with the CEO, watched some demos and joined the Discord. But I still don’t understand what a 1:1 digital twin means. You are mirroring AWS’s api? Can you help me grok what 1:1 means concretely?
holoway
You should check out the site again today - I think it will help at least at a high level of what it's like to use System Initiative today.
We didn't recreate the AWS API. Rather than think about it as the API calls, imagine it this way. You have a real resource, say an EC2 instance. It has tons of properties, like 'ImageId', 'InstanceType', or 'InstanceId'. Over the lifetime of that EC2 instance, some of those properties might change, usually because someone takes action on that instance - say to start, stop, or restart it. That gets reflected in the 'state' of the resource. If that resource changes, you can look at the state of it and update the resource (in what is a very straightforward operation most of the time.)
The 'digital twin' (what we call a component) is taking that exact same representation that AWS has, and making a mirror of it. Imagine it like a linked copy. Now, on that copy, you can set properties, propose actions, validate your input, apply your policy, etc. You can compare it to the (constantly evolving, perhaps) state of the real resource.
So we track the changes you make to the component, make sure they make sense, and then let you review everything you (or an AI agent) are proposing. Then when it comes time to actually apply those changes to the world, we do that for you directly.
A few other upsides of this approach. One is that we don't care how a change happens. If you change something outside of System Initiative, that's fine - the resource can update, and then you can look at the delta and decide if it's beneficial or not. Because we track changes over time, we can do things like replay those changes into open change sets - basically making sure any proposed changes you are making are always up to date with the real world.
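The "replay changes into open change sets" idea above can be sketched in a few lines. This is a hedged illustration under my own assumptions, not SI's code: drift from the real resource is absorbed into the model, except where a pending proposal already overrides that property, so an open change set stays current with the world.

```typescript
// Illustrative drift replay: start from the latest real state, then re-apply
// pending proposed overrides on top. Names are mine, not SI's.
type Props = Record<string, string | number>;

function replayDrift(real: Props, proposedOverrides: Props): Props {
  return { ...real, ...proposedOverrides };
}

// Someone resized the instance outside SI (drift), while we have an open
// proposal to bump Cpus to 4.
const latestReal: Props = { InstanceType: "t3.large", Cpus: 2 };
const pending: Props = { Cpus: 4 };

console.log(replayDrift(latestReal, pending));
// { InstanceType: "t3.large", Cpus: 4 } -> proposal preserved, drift absorbed
```

The useful property is that a change made outside the system doesn't invalidate an open proposal; it just shows up as a new delta you can review.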
ryanryke
Feel free to reach out and I can show you.
The way I think about it is like this:
We want a representation that is as close as possible to what actually is in AWS. That way any proposed changes have a high probability of success when they are applied. SI's approach keeps an extremely up to date representation of what's in AWS.
Why do we need a representation rather than going directly to the AWS API? Among other things, going direct removes the ability to review changes before they are applied. The representation gives us a safety net, if you will.
tietjens
Is this representation made available to SI users? Do I have a clear overview of it? I've accepted that it isn't API calls.
stackskipton
Ops type (DevOps/SRE/Sysadmin/whatever you want to call me) here, so I was really interested, but this blog left me with more questions than answers.
What is SI? Homegrown GUI Terraform? That part is not clear in the article. It looks like homegrown GUI Terraform with modules, so that's what I'm going with. Cool, glad you got that working; sounds like a big project and you were able to pull it off.
However, this part confused me, "Our engineers were investing a lot of time in what felt like “IaC limbo,” making a change in a Terraform file, waiting for review, waiting for CI/CD to run, and only then finding out if it worked. A simple tweak to a networking rule could take hours to validate."
What in tarnation are you doing? Do you have a massive Terraform repo, so the apply takes forever because the plan runs forever? Talk to me, Goose, what is going on that Terraform changes take hours to run? Our worst folder takes about 10 minutes to plan because it's a massive "everything for this specific project". We also let people run tofu plan/apply from their laptops in Dev, so feedback is instant.
We do have folders that depend on other folders; for example, we can't set up Azure Kubernetes without the network being in place, but we just keep a depends-on YAML that our CI/CD pipelines work off when doing a full rollout, which is not their normal mode of operation (it's for DR only). We also assume that people have not been doing ClickOps, or if they have, they take responsibility for letting IaC resolve it.
Writing your own API calls to a cloud provider is not something I would wish upon anyone. I did it for a Prometheus HTTP service discovery system, and just getting the data was difficult; I can't imagine Create/Update/Delete.