I'll think twice before using GitHub Actions again
283 comments
· January 20, 2025 · arghwhat
Tainnor
The reason why many CI configs devolve into such a mess isn't typically that they don't extract complicated logic into scripts, it's about all the interactions with the CI system itself. This includes caching, sharing of artifacts, generating reports, configuring permissions, ordering of jobs, deciding when which jobs will run, deciding what to do when jobs fail, etc. All of this can get quite messy in a large enough project.
pydry
It never becomes unbearably messy this way though.
The reason it gets unbearably messy is because most people google "how to do x in github actions" (e.g. send a slack message) and there is a way and it's almost always worse than scripting it yourself.
SOLAR_FIELDS
The reason it gets unbearably messy is that GitHub has constructed an ecosystem that encourages developers to write Turing complete imperative behavior into YAML without providing the same language constructs/tooling that a proper adult language provides to encourage code reuse and debugging.
Without tooling like this any sufficiently complex system is guaranteed to evolve into a spaghetti mess, because no sane way exists to maintain such a system at scale without proper tooling, which one would need to hand roll themselves against a giant, ever changing mostly undocumented black box proprietary system (GitHub Actions). Someone tried to do this, the project is called “act”. The results are described by the author in the article as “subpar”.
The only sane way to use GitHub Actions at scale is to take the subset of its features that you need to perform the execution (event triggers, runs-on, etc) and only use those features, and farm out all the rest of the work to something that is actually maintainable, eg Buildkit, Bazel, Gradle etc
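Concretely, that split might look something like this: a minimal sketch where the YAML only handles triggers and the runner, and the build command is a placeholder for whatever tool you actually use.

    name: ci
    on:
      pull_request:
      push:
        branches: [main]
    jobs:
      build:
        runs-on: ubuntu-latest
        steps:
          - uses: actions/checkout@v4
          # everything else lives in the build tool / script, not in YAML
          - run: bazel test //...   # or ./ci.sh, make ci, gradle build, ...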
DanielHB
Caching and sharing artifacts is usually the main culprit. My company has been using https://nx.dev/ for that. It works locally as well as in CI, and it just works.
Our NX is pointed to store artifacts in GHA, but our GHA scripts don't do any caching directly, it is all handled by NX. It works so well I would even consider pulling a nodejs environment to run it in non-nodejs projects (although I haven't tried, probably would run into some problems).
It is somewhat heavy on configuration, but it just moves the complexity from CI configuration to NX configuration (which is nicer IMO). Our CI pipelines are super fast if you don't hit one of our slow-compiling parts of the codebase.
With NX your local dev environment can pull cached items that were built from previous CI runs or by other devs. We have some native C++ dependencies that are kind of a pain to build locally; our dev machines can pull the binaries built by other devs (since all devs and CI share the same cache-artifact storage). So it makes developing locally a lot easier as well. I don't even remember the last time I had to build the native C++ stuff myself since I don't work on it.
anbotero
Do you know the criteria used to pick nx.dev? That is, do you pay for their Cloud, or do you do some plumbing yourselves to make it work on GitHub and other things?
Looks interesting. We’ve picked tools based on time saved without too much extra knowledge or overhead required, so this may prove promising.
hinkley
I’ll go so far as to say the massive add-on/plugin list and featuritis of CI/CD tools is actively harmful to the sanity of your team.
The only functionality a CI tool should be providing is:
- starting and running an environment to build shit in
- accurately tracking success or failure
- accurate association of builds with artifacts
- telemetry (either their own or integration) and audit trails
- correlation with project planning software
- scheduled builds
- build chaining
That’s a lot, but it’s a lot less than any CI tool made in the last 15 years does, and that’s enough.
There’s a big difference for instance between having a tool that understands Maven information enough to present a build summary, and one with a Maven fetch/push task. The latter is a black box you can’t test locally, and your lead devs can’t either, so when it breaks, it triggers helplessness.
If the only answer to a build failure is to stare at config and wait for enlightenment, you fucked up.
0xbadcafebee
100%. The ci/cd job should be nothing more than a wrapper around the actual logic which is code in your repo.
I write a script called `deploy.sh` which is my wrapper for my ci/cd jobs. It takes options and uses those options to find the piece of code to run.
The ci/cd job can be parameterized or matrixed. The eventually-run individual jobs have arguments, and those are passed to deploy.sh. Secrets/environment variables are set from the ci/cd system, also parameterized/matrixed (or alternately, a self-hosted runner can provide deploy.sh access to a vault).
End result: from my laptop I can run `deploy.sh deploy --env test --modules webserver` to deploy the webserver to test, and the CI/CD job also runs the same job the same way. The only thing I maintain that's CI/CD-specific is the GitHub Action-specific logic of how to get ready to run `deploy.sh`, which I write once and never change. Thus I could use 20 different CI/CD systems, but never have to refactor my actual deployment code, which also always works on my laptop. Vendor lock-in is impossible, thanks to a little abstraction.
(If you have ever worked with a team with 1,000 Jenkins jobs and the team has basically decided they can never move off of Jenkins because it would take too much work to rewrite all the jobs, you'll understand why I do it this way)
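A rough sketch of that wrapper pattern (the module names, environment, and secret name here are all hypothetical):

    jobs:
      deploy:
        runs-on: ubuntu-latest
        strategy:
          matrix:
            module: [webserver, worker]   # hypothetical module list
        env:
          DEPLOY_TOKEN: ${{ secrets.DEPLOY_TOKEN }}   # assumed secret name, supplied by the CI system
        steps:
          - uses: actions/checkout@v4
          # the workflow only maps CI inputs onto deploy.sh arguments
          - run: ./deploy.sh deploy --env test --modules ${{ matrix.module }}

The same `deploy.sh deploy --env test --modules webserver` invocation works from a laptop, which is the whole point.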
hardwaresofton
Hey if you’ve never heard of it consider using just[0], it’s a better makefile and supports shell scripting explicitly (so at least equivalent in power, though so is Make)
chubot
The shell also supports shell scripting! You don't need Just or Make
Especially for Github Actions, which is stateless. If you want to reuse computation within their VMs (i.e. not do a fresh build / test / whatever), you can't rely on Just or Make
A problem with Make is that it literally shells out, and the syntax collides. For example, the PID in Make is $$$$, because it's $$ in shell, and then you have to escape $ as $$ with Make.
I believe Just has similar syntax collisions. It's fine for simple things, but when it gets complex, now you have {{ just vars }} as well as $shell_vars.
It's simpler to "just" use shell vars, and to "just" use shell.
Shell already has a lot of footguns, and both Just and Make only add to that, because they add their own syntax on top, while also depending on shell.
0xbadcafebee
Thank you, I have seen it, but I prefer Make.
OldOneEye
I discovered Just with a similar comment in Hacker News and I want to add my +1.
It is so much better to run scripts with Just than it is doing it with Make. And although I frankly tend to prefer using a bash script directly (much as described by the parent commenter), Just is much less terrible than Make.
Now the only problem is convincing teams to stop following the Make dogma, because it is so massively ingrained and it has so many problems and weirdnesses that just don't add anything if you just want a command executor.
The PHONY stuff, the variable escaping, the every-line-is-a-separate-shell, and just a lot of stuff that doesn't help at all.
jicea
I don't understand why this is not the obvious approach for everyone writing GitHub Actions/GitLab CI/CD yaml etc.
I've struggled in some teams to explain why it's better to extract your commands into scripts (you can run ShellCheck on them, scripts are simple to run locally, etc.) instead of writing a Frankenstein of YAML and shell commands. I hope someday to find an authoritative guideline on writing pipelines that promotes this approach, so at least I can point to that link instead of defending myself as a dinosaur!
mrweasel
In a previous job we had a team tasked with designing these "modern" CI/CD pipeline solutions, mostly meant for Kubernetes, but it was supposed to work for everything. They had such a hard-on for tools that would run each step as a separate isolated task and did not want pipelines to "devolve" into shell scripts.
Getting anything done in such environments is just a pain. You spend more time fighting the systems than you do actually solving problems. It is my opinion that a CI/CD system needs just the following features: triggers (source code repo, http endpoints or manually triggered), secret management and shell script execution. That's it, you can build anything using that.
eddd-ddde
I think what they really wanted was something like bazel. The only real benefit I can think of right now for not "devolving" into shell scripts is distributed caching with hermetic builds. It has very real benefits but it also requires real effort to work correctly.
datavirtue
I just joined as the enterprise architect for a company that has never had one. There is an existing devops team that is making everyone pull their hair out and I haven't had a single spare minute to dig in on their mess, but this sounds eerily familiar.
chrisweekly
Mostly agreed, but (maybe orthogonal) IME, popular CI/CD vendors like TeamCity* can make even basic things like shell script execution problematic.
* TC offers sh, full stop. If you want to script something that depends on bash, it's a PITA and you end up with a kludge to run bash in sh in docker in docker.
arghwhat
My "favorite" is when I see people go all in, writing thousands of lines of Jenkins-flavor Groovy that parses JSON build specifications of arbitrary complexity to sort out how to build that particular project.
"But then we can reuse the same pipeline for all our projects!"
ozim
I think that is a pitfall of software devs.
For me it was an epiphany as a software dev: stop writing reusable, extensible scripts. I am so much more productive after that.
baby_souffle
> "But then we can reuse the same pipeline for all our projects!"
oh god just reading that gave me PTSD flashbacks.
At $priorGig there was the "omni-chart". It was a helm chart that was so complex it needed to be wrapped in terraform and used composable terraform modules w/ user var overrides as needed.
Debugging anything about it meant clearing your calendar for the day and probably the following day, too.
lowercased
I can rarely reuse the same pipeline for the same project 6 months down the road, much less reuse for anything else.
The few bits that end up getting reused are the externalized bash scripts.
0xbadcafebee
I think I can summarize it in a rough, general way.
CI/CD is a method to automate tasks in the background that you would otherwise run on your laptop. The output of the tasks is used as quality gates for merging commits, and for deployments.
- Step 1. Your "laptop in the cloud" requires some configuration (credentials, installed software, cached artifacts) before a job can be run.
  - Requires logic specific to the CI/CD system
- Step 2. Running many jobs in parallel, passing data from step to step, etc. requires some instructions.
  - Requires logic specific to the CI/CD system
- Step 3. The job itself is the execution of a program (or programs), with some inputs and outputs.
  - Works the same on any computer (assuming the same software, environment, inputs, etc.)
  - Using a container in Step 1 makes this practical and easy
- Step 4. After the job finishes, artifacts need to be saved, results collected, and notifications sent.
  - Some steps are specific to the CI/CD system, others can be a reusable job
Step 3 does not require being hard-coded into the config format of the CI/CD system. If it is instead just executable code in the repo, it allows developers to use (and work on) the code locally without the CI/CD system being involved. It also allows moving to a different CI/CD system without ever rewriting all the jobs; the only things that need to be rewritten are the CI/CD-specific parts, which should be generic and apply to all jobs pretty much the same.
Moving the CI/CD-specific parts to a central library of configuration allows you to write some code once and reuse it many times (making it DRY). CircleCI Orbs, GitHub Actions, Jenkins Shared Libraries/Groovy Libraries, etc. are examples of these. Write your code once, fix a bug once, reuse it everywhere.
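A sketch of how those four steps might map onto a workflow (the container image, script name, shard count, and artifact path are placeholders):

    jobs:
      test:
        runs-on: ubuntu-latest
        container: node:20                    # Step 1: configure the "laptop in the cloud"
        strategy:
          matrix:
            shard: [1, 2]                     # Step 2: CI-specific parallelism
        steps:
          - uses: actions/checkout@v4
          - run: ./ci/test.sh --shard ${{ matrix.shard }}   # Step 3: plain code from the repo
          - uses: actions/upload-artifact@v4  # Step 4: CI-specific result handling
            with:
              name: test-results-${{ matrix.shard }}
              path: reports/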
maccard
To make the thing actually fast at scale, a lot of the logic ends up being specific to the provider; requiring tokens, artifacts etc that aren't available locally. You end up with something that tries to detect if you're running locally or in CI, and then you end up in exactly the same situation.
Lutger
You are right, and this is where a little bit of engineering comes in. Push as much of the logic to scripts (either shell or python or whatever) that you can run locally. Perhaps in docker, whatever. All the token, variables, artifacts etc should act as inputs or parameters to your scripts. You have several mechanisms at your disposal, command line arguments, environment variables, config files, etc. Those are all well understood, universal, language and environment agnostic, to an extent.
The trick is to NOT have your script depend on the specifics of the environment, but reverse the dependency. So replace all `If CI then Run X else if Local Run Y` with the ability to configure the script to run X or Y, then let the CI configure X and local configure Y. For example.
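Something like this, where every name and flag is made up:

    #!/usr/bin/env bash
    # build.sh: no "am I in CI?" branches, only parameters with sane local defaults.
    set -euo pipefail

    ARTIFACT_DIR="${ARTIFACT_DIR:-dist}"   # CI points this at its artifact directory; locally the default is fine
    PUSH_IMAGES="${PUSH_IMAGES:-false}"    # CI sets this to true on main; locally it stays false

    ./scripts/compile.sh --out "$ARTIFACT_DIR"   # placeholder for the real build command
    if [ "$PUSH_IMAGES" = "true" ]; then
      ./scripts/push-images.sh "$ARTIFACT_DIR"   # placeholder; only runs when the caller opts in
    fi

The CI config sets ARTIFACT_DIR and PUSH_IMAGES; your laptop doesn't have to.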
I'm not saying it is always easy and obvious. For bigger builds, you often really want caching and have shitloads of secrets and configurations going on. You want to only build what is needed, so you need something like a DAG. It can get complex fast. The trick is making it only as complex as it needs to be, and only as reusable as and when it is actually re-used.
harrall
A shell script has many extremely sharp edges like dealing with stdin, stderr, stdout, subprocesses, exit codes, environment variables, etc.
Most programmers have never written a shell script and writing CI files is already frustrating because sometimes you have to deploy, run, fix, deploy, run, fix, which means nobody is going to stop in the middle of that and try to learn shell scripting.
Instead, they copy commands from their terminal into the file and the CI runner takes care of all the rough edges.
I ALWAYS advise writing a shell script but I know it's because I actually know how to write them. But I guess that's why some people are paid more big bux.
eru
GitHub's CI yaml also accepts eg Python. (Or anything else, actually.)
That's generally a bit less convenient, ie it takes a few more lines, but it has significantly fewer sharp edges than your typical shell script. And more people have written Python scripts, I guess?
dpkirchner
This all reminds me of the systemd ini-like syntax vs shell scripts debate. Shell scripts are superior, of course, but they do require deeper knowledge of unix-like systems.
slt2021
yeah if you author CI jobs, you should know linux, otherwise a person should not even touch the CI system with a 10ft pole
eru
> [...] instead of writing a Frankenstein of YAML and shell commands.
The 'Frankenstein' approach isn't what makes it necessarily worse. Eg Makefiles work like that, too, and while I have my reservations about Make, it's not really because they embed shell scripts.
arccy
it can be quite hard to write proper scripts that work consistently... different shells have different behaviours, availability of local tools, paths, etc
and it feels like fighting against the flow when you're trying to make it reusable across many repos
akdev1l
Containerize the build environment so everything is captured (dependencies, build tools, etc)
eru
Pick a single shell and treat it like a programming language.
Or write your stuff in eg Python in the first place. GitHub's CI yaml supports scripts in arbitrary languages, not just shell.
ukoki
> My policy is to never let pipeline DSLs contain any actual logic outside orchestration for the task,
I call this “isomorphic CI” — ie: as long as you set the correct env vars, it should run identically on GitHub actions, Jenkins, your local machine, a VM etc
reactordev
This is the only DevOps way. Abstract the build into a single step.
vvillena
And yet, you would be surprised at the amount of people who react like that's an ignorant statement ("not feasible in real world conditions"), a utopian goal ("too much time to implement"), an impossible feat ("automation makes human oversight difficult"), or, my favorite, the "this is beneath us" excuse ("see, we are special and this wouldn't work here").
Automation renders knowledge into a set of executable steps, which is much better than rendering knowledge into documentation, or leaving it to rot in people's minds. Compiling all rendered knowledge into a single step is the easiest way to ensure all elements around the build and deployment lifecycle work in unison and are guarded around failures.
jamesfinlayson
Yep. I remember at a previous company multiple teams had manually created steps in TeamCity (and it wasn't even being backed up in .xml files).
I just did my own thing and wrapped everything in deploy.sh and test.sh, and when the shift to another system came... well it was still kind of annoying, but at least I wasn't recreating the whole thing.
NilMostChill
i like this term
alkonaut
That’s usually very hard or impossible for many things. The AzDo yaml consists of a lot of steps that are specific to the CI environment (fetching secrets, running tests on multiple nodes, storing artifacts of various kinds).
Even if the ”meat” of the script is a single build.ps oneliner, I quickly end up with 200 line yaml scripts which have no chance of working locally.
arghwhat
Azure DevOps specifically has a very broken approach to YAML pipelines, because they effectively took their old graphical pipeline builder and just made a YAML representation of it.
The trick to working with this is that you don't need any of their custom Azure DevOps task types, and can use the shell type (which has a convenient shorthand) just as well as in any other CI environment. Even the installer tasks are redundant - in other CI systems, you either use a container image with what you need, or install stuff at the start, and Azure DevOps works with both of these strategies.
So no, it's neither hard nor impossible, but Microsoft's half-assed approach to maintaining Azure DevOps and overall overcomplicated legacy design makes it a bit hard to realize that doing what their documentation suggests is a bad idea, and that you can use it in a modern way just fine. At least their docs do not recommend that you use the dedicated NPM-type task for `npm install` anymore...
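For reference, the plain-script route in Azure DevOps YAML is only a few lines (the pool and script path are placeholders):

    trigger:
      - main
    pool:
      vmImage: ubuntu-latest
    steps:
      - checkout: self
      - script: ./ci/build.sh        # `script` is the shorthand shell step; no NPM/Maven/etc task types needed
        displayName: Build and test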
(I could rant for ages about Azure DevOps and how broken and unloved it is from Microsoft's side. From what I can tell, they're just putting in the minimum effort to keep old Enterprise customers that have been there through every rename since Team Foundation Server from jumping ship - maybe just until Github's enterprise side has matured enough? Azure DevOps doesn't even integrate well with Azure, despite its name!)
noen
It has been on life support for a long time AFAIK. I designed Visual Studio Online (the first launch of AzDO) - and every engineer, PM, and executive I worked with is either in leadership at GitHub or retired.
mplanchard
This doesn’t seem to address the parent comment’s point at all, which was about required non-shell configuration such as for secrets, build parallelism, etc.
rtpg
The actual subtle issue here is that sometimes you actually need CI features around caching and the like, so you are forced to engage with the format a bit.
You can, of course, chew it down to a bare minimum. But I really wish more CI systems would just show up with "you configure us with scripts" instead of the "declarative" nonsense.
Guvante
CI that isn't running on your servers wants a very deep understanding of how your process works so the provider can minimize their costs (this is true whether or not you pay for using CI)
rtpg
Totally! It's a legitimate thing! I just wish that I had more tools for dynamically providing this information to CI so that it could work better but I could also write relatively general tooling with a general purpose language.
The ideal for me is (this is very silly and glib and a total category error) LSP but for CI. Tooling that is relatively normalized, letting me (for example) have a pytest plugin that "does sharding" cleanly across multiple CI operators.
There's some stuff and conventions already of course, but in particular caching and spinning up jobs dynamically are still not there.
lloeki
I'm increasingly designing CI stuff around rake tasks. Then I run rake in the workflow.
But that caters only for each individual command... as you mention the orchestration is still coded in, and duplicated from what rake knows and would do.
So I'm currently trying stuff that has a pluggable output: one output (the default) is that it runs stuff, but with just a rake var set, instead of generating then running commands, it generates workflow content that ultimately gets merged into an ERB workflow template.
The model I like the most though is Nix-style distributed builds: it doesn't matter if you do `nix build foo#bar` (local) or `nix build -j0 foo#bar` (zero local jobs => use a remote builder†), the `foo#bar` "task" and its dependents gets "built" (a.k.a run).
† builders get picked matching target platform and label-like "features" constraints.
Ever since there has been gitlab-runner, I've wondered why the hell I can't just submit some job to a (list of) runner(s) - some of which could be local - without the whole push-to-repo+CI orchestrator. I mean, I don't think it would be out of this world to write a CLI command that locally parses whatever-ci.yml, creates jobs out of it, and submits them to a local runner.
benrutter
Oh boy, there's a special kind of hell I enter into every time I set up new github actions. I wrote a blog post a few months ago about my pain[0] but one of the main things I've found over the years is you can massively reduce how horrible writing github actions is by avoiding prebuilt actions, and just using it as a handy shell runner.
If you write behaviour in python/ruby/bash/hell-rust-if-you-really-want and leave your github action at `run: python some/script.py` then you'll have something that's much easier to test locally, and save yourself a lot of pain, even if you wind up with slightly more boilerplate.
riperoni
At this point, just pause with GitHub Actions and compare it to how GitLab handles CI.
Much more intuitive, taking shell scripts and other script commands natively and not devolving into a mess of obfuscated typescript-wrapped actions that need a shit ton of dependencies.
Aeolun
The problem with Gitlab CI is that now you need to use Gitlab.
I’m not even sure when I started feeling like that was a bad thing. Probably when they started glueing a bunch of badly executed security crud onto the main product.
lolinder
The earliest warning sign I had for GitLab was when they eliminated any pricing tier below their equivalent of GitHub's Enterprise tier.
That day, they very effectively communicated that they had decided they were only interested in serving Enterprises, and everything about their product has predictably degraded ever since, to the point where they're now branding themselves "the most comprehensive AI-powered DevSecOps Platform" with a straight face.
Espressosaurus
GitLab can't even show you more than a few lines of context without requiring you to manually click a bunch of times. Forget the CI functionality, for pull requests it's absolutely awful.
plagiarist
I decided it was a bad thing when they sent password reset emails to addresses given by unauthenticated users. Not that I ever used them. But now it is a hard no, permanently.
They have since had other also severe CVEs. That has made me feel pretty confident in my decision.
danillonunes
But you can do the same with GitHub, right? Although most docs and articles focus on 3rd party actions, nothing stops you from just running everything in your own shell script.
lolinder
Yes, you can, and we do at my current job. Much of the time it's not even really the harder approach compared to using someone else's action, it's just that the existence of third party actions makes people feel obliged to use them because they wouldn't want to be accused of Not Invented Here Syndrome.
arccy
if anything, gitlab's ci seems even worse...
Imustaskforhelp
theoretically we could also use https://just.systems/ or https://mise.jdx.dev/ instead of directly calling gh actions, but I haven't tried gh actions personally yet. If it's really the nightmare you are saying, then that's sad.
jbaber
I had this idea the other day when dealing with CI and thought it must be dumb because everyone's not already doing it. It would make your CI portable to other runners in future, too.
spooneybarger
A lot of folks in this thread are focusing on the monorepo aspect of things. The "Pull request and required checks" problem exists regardless of monorepo or not.
GitHub Actions allows you to only run checks if certain conditions are met, like "only lint markdown if the PR contains *.md files". The moment you decide to use such rules, you have the "Pull request and required checks" problem. No "monorepo" required.
GitHub required checks at this time can be used with external services where GitHub has no idea what might run. For this reason, required checks HAVE to pass. There's no "if it runs" step. A required check on an external service might never run, or it might be delayed. Therefore, if GH doesn't have an affirmation that it passed, you can't merge.
It would be wonderful if, for jobs that run on GH where GH can know whether the action is supposed to run, required checks could mean "require all these checks if they will be triggered".
I have encountered this problem on every non-trivial project I use with GitHub actions; monorepo or not.
saxonww
This isn't really the problem, though. This is an easy problem to solve; the real problem is that it costs money to do so.
Also: I'm not asserting that the below is good, just that it works.
First, don't make every check a required check. You probably don't need to require that linting of your markdown files passes (maybe you do! it's an example).
Second, consider not using the `on:<event>:paths`, but instead something like `dorny/paths-filter`. Your workflow now runs every time; a no-op takes substantially less than 1 minute unless you have a gargantuan repo.
Third, make all of your workflows have a 'success' job that just runs and succeeds. Again, this will take less than 1 minute.
At this point, a no-op is still likely taking less than 1 minute, so it will bill at 1 minute, which is going to be $.008 if you're paying.
Fourth, you can use `needs` and `if` now to control when your 'success' job runs. Yes, managing the `if` can be tricky, but it does work.
We are in the middle of a very large migration into GitHub Actions from a self-hosted GitLab. It was something we chose, but also due to some corporate choices our options were essentially GitHub Actions or a massive rethink of CI for several dozen projects. We have already moved into code generation for some aspects of GitHub Actions code, and that's the fifth and perhaps final frontier for addressing this situation. Figure out how to describe a graph and associated completion requirements for your workflow(s), and write something to translate that into the `if` statements for your 'success' jobs.
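A compressed sketch of steps two through four (the filter name, paths, and test script are made up; dorny/paths-filter and the `needs.<job>.result` check are the actual mechanisms):

    jobs:
      changes:
        runs-on: ubuntu-latest
        outputs:
          app: ${{ steps.filter.outputs.app }}
        steps:
          - uses: actions/checkout@v4
          - uses: dorny/paths-filter@v3
            id: filter
            with:
              filters: |
                app:
                  - 'web-app1/**'
      test:
        needs: changes
        if: needs.changes.outputs.app == 'true'
        runs-on: ubuntu-latest
        steps:
          - uses: actions/checkout@v4
          - run: ./ci/test.sh web-app1          # placeholder test entry point
      success:
        # make only this job the required check; it runs even when `test` is skipped
        needs: [test]
        if: always()
        runs-on: ubuntu-latest
        steps:
          - run: |
              [ "${{ needs.test.result }}" = "success" ] || [ "${{ needs.test.result }}" = "skipped" ]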
p1necone
There's a workaround for the 'pull request and required check' issue. You create an alternative 'no op' version of each required check workflow that just does nothing and exits with code 0 with the inverse of the trigger for the "real" one.
The required check configuration on github is just based off of job name, so either the trigger condition is true, and the real one has to succeed or the trigger condition is false and the no op one satisfies the PR completion rules instead.
It seems crazy to me that such basic functionality needs such a hacky workaround, but there it is.
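Sketched out, it's two workflow files whose job names match (names and paths are illustrative):

    # lint-markdown.yml: the real check
    name: lint-markdown
    on:
      pull_request:
        paths: ['**.md']
    jobs:
      markdown-lint:
        runs-on: ubuntu-latest
        steps:
          - uses: actions/checkout@v4
          - run: ./ci/lint-markdown.sh      # placeholder lint script

    # lint-markdown-noop.yml: same job name, inverse trigger, always green
    name: lint-markdown
    on:
      pull_request:
        paths-ignore: ['**.md']
    jobs:
      markdown-lint:
        runs-on: ubuntu-latest
        steps:
          - run: echo "no markdown changed, nothing to lint"

Since branch protection only sees the job name, either file satisfies the required check.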
dayjaby
Or you can just check if the step was skipped. I don't get the point of the article.
Managing a monorepo with acyclic dependencies is super easy: dorny's paths-filter in one job, and the other jobs check
1. whether their own path or any dependency's path got changed, and
2. whether all dependency jobs were either successful or skipped.
Done. No need to write an article.
nunez
Posts like this make me miss Travis. Travis CI was incredible, especially for testing CI locally. (I agree with the author that act is a well done hack. I've stopped using it because of how often I'd have something pass in act and fail in GHA.)
> GitHub doesn't care
My take: GitHub only built Actions to compete against GitLab CI, as built-in CI was taking large chunks of market share from them in the enterprise.
sureIy
To be fair, GitHub also charges for Actions minutes and storage, so it's one of the few pieces that do generate revenue.
homebrewer
Woodpecker supports running jobs on your own machine (and conveniently provides a command to do that for failed jobs), uses the same sane approach of passing your snippets to the shell directly (without using weird typescript wrappers), and is pluggable into all major forges, GitHub included.
chubot
How so? I don’t recall this, and I used Travis, and then migrated to GitHub actions.
As far as I can tell, they are identical as far as testing locally. If you want to test locally, then put as much logic in shell scripts as possible, decoupled from the CI.
ryanisnan
One really interesting omission to this post is how the architecture of GitHub actions encourages (or at the very least makes deceivingly easy) making bad security decisions.
Common examples are secrets. Organization or repository secrets are very convenient, but they are also massive security holes just waiting for unsuspecting victims to fall into.
Repository environments have the ability to have distinct secrets, but you have to ensure that the right workflows can only access the right environments. It's a real pain to manage at scale.
Being able to `inherit` secrets also is a massive footgun, just waiting to leak credentials to a shared action. Search for and leak `AWS_ACCESS_KEY_ID` anyone?
Cross-repository workflow triggering is also a disaster, and in some circumstances you can abuse the differences in configuration to do things the source repository didn't intend.
Other misc. things about GHA are also cool in theory, but fall down in practice. One example is the wait-timer concept of environments. If you have a multi-job workflow using the same environment, wait-timer applies to EACH JOB in the environment. So if you have a build-and-test workflow with 2 jobs, one for build and one for test, each job will wait `wait-timer` before it executes. This makes it impossible to use this feature for things like multi-environment deployment pipelines, unless you refactor your workflows.
Overall, I'd recommend against using GHA and looking elsewhere.
mdaniel
> Search for and leak `AWS_ACCESS_KEY_ID` anyone?
Well that's just someone being a dumbass, since AssumeRoleWithWebIdentity (and its Azure and GCP equivalent) have existed for quite a while. It works flawlessly and if someone does do something stupid like `export HURP_DURP=$AWS_ACCESS_KEY_ID; printenv` in a log, that key is only live for about 15 minutes so the attacker better hurry
Further, at least in AWS and GCP (I haven't tried such a thing in Azure) one can also guard the cred with "if the organization and repo are not ..." then the AssumeRole 403s, to ensure that my-awesome-org/junior-dev-test-repo doesn't up and start doing fun prod stuff in GHA
I hate GHA probably more than most, but one can footgun themselves in any setup
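For reference, the OIDC route in a workflow is only a few lines (the role ARN and region are placeholders; aws-actions/configure-aws-credentials performs the AssumeRoleWithWebIdentity call):

    permissions:
      id-token: write        # let the job request a GitHub OIDC token
      contents: read
    jobs:
      deploy:
        runs-on: ubuntu-latest
        steps:
          - uses: aws-actions/configure-aws-credentials@v4
            with:
              role-to-assume: arn:aws:iam::123456789012:role/gha-deploy   # placeholder role
              aws-region: us-east-1
          - run: aws sts get-caller-identity    # short-lived creds; no stored AWS_ACCESS_KEY_ID anywhere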
joshstrange
I never knew how easy it was to setup role assuming for AWS/GHA. It’s much easier than managing the access/secret.
I wrote a little about it in this blog post: https://joshstrange.com/2024/04/26/nightly-postgres-backups-...
mdaniel
If you, or others, are interested I have found that those role-session-name variables make for a great traceability signal when trying to figure out what GHA run is responsible for AWS actions. So instead of
role-session-name: GitHubActionSession
one can consider role-session-name: gha-${{ github.run_id }} # or your favorite
I don't this second recall what the upper limit is on that session name, so you may be able to fit quite a bit of stuff in there
ryanisnan
Great points. I totally agree, don't use hard-coded static creds, especially here. But in reality, many services and/or API keys don't support OIDC or short-lived credentials, and the design of secrets in GitHub promotes using them, in my opinion.
junto
Whilst I do detest much of Azure DevOps, one thing I do like about their pipelines is that we can use service connections and Azure key vaults to secure pipeline tasks that require credentials to be managed securely.
wrboyce
While I do agree with you regarding encouraging bad secret management practices, one fairly nice solution I’ve landed on is using terraform to manage such things. I guess you could even take it a step further to have a custom lint step (running on GHA, naturally) that disallows secrets configured in a certain manner and blocks a deploy (again, on GHA) on failure.
I guess what I’m saying is, it’s GHA all the way down.
maccard
What’s your suggestion for not-GHA?
ripped_britches
My man/woman - you gotta try buildkite. It’s a bit more setup since you have to interface with another company, more API keys, etc. But when you outgrow GH actions, this is the way. Have used buildkite in my last two jobs (big US tech companies) and it has been the only pleasant part of CI.
habosa
+1
I've use Jenkins, Travis, Circle, Cirrus, GitHub Actions, and Buildkite. Buildkite is leagues ahead of all of the others. It's the only enjoyable CI system I've used.
cjk
This is indeed the way.
bramblerose
In the end, this is the age-old "I built my thing on top of a 3rd party platform, it doesn't quite match my use case (anymore) and now I'm stuck".
Would GitLab have been better? Maybe. But chances are that there is another edge case that is not handled well there. You're in a PaaS world, don't expect the platform to adjust to your workflow; adjust your workflow to the platform.
You could of course choose to "step down" (PaaS to IaaS) by just having a "ci" script in your repo that is called by GA/other CI tooling. That gives you immense flexibility but also you lose specific features (e.g. pipeline display).
thayne
The problem is that your "ci" script often needs some information from the host system, like what is the target git commit? Is this triggered by a pull request, or a push to a branch? Is it triggered by a release? And if so, what is the version of the release?
IME, much of the complexity in using Github Actions (or Gitlab CI, or Travis) is around communicating that information to scripts or build tools.
That and running different tasks in parallel, and making sure everything you want passes.
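One way to keep that glue thin is to translate the CI-provided context into your own variable names in a single place and hand them to the script (the BUILD_* names are made up; github.sha, github.event_name, and github.ref_name are the real contexts):

    steps:
      - uses: actions/checkout@v4
      - run: ./ci/build.sh
        env:
          BUILD_COMMIT: ${{ github.sha }}
          BUILD_TRIGGER: ${{ github.event_name }}   # pull_request, push, release, ...
          BUILD_VERSION: ${{ github.ref_name }}     # branch or tag name

The script only ever reads BUILD_*, so the same script runs locally with those variables set by hand (or defaulted).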
perlgeek
> Would GitLab have been better?
My impression of gitlab CI is that it's also not built for monorepos.
(I'm a casual gitlab CI user).
dezgeg
I'm not sure if there's a monorepo vs polyrepo difference; just that anything complex is pretty painful in gitlab. YAML "programming" just doesn't scale.
Hamuko
Doesn't everything in GitLab go into a single pipeline? GitHub at least makes splitting massive CI/CD setups easier by allowing you to write them as separate workflows that are separate files.
dijksterhuis
> GitHub at least makes splitting massive CI/CD setups easier by allowing you to write them as separate workflows that are separate files.
this makes me feel like you’re really asking “can i split up my gitlab CICD yaml file or does everything need to be in one file”.
if that’s the case:
yes it does eventually all end up in a single pipeline (ignoring child pipelines).
but you can split everything up and then use the `include` statement to pull it all together in one main pipeline file which makes dealing with massive amounts of yaml much easier.
https://docs.gitlab.com/ee/ci/yaml/includes.html
you can also use `include` to pull in a yaml config from another project to add things like SAST on the fly.
previous workplace i had like 4 CICD template repos and constructed all 30 odd actual build repos from those four templates.
used `include` to pull in some yaml template jobs, which i made run by doing something like (it’s been a while, might get this wrong)

    include:
      project: 'cicd/templates'
      file: 'builds.yml'

    stages:
      - build

    job_a:
      stage: build
      extends: .job_a_from_template
      variables:
        IMAGE_NAME: "myimage"
        IMAGE_REPO: "somerepo.org"

this doesn’t run anything for `job_b_from_template` … you just end up defining the things you want to run for each case, plus any variables you need to provide / override. you can also override stuff like rules on when it should run if you want to. which is handy.
gitlab CICD can be really modular when you get into it.
if that wasn’t the case: on me.
edit: switched to some yaml instead of text which may or may not be wrong. dunno. i have yet to drink coffee.
dijksterhuis
addendum you can also do something like this, which means you don’t have to redefine every job in your main ci file, just define the ones you don’t want to run
    include:
      project: 'cicd/templates'
      file: 'builds.yml'

    variables:
      IMAGE_NAME: something
      IMAGE_REPO: some.org

    job_b:
      rules:
        - when: never

where the template you import has a job_a and job_b definition. both get pulled in, but job_b gets overwritten so it never runs.

less useful when just splitting things into multiple files to make life simpler.
super useful when using the same templates across multiple independent repositories to make everything build in as close to the same way as possible.
dezgeg
You can have pipelines trigger child pipelines in gitlab, but usability of them is pretty bad, viewing logs/results of those always needs extra clicking.
tevon
I call writing GitHub Actions "Search and Deploy": constantly pushing to a branch to get an action to run is a terrible pattern...
You'd think, especially with the deep VS Code integration, they'd have at least a basic sanity-check locally, even if not running the full pipeline.
8n4vidtmkvmk
Not just me then? I was trying to fix a GitHub action just today but I have no clue how I'm supposed to test it, so I just keep making tiny changes and pushing... Not a good system, but I'm still within the free tier so I'm willing to put up with it I guess.
masklinn
I think it’s everyone, debugging GH actions is absolute hell, and it gets terrifying when the action interacts with the world (e.g. creating and deploying packages to a registry).
oefrha
> it gets terrifying when the action interacts with the world (e.g. creating and deploying packages to a registry).
To be fair, testing actions with side effects on the wider world is terrifying even if you’re running it locally, maybe more so because your nonstandard local environment may have surprises (e.g. an env var you set then forgot) while the remote environment mostly only has stuff you set/installed explicitly, and you can be sloppier (e.g. accidentally running ./deploy when you wanted to run ./test). That part isn’t a GH Actions problem.
arccy
git commit --allow-empty -m "bump ci"
unless your pipeline does magic with trying to detect changed files
mdaniel
If this is to troubleshoot non-code related failures (perm issues, connection timed out, whatever influences success that doesn't require a code change) then surely the repo's history would benefit from one just clicking "Re-run Job", or its equivalent $(gh ...) invocation, right?
8n4vidtmkvmk
I use Mercurial + hg-git like a weirdo. Not sure if Mercurial supports empty commits, I don't think it does.
pavon
Ah yes, I have a git alias created specifically for the "we don't know what it does until we push it" world of CI:
> yolo = "!git commit --all --amend --no-edit && git push --force #"
nunez
Biggest pet peeve of GHA by a country mile.
hinkley
Re: monorepo
> In GitHub you can specify a "required check", the name of the step in your pipeline that always has to be green before a pull request is merged. As an example, I can say that web-app1 - Unit tests are required to pass. The problem is that this step will only run when I change something in the web-app1 folder. So if my pull request only made changes in api1 I will never be able to merge my pull request!
Continuous Integration is not continuous integration if we don’t test that a change has no deleterious side effects on the rest of the system. That’s what integration is. So if you aren’t running all of the tests because they’re slow, then you’re engaging in false economy. Make your tests run faster. Modern hardware with reasonable test runners should be able to whack out 10k unit tests in under a minute. The time to run the tests goes up by a factor of ~7-10 depending on framework as you climb each step in the testing pyramid. And while it takes more tests to cover the same ground, with a little care you can still almost halve the run time replacing one test with a handful of tests that check the same requirement one layer down, or about 70% moving down two layers.
One thing that’s been missing from most of the recent CI pipelines I’ve used is being able to see that a build is going to fail before the tests finish. The earlier the reporting of the failure the better the ergonomics for the person who triggered the build. That’s why the testing pyramid even exists.
ecosystem
This comment is way too far down the page.
If the unit tests are slow enough to want to skip them, they likely are not unit tests but some kind of service-level tests or tests that are hitting external APIs or some other source of a bad smell. If the slow thing is the build, then cache the artifact keyed off the directory contents so the step is fast if code is unchanged. If the unit tests only run for a package when the code changes, there is a lack of e2e/integration testing. So, what is OP's testing strategy? Caching? It seems like following good testing practices would make this problem disappear.
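With the stock cache action, that "cache the artifact keyed off the directory contents" step is roughly this (path and key are illustrative):

    - uses: actions/cache@v4
      with:
        path: web-app1/build                              # placeholder build output directory
        key: web-app1-${{ hashFiles('web-app1/**') }}     # hit the cache whenever the source is unchanged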
jcarrano
That is true for most cases, which nowadays is web and backend software. As you get into embedded or anything involving hardware things get slower and you need to optimize.
For example, tests involving real hardware can only run at 1x speed, so you will want to avoid running those if you can. If you are building a custom compiler toolchain, that is slow and you will want to skip it if the changes cannot possibly affect the toolchain.
maccard
I agree hardware should be that quick, but CI and cloud hardware is woefully underpowered unless you actively seek it out. I’ve also never seen a test framework spew out even close to that in practice. I’m not even sure most frameworks would do that with noop tests, which is sad.
hinkley
10 years ago my very testing-competent coworker had us running 4200 tests in 37 seconds. In NodeJS. We should be doing as well as that today without a gifted maintainer.
maccard
I've got an i9 and an NVMe drive. running npm test with 10k no-op tests takes 30 seconds, which is much quicker than I expected it to be (given how slow everything else in the node world is).
Running dotnet test on the other hand with 10k empty tests took 95 seconds.
Honestly, 10k no-op tests should be limited by disk IO, and in an ideal world would be 10 seconds.
rustd
Agreed, most of the CI tools don't help in getting feedback early to the developers. I shouldn't have to wait hours for my CI job to complete. Harness is a tool that can reduce build times by caching build artifacts and docker layers and only running the subset of tests that were impacted by the code change.
androa
GitHub (Actions) is simply not built to support monorepos. Square peg in a round hole and all that. We've opted for using `meta` to simulate monorepos, while being able to use GitHub Actions without too much downsides.
justin_oaks
Which makes me wonder if there is a way to simulate multiple repos while maintaining a mono repo. Or mirror a portion of a monorepo as a single repo.
Obviously this would be a real pain to implement just to fix the underlying problem, but it's an interesting (awful) solution
Imustaskforhelp
hey could you please share the `meta` tool you mentioned, sounds interesting! couldn't find it on the internet [skill issue]
joshka
Guessing it's https://github.com/mateodelnorte/meta googlefu "meta github repo"
Imustaskforhelp
hey thanks!
definitely interesting!
I do wonder if this really solves the author's problem, because by the looks of it you just have to run the meta command and it runs over each of the subdirectories. While at the same time, I think I like it because this is what I think people refer to as a "modular monolith".
Combining this with nats https://nats.io/ (hey, if you don't want it to be over the network, you could use nats with the memory model of your application itself to reduce any overhead) and essentially just get yourself a really modular monolith in which you can then separate things selectively (ahem, microservices) afterwards rather easily.
keybored
Why is this so difficult?
1. We apparently don’t even have a name for it. We just call it “CI” because that’s the adjacent practice. “Oh no the CI failed”
2. It’s conceptually a program that reports failure if whatever it is running fails and... that’s it
3. The long-standing principle of running “the CI” after merging is so backwards that that-other Hoare disparagingly called the correct way (guard “main” with a bot) The Not Rocket Science Principle or something. And that smug blog title is still used to this day (or “what bors does”)
4. It’s supposed to be configured declaratively but in the most gross way that “declarative” has ever seen
5. In the true spirit of centralization “value add”: the local option of (2) (report failure if failed) has to be hard or at the very least inconvenient to set up
I’m not outraged when someone doesn’t “run CI”.
keybored
> We apparently don’t even have a name for it. We just call it “CI” because that’s the adjacent practice. “Oh no the CI failed”
Martin Fowler apparently calls this “continuous build” (the build itself without CI necessarily). And that’s better.
> no way of running actions locally
My policy is to never let pipeline DSLs contain any actual logic outside orchestration for the task, relying solely on one-liner build or test commands. If the task is more complicated than a one-liner, make a script for it in the repo to make it a one-liner. Doesn't matter if it's GitHub Actions, Jenkins, Azure DevOps (which has super cursed yaml), etc.
This in turn means that you can do what the pipeline does with a one-liner too, whether manually, from a vscode launch command, a git hook, etc.
This same approach can fix the mess of path-specific validation too - write a regular script (shell, python, JS, whatever you fancy) that checks what has changed and calls the appropriate validation script. The GitHub action is only used to run the script on PR and to prepare the CI container for whatever the script needs, and the same pipeline will always run.
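A sketch of such a dispatcher script (the paths, script names, and BASE_REF variable are placeholders):

    #!/usr/bin/env bash
    # ci/run-checks.sh: run only the validation that the diff actually touches.
    set -euo pipefail

    base="${BASE_REF:-origin/main}"                  # the workflow passes the PR base ref in
    changed="$(git diff --name-only "$base"...HEAD)"

    if grep -q '^web-app1/' <<<"$changed"; then
      ./ci/test-web-app1.sh                          # placeholder validation scripts
    fi
    if grep -q '^api1/' <<<"$changed"; then
      ./ci/test-api1.sh
    fi

The workflow's only job is to check out the repo and run this script on every PR, so the required check always runs and always reflects the right validations.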