Self-hosting chronicle: ACME

Over the last few months, as an evening project, I've been setting up a single-instance k3s cluster. There've been tribulations -- by the nature of the project, all self-induced -- but it's been very gratifying to stretch my legs again with Kubernetes.

For the last week or so, though, I've been banging my head against trying to get forgejo to play nicely with step-ca for ACME.

My garden, at this time

At this point, I think it's fair to say I've sorted out my Kubernetes basics.

I've also technically got a git service with forgejo backed by in-cluster Postgres via the cloudnative-pg operator. There's also an instance of kanidm running... but I won't call them "working" until they're playing nicely together.

Before kanidm will talk to forgejo, it wants forgejo to have HTTPS on its internal domain forge.home.internal.

The issue

I initially tried using csi.cert-manager.io, but Forgejo currently loads certificates at boot and doesn't reload them. To go down this route, I'd have to either hack together some periodic pod rollover or give up on issuing short-lived certificates. Since the point of this whole exercise is navel-gazing, I opted to hook up ACME instead: Gitea supports it natively, and therefore Forgejo does, too.
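
For context, Forgejo inherits Gitea's ACME settings in app.ini's [server] section. A sketch of the relevant keys is below; the option names are Gitea's documented ones, but the ACME_CA_ROOT path is just an illustration of where step-ca's root certificate might be mounted, not a known-good value.

; app.ini -- ACME-related [server] settings (sketch)
[server]
PROTOCOL       = https
DOMAIN         = forge.home.internal
ROOT_URL       = https://forge.home.internal/
ENABLE_ACME    = true
ACME_ACCEPTTOS = true
ACME_URL       = https://ca.home.internal/acme/acme-http/directory
ACME_CA_ROOT   = /data/gitea/certs/root_ca.crt
; ACME_EMAIL left unset, which is why the logs below grumble about an empty email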

Unfortunately, I hit a single issue that was very hard to diagnose because step-ca logs were pretty obtuse, and Forgejo's logs may as well have been missing.

# forgejo.log
2025/11/28 01:12:55 cmd/web.go:252:runWeb() [I] Starting Forgejo on PID: 7
2025/11/28 01:12:55 cmd/web.go:114:showWebStartupMessage() [I] Forgejo version: 13.0.2+gitea-1.22.0 built with GNU Make 4.4.1, go1.25.3 : bindata, timetzdata, sqlite, sqlite_unlock_notify
2025/11/28 01:12:55 cmd/web.go:115:showWebStartupMessage() [I] * RunMode: prod
2025/11/28 01:12:55 cmd/web.go:116:showWebStartupMessage() [I] * AppPath: /usr/local/bin/gitea
2025/11/28 01:12:55 cmd/web.go:117:showWebStartupMessage() [I] * WorkPath: /data
2025/11/28 01:12:55 cmd/web.go:118:showWebStartupMessage() [I] * CustomPath: /data/gitea
2025/11/28 01:12:55 cmd/web.go:119:showWebStartupMessage() [I] * ConfigFile: /data/gitea/conf/app.ini
2025/11/28 01:12:55 cmd/web.go:120:showWebStartupMessage() [I] Prepare to run web server
2025/11/28 01:12:55 routers/init.go:115:InitWebInstalled() [I] Git version: 2.49.1, Wire Protocol Version 2 Enabled (home: /data/home)
2025/11/28 01:12:58 cmd/web.go:314:listen() [I] Listen: https://0.0.0.0:3000
2025/11/28 01:12:58 cmd/web.go:318:listen() [I] AppURL(ROOT_URL): https://forge.home.internal/
1.7642923786393366e+09  info    maintenance     started background certificate maintenance      {"cache": "0xc002de2f80"}
1.764292378645172e+09   info    obtain  acquiring lock  {"identifier": "forge.home.internal"}
1.764292378650004e+09   info    obtain  lock acquired   {"identifier": "forge.home.internal"}
1.76429237865009e+09    info    obtain  obtaining certificate   {"identifier": "forge.home.internal"}
1.7642923786506145e+09  info    creating new account because no account for configured email is known to us     {"email": "", "ca": "https://ca.home.internal/acme/acme-http/directory", "error": "open https/acme/ca.home.internal-acme-acme-http-directory/users/default/default.json: no such file or directory"}
1.7642923786506486e+09  info    ACME account has empty status; registering account with ACME server     {"contact": [], "location": ""}


Your sites will be served over HTTPS automatically using an automated CA.
By continuing, you agree to the CA's terms of service.
Please enter your email address to signify agreement and to be notified
in case of issues. You can leave it blank, but we don't recommend it.
1.7642923786562128e+09  info    creating new account because no account for configured email is known to us     {"email": "", "ca": "https://ca.home.internal/acme/acme-http/directory", "error": "open https/acme/ca.home.internal-acme-acme-http-directory/users/default/default.json: no such file or directory"}
1.7642923786927118e+09  info    new ACME account registered     {"contact": [], "status": "valid"}
1.7642923787033544e+09  info    waiting on internal rate limiter        {"identifiers": ["forge.home.internal"], "ca": "https://ca.home.internal/acme/acme-http/directory", "account": ""}
1.7642923787034037e+09  info    done waiting on internal rate limiter   {"identifiers": ["forge.home.internal"], "ca": "https://ca.home.internal/acme/acme-http/directory", "account": ""}
1.764292378703431e+09   info    using ACME account      {"account_id": "https://ca.home.internal/acme/acme-http/account/FwL0kAzEgmnSBLXKA2tZY6DubbSVrdOJ", "account_contact": []}
1.7642923787483509e+09  info    trying to solve challenge       {"identifier": "forge.home.internal", "challenge_type": "http-01", "ca": "https://ca.home.internal/acme/acme-http/directory"}
# step-ca
{"duration":"27.143206ms","duration-ns":27143206,"fields.time":"2025-11-28T01:07:58Z","level":"info","method":"POST","msg":"","name":"ca","nonce":"MTVmakluTUdoNmpOUWlEVjhmNFRqdzRzcFVzN0YzMko","path":"/acme/acme-http/challenge/JdMZxRmAKzJWumgQWzP6cfZwrIYg2lF9/6ab1NP34gRGexFim3uWQ43lzeII8LKyc","protocol":"HTTP/2.0","referer":"","remote-address":"10.42.0.85","request-id":"23e92001-cdc5-431c-8df8-d51d6ca62db7","response":"{\"type\":\"http-01\",\"status\":\"pending\",\"token\":\"xpCjrWIapTHACHR6qSWmv4I5LTVsB0kR\",\"url\":\"https://ca.home.internal/acme/acme-http/challenge/JdMZxRmAKzJWumgQWzP6cfZwrIYg2lF9/6ab1NP34gRGexFim3uWQ43lzeII8LKyc\",\"error\":{\"detail\":\"The server could not connect to validation target\",\"internal\":\"error doing http GET for url http://forge.home.internal/.well-known/acme-challenge/xpCjrWIapTHACHR6qSWmv4I5LTVsB0kR: Get \\\"http://forge.home.internal/.well-known/acme-challenge/xpCjrWIapTHACHR6qSWmv4I5LTVsB0kR\\\": dial tcp 10.43.211.36:80: connect: operation not permitted\",\"type\":\"urn:ietf:params:acme:error:connection\"}}","size":322,"status":200,"time":"2025-11-28T01:07:58Z","user-agent":"CertMagic acmez (linux; amd64)","user-id":""}
{"duration":"12.124376ms","duration-ns":12124376,"fields.time":"2025-11-28T01:07:59Z","level":"info","method":"POST","msg":"","name":"ca","nonce":"OFZ4TThoY1dRSGVueHNwZ1pzcDlTc0toT3E2bGRtMVg","path":"/acme/acme-http/authz/JdMZxRmAKzJWumgQWzP6cfZwrIYg2lF9","protocol":"HTTP/2.0","referer":"","remote-address":"10.42.0.85","request-id":"9a5a96c4-a160-4388-bd5e-0b5392021b40","response":"{\"identifier\":{\"type\":\"dns\",\"value\":\"forge.home.internal\"},\"status\":\"pending\",\"challenges\":[{\"type\":\"http-01\",\"status\":\"pending\",\"token\":\"xpCjrWIapTHACHR6qSWmv4I5LTVsB0kR\",\"url\":\"https://ca.home.internal/acme/acme-http/challenge/JdMZxRmAKzJWumgQWzP6cfZwrIYg2lF9/6ab1NP34gRGexFim3uWQ43lzeII8LKyc\",\"error\":{\"type\":\"urn:ietf:params:acme:error:connection\",\"detail\":\"The server could not connect to validation target\"}}],\"wildcard\":false,\"expires\":\"2025-11-29T01:07:58Z\"}","size":465,"status":200,"time":"2025-11-28T01:07:59Z","user-agent":"CertMagic acmez (linux; amd64)","user-id":""}
{"duration":"13.676463ms","duration-ns":13676463,"fields.time":"2025-11-28T01:07:59Z","level":"info","method":"POST","msg":"","name":"ca","nonce":"Wk9aT0JmNHBXcmhDcUd6dUxjTGh6NUZEU1NDV2QyWHY","path":"/acme/acme-http/authz/JdMZxRmAKzJWumgQWzP6cfZwrIYg2lF9","protocol":"HTTP/2.0","referer":"","remote-address":"10.42.0.85","request-id":"d778d1e3-e05f-4669-af4f-f0ca5f489e02","response":"{\"identifier\":{\"type\":\"dns\",\"value\":\"forge.home.internal\"},\"status\":\"pending\",\"challenges\":[{\"type\":\"http-01\",\"status\":\"pending\",\"token\":\"xpCjrWIapTHACHR6qSWmv4I5LTVsB0kR\",\"url\":\"https://ca.home.internal/acme/acme-http/challenge/JdMZxRmAKzJWumgQWzP6cfZwrIYg2lF9/6ab1NP34gRGexFim3uWQ43lzeII8LKyc\",\"error\":{\"type\":\"urn:ietf:params:acme:error:connection\",\"detail\":\"The server could not connect to validation target\"}}],\"wildcard\":false,\"expires\":\"2025-11-29T01:07:58Z\"}","size":465,"status":200,"time":"2025-11-28T01:07:59Z","user-agent":"CertMagic acmez (linux; amd64)","user-id":""}
{"duration":"13.984491ms","duration-ns":13984491,"fields.time":"2025-11-28T01:07:59Z","level":"info","method":"POST","msg":"","name":"ca","nonce":"RGwwc0JXdzVFSFlydWNRbkhVNTV2NXRNS2ZTMDFDaUw","path":"/acme/acme-http/authz/JdMZxRmAKzJWumgQWzP6cfZwrIYg2lF9","protocol":"HTTP/2.0","referer":"","remote-address":"10.42.0.85","request-id":"5bad66a1-2c64-4c3f-a3ea-97246668072b","response":"{\"identifier\":{\"type\":\"dns\",\"value\":\"forge.home.internal\"},\"status\":\"pending\",\"challenges\":[{\"type\":\"http-01\",\"status\":\"pending\",\"token\":\"xpCjrWIapTHACHR6qSWmv4I5LTVsB0kR\",\"url\":\"https://ca.home.internal/acme/acme-http/challenge/JdMZxRmAKzJWumgQWzP6cfZwrIYg2lF9/6ab1NP34gRGexFim3uWQ43lzeII8LKyc\",\"error\":{\"type\":\"urn:ietf:params:acme:error:connection\",\"detail\":\"The server could not connect to validation target\"}}],\"wildcard\":false,\"expires\":\"2025-11-29T01:07:58Z\"}","size":465,"status":200,"time":"2025-11-28T01:07:59Z","user-agent":"CertMagic acmez (linux; amd64)","user-id":""}

What's happening?

Well, honestly, there were a few things I'd done wrong that I had to fix first, and that took a few tries:

bef5e4f (HEAD) fixup! fixup! fixup! fixup! fixup! fixup! fixup! fixup! fixup! fixup! fixup! fixup! fixup! fixup! fixup! fixup! fixup! fixup! Enable ACME for Forgejo

It wasn't (only) DNS

My first instinct was that I had messed something up with DNS, which I confirmed:

core@phys1$ kubectl get svc -n forgejo forgejo -o jsonpath='{.status.loadBalancer.ingress[0].ip}'
192.168.40.0

core@phys1$ nslookup forge.home.internal
Server:         192.168.40.53
Address:        192.168.40.53:53


Name:   forge.home.internal
Address: 192.168.41.2

Immediately, a couple of things look odd. First, my forgejo service is apparently at 192.168.40.0, which is the network address of my subnet range 192.168.40.0/21. Second, the IP addresses reported by kubectl and by my DNS server don't match!

I don't know of any hokum in my network or Kubernetes that would actually make an issue of the 192.168.40.0 IP address. Still, having a healthy superstition (and since MetalLB makes it so easy), I disabled IP assignments ending in .0 and .255:

apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: lan-ips
  namespace: metallb-system
spec:
  addresses:
  - 192.168.40.0/21
  autoAssign: true
  avoidBuggyIPs: true

For the DNS issue, it looks like external-dns doesn't clean up after itself. I had duplicate DNS records in Technitium, apparently left over from a previous IP assignment to forgejo. Deleting the old records fixed the mismatch -- I guess I have an infrequent chore now, and with it the gift of a future project in automating it.
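
If I ever automate that chore, the usual approach -- an assumption on my part, I haven't wired this up yet -- is to let external-dns claim ownership of its records via a TXT registry and run it with the sync policy, so it deletes stale entries it created rather than only upserting new ones:

# sketch of external-dns container args; the owner ID is illustrative
args:
  - --source=service
  - --source=ingress
  - --registry=txt
  - --txt-owner-id=home-internal   # marks which records this instance owns
  - --policy=sync                  # delete stale records instead of upsert-only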

It was kind of also DNS, but different

But wait -- that's not even the DNS step-ca is looking at! For this, I needed to resolve against the cluster-internal coredns.

core@phys1$ kubectl get svc -n forgejo forgejo -o jsonpath='{.spec.clusterIP}'
10.43.211.36

core@phys1$ kubectl exec -it -n smallstep-system smallstep-step-certificates-0 -- nslookup forge.home.internal
Server:         10.43.0.10
Address:        10.43.0.10:53

** server can't find forge.home.internal: NXDOMAIN

** server can't find forge.home.internal: NXDOMAIN

Right. I don't have any automation that configures DNS records in coredns for *.home.internal, so it doesn't resolve these FQDNs.

I'd hit this earlier when setting up step-ca, so I cribbed the same config for my Corefile:

rewrite stop {
    name exact forge.home.internal. forgejo.forgejo.svc.cluster.local
    answer name forgejo.forgejo.svc.cluster.local. forge.home.internal
}
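
For completeness, that block sits inside the Corefile's main server block; CoreDNS applies rewrite before the kubernetes plugin answers, so the rewritten name resolves against cluster Service records. The other plugins in this sketch are generic defaults rather than my exact file:

.:53 {
    errors
    health
    rewrite stop {
        name exact forge.home.internal. forgejo.forgejo.svc.cluster.local
        answer name forgejo.forgejo.svc.cluster.local. forge.home.internal
    }
    kubernetes cluster.local in-addr.arpa ip6.arpa {
        pods insecure
        fallthrough in-addr.arpa ip6.arpa
    }
    forward . /etc/resolv.conf
    cache 30
    loop
    reload
}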

Now, Step CA could resolve my Forgejo service via DNS:

core@phys1$ kubectl exec -it -n smallstep-system smallstep-step-certificates-0 -- nslookup forge.home.internal | tail -3
Name:   forge.home.internal
Address: 10.43.211.36

Tangent: snag with k3s

Along the way, I took a detour to replace the coredns bundled with k3s with one I could install via Flux, since I was maintaining the ConfigMap by hand anyway. I hit one small nuisance: running k3s with --disable=coredns also disables the controller logic that maintains the NodeHosts file, which is what lets in-cluster clients resolve node hostnames (e.g. phys1.prod.home.internal).

I've resigned myself to maintaining this by hand for now, too, but it would be nice for k3s to allow enabling this logic on its own, or else to ship it as a standalone controller I could run in my cluster.
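
Maintaining it by hand amounts to the same wiring k3s set up for me: a hosts plugin entry in the Corefile pointed at a file of node entries. A sketch, with an illustrative node IP:

# Corefile fragment: serve node hostnames from a hand-maintained file
hosts /etc/coredns/NodeHosts {
    ttl 60
    reload 15s
    fallthrough
}

# NodeHosts contents (the IP is illustrative)
192.168.40.10 phys1.prod.home.internal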

Pods unreachable

OK, so now the singleton pods at forge.home.internal and ca.home.internal can resolve each other over DNS, but there's no change in the logs between Forgejo and Step CA.

After some more curling from Step CA, I realized that enabling SSL and ACME in Forgejo had broken the readinessProbe, and that the Helm chart didn't provide a way to configure the pod/service to expose port 80 for HTTP redirect -- so no traffic was making it into my Forgejo pod.

# Added to the Flux HelmRelease yaml:
  postRenderers:
    - kustomize:
        patches:
          - target:
              version: v1
              kind: Deployment
              name: forgejo
            patch: |-
              - op: add
                path: /spec/template/spec/containers/0/ports
                value: [ { "containerPort": 80, name: "httpredirect", protocol: "TCP" } ]

I also noticed Forgejo has two distinct life stages when ACME is enabled. It first yields entirely to certmagic for the initial ACME handshake and challenge routine, with certmagic hosting listeners; once that completes, it hosts its own listeners, including the certmagic handler on port 80. Forgejo isn't available on port 443 at all until the initial ACME handshake succeeds -- which makes sense, but does make it harder to tell if I'd gotten the port mappings in my pod/service configurations correct.

I should also point out that my goal here is to get HTTP-01 to succeed, but in desperation I tried TLS-ALPN-01 as well; it didn't work either.

At long last, they see each other

I had a small breakthrough. Trying once more to curl from step-ca, I could see the HTTP challenge answer hosted in Forgejo.

$ kubectl exec -it -n smallstep-system smallstep-step-certificates-0 -- curl http://forge.home.internal/.well-known/acme-challenge/Wj2tEDJz4PSnapXyTeN6hmCe2bmiZ623
Wj2tEDJz4PSnapXyTeN6hmCe2bmiZ623.tcRvgbkRoqepBca1PWppKpIArxCybG3SKecujL67ess

Progress! But looking at logs in Forgejo and Step CA, the ACME handshake was still getting stuck at the same point.

More friction

At this point, I was really scratching my head. I'd had debug logs enabled for both services since before the first events of this blog post. They weren't really helping, so I was thinking about hacking in my own logs to see more of what was taking place.

Fortunately, I found an unmerged PR that had already implemented the logging changes I wanted. I copied those changes onto master, built an image from the in-repo Dockerfile, and hosted it with a temporary podman run -d -p 5000:5000 registry:latest.
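
Roughly the dance, for anyone who hasn't served a one-off image this way before -- the image name, tag, and Dockerfile path below are illustrative:

# throwaway local registry to serve the debug image
podman run -d -p 5000:5000 --name registry registry:latest

# build step-ca from the patched checkout and push it to that registry
podman build -f Dockerfile -t localhost:5000/step-ca:debug .
podman push --tls-verify=false localhost:5000/step-ca:debug

# then point the StatefulSet's image at <host>:5000/step-ca:debug
# (k3s also needs the registry allowed in /etc/rancher/k3s/registries.yaml)

With the debug image swapped in for the stock one, I got some new logs: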

{"duration":"27.143206ms","duration-ns":27143206,"fields.time":"2025-11-28T01:07:58Z","level":"info","method":"POST","msg":"","name":"ca","nonce":"MTVmakluTUdoNmpOUWlEVjhmNFRqdzRzcFVzN0YzMko","path":"/acme/acme-http/challenge/JdMZxRmAKzJWumgQWzP6cfZwrIYg2lF9/6ab1NP34gRGexFim3uWQ43lzeII8LKyc","protocol":"HTTP/2.0","referer":"","remote-address":"10.42.0.85","request-id":"23e92001-cdc5-431c-8df8-d51d6ca62db7","response":"{\"type\":\"http-01\",\"status\":\"pending\",\"token\":\"xpCjrWIapTHACHR6qSWmv4I5LTVsB0kR\",\"url\":\"https://ca.home.internal/acme/acme-http/challenge/JdMZxRmAKzJWumgQWzP6cfZwrIYg2lF9/6ab1NP34gRGexFim3uWQ43lzeII8LKyc\",\"error\":{\"detail\":\"The server could not connect to validation target\",\"internal\":\"error doing http GET for url http://forge.home.internal/.well-known/acme-challenge/xpCjrWIapTHACHR6qSWmv4I5LTVsB0kR: Get \\\"http://forge.home.internal/.well-known/acme-challenge/xpCjrWIapTHACHR6qSWmv4I5LTVsB0kR\\\": dial tcp 10.43.211.36:80: connect: operation not permitted\",\"type\":\"urn:ietf:params:acme:error:connection\"}}","size":322,"status":200,"time":"2025-11-28T01:07:58Z","user-agent":"CertMagic acmez (linux; amd64)","user-id":""}
{"duration":"12.124376ms","duration-ns":12124376,"fields.time":"2025-11-28T01:07:59Z","level":"info","method":"POST","msg":"","name":"ca","nonce":"OFZ4TThoY1dRSGVueHNwZ1pzcDlTc0toT3E2bGRtMVg","path":"/acme/acme-http/authz/JdMZxRmAKzJWumgQWzP6cfZwrIYg2lF9","protocol":"HTTP/2.0","referer":"","remote-address":"10.42.0.85","request-id":"9a5a96c4-a160-4388-bd5e-0b5392021b40","response":"{\"identifier\":{\"type\":\"dns\",\"value\":\"forge.home.internal\"},\"status\":\"pending\",\"challenges\":[{\"type\":\"http-01\",\"status\":\"pending\",\"token\":\"xpCjrWIapTHACHR6qSWmv4I5LTVsB0kR\",\"url\":\"https://ca.home.internal/acme/acme-http/challenge/JdMZxRmAKzJWumgQWzP6cfZwrIYg2lF9/6ab1NP34gRGexFim3uWQ43lzeII8LKyc\",\"error\":{\"type\":\"urn:ietf:params:acme:error:connection\",\"detail\":\"The server could not connect to validation target\"}}],\"wildcard\":false,\"expires\":\"2025-11-29T01:07:58Z\"}","size":465,"status":200,"time":"2025-11-28T01:07:59Z","user-agent":"CertMagic acmez (linux; amd64)","user-id":""}
{"duration":"13.676463ms","duration-ns":13676463,"fields.time":"2025-11-28T01:07:59Z","level":"info","method":"POST","msg":"","name":"ca","nonce":"Wk9aT0JmNHBXcmhDcUd6dUxjTGh6NUZEU1NDV2QyWHY","path":"/acme/acme-http/authz/JdMZxRmAKzJWumgQWzP6cfZwrIYg2lF9","protocol":"HTTP/2.0","referer":"","remote-address":"10.42.0.85","request-id":"d778d1e3-e05f-4669-af4f-f0ca5f489e02","response":"{\"identifier\":{\"type\":\"dns\",\"value\":\"forge.home.internal\"},\"status\":\"pending\",\"challenges\":[{\"type\":\"http-01\",\"status\":\"pending\",\"token\":\"xpCjrWIapTHACHR6qSWmv4I5LTVsB0kR\",\"url\":\"https://ca.home.internal/acme/acme-http/challenge/JdMZxRmAKzJWumgQWzP6cfZwrIYg2lF9/6ab1NP34gRGexFim3uWQ43lzeII8LKyc\",\"error\":{\"type\":\"urn:ietf:params:acme:error:connection\",\"detail\":\"The server could not connect to validation target\"}}],\"wildcard\":false,\"expires\":\"2025-11-29T01:07:58Z\"}","size":465,"status":200,"time":"2025-11-28T01:07:59Z","user-agent":"CertMagic acmez (linux; amd64)","user-id":""}

Two interesting things here. First, the error in the internal field of the first log line, which I've copied below for easier reading.

\"internal\":\"error doing http GET for url http://forge.home.internal/.well-known/acme-challenge/xpCjrWIapTHACHR6qSWmv4I5LTVsB0kR: Get \\\"http://forge.home.internal/.well-known/acme-challenge/xpCjrWIapTHACHR6qSWmv4I5LTVsB0kR\\\": dial tcp 10.43.211.36:80: connect: operation not permitted\"

The second observation was less apparent, but perhaps only because I was so fuzzy-headed by this point. I'd noticed pretty early on that the first log line had path: /acme/acme-http/challenge/<authz>/<challenge> while subsequent log lines had path: /acme/acme-http/authz/<authz>. In the new logs, though, those authz lines didn't have an internal field -- not that they needed one if Forgejo was just polling for completion. Still, nothing in the logs indicated that the ACME challenge was ever re-validated after the initial attempt.

Then I remembered that I'd passed over this other PR, and it occurred to me that this meant step-ca doesn't do any automatic re-validation. Comparing the code to what was happening in my cluster, sure enough:

1. Forgejo sends a "ready" request to Step CA's challenge endpoint.
2. Forgejo wasn't available via the service until seconds later because its readiness probe had a gracePeriod which I'd cribbed from the Helm default.
3. Step CA dutifully tried to validate the challenge.
4. Because Forgejo wasn't actually reachable, Step CA would get connect: operation not permitted.
5. Forgejo happily started polling Step CA's authorization endpoint, waiting for a success signal.

All pods moving forward, happily confident that the other party needed to advance the handshake.

Oddly, there are actually a couple of outer-loop retries in Forgejo, one courtesy of certmagic and another courtesy of the underlying implementation in acmez.

The solution

For now, I've removed the readinessProbe for Forgejo. This is not very HA. It would be nice to be able to configure a cadence of retry POSTs against the challenge endpoint, since re-POSTing a challenge to request a retry is exactly what the ACME spec (RFC 8555's "Retrying Challenges" section) describes. Ideally, Smallstep will merge the PR to implement server-side retries, though that seems unlikely since the last movement on it was in 2020.
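
In HelmRelease terms, dropping the probe is one more small patch alongside the earlier one -- a sketch, since the chart may also expose this through values:

  postRenderers:
    - kustomize:
        patches:
          - target:
              version: v1
              kind: Deployment
              name: forgejo
            patch: |-
              - op: remove
                path: /spec/template/spec/containers/0/readinessProbe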