I Built a Full PKI in 48 Hours. Not to Ship It — to Understand It.
I was about to write an RFP for a bank's PKI infrastructure. Certificate Authority, Key Management System, HSM integration, lifecycle management — the full stack. And I realized I had a problem: I'd evaluated these products from vendor slides and documentation, but I'd never built one end-to-end myself.
How do you write technical requirements for something you've never assembled? How do you know which questions to ask vendors if you've never hit the edge cases yourself? You can't. So instead of writing the RFP first, I built the thing.
Two days later, I had a fully functional PKI running on a hardened VPS: Root CA, Issuing CA, OCSP responder, CRL distribution, ACME server for automated enrollment, mTLS authentication, HA failover with PostgreSQL replication, and monitoring with Telegram alerts. Every component. Every integration point. Every failure mode I could simulate.
And thanks to AI as my engineering co-pilot, what would have taken weeks of trial and error took 48 hours of focused building. Not because AI replaced the thinking — but because it handled the typing, the config file syntax, the troubleshooting loops, while I made the architecture decisions.
Why Build Before You Specify?
Here's what I learned by building that I could never have learned from a vendor demo:
- → EJBCA Community Edition can't run its interactive CLI in Docker. The ca init command relies on Java's System.console(), which returns null inside a container. No vendor doc mentions this. I had to generate the CA keys externally with OpenSSL and import them via PKCS#12. That's a real-world constraint that changes your deployment procedure.
- → OCSP over HTTPS creates a chicken-and-egg problem. If your PKI is down, HTTPS can't verify itself. RFC 5280 expects CRL and OCSP endpoints to be served over plain HTTP, but Caddy auto-redirects everything to HTTPS. It took real debugging to carve out an exception.
- → HA failover actually works, but node recovery takes 2 minutes. I killed Node 1, Caddy routed to Node 2 instantly, zero downtime. But when Node 1 came back, the crypto token needed re-activation. That's an operational detail that belongs in your SOP, not something you discover mid-incident.
- → Docker network segmentation is critical but tricky. Internal networks block internet access (good for the CA database), but they also block the host's Caddy from reaching containers. You need multi-homed containers with static IPs. The IP addressing scheme matters.
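The PKCS#12 workaround from the first lesson can be sketched with OpenSSL alone. This is an illustrative reconstruction: the file names, subject, and password are placeholders, and the exact EJBCA import command varies by version, so only the external key generation and bundling are shown.

```shell
# 1. Generate the Root CA key pair and self-signed certificate outside EJBCA.
#    secp384r1 is the OpenSSL name for NIST P-384; 7300 days ≈ 20 years.
openssl req -x509 -newkey ec -pkeyopt ec_paramgen_curve:secp384r1 \
  -sha384 -days 7300 -nodes \
  -keyout rootca.key -out rootca.crt \
  -subj "/CN=MKP Root CA G1/O=MYKEYPAIR/C=BE"

# 2. Bundle key and certificate into a PKCS#12 keystore for import.
openssl pkcs12 -export -in rootca.crt -inkey rootca.key \
  -name "MKP Root CA G1" -out rootca.p12 -passout pass:changeit

# 3. Sanity-check the bundle before handing it to EJBCA's CA import
#    (the import command itself depends on the EJBCA version).
openssl pkcs12 -in rootca.p12 -passin pass:changeit -noout -info
```

The point is that the container never needs a working System.console(): all interactive steps happen outside, and EJBCA only ever sees a finished keystore.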
None of this is in any vendor's sales deck. All of it goes into my RFP.
The Architecture I Ended Up With
MKP Root CA G1 (ECC P-384, 20-year validity, offline)
└── MKP Issuing CA G1 (ECC P-256, 10-year validity, online)
    ├── TLS Server Certificates
    ├── TLS Client Certificates (mTLS)
    └── OCSP Signing Certificates

Infrastructure:
- 5 Docker networks (isolated, no cross-talk)
- 2 EJBCA nodes (active/passive HA)
- 2 PostgreSQL instances (streaming replication)
- 1 CRL Distribution Point (nginx)
- 1 ACME server (step-ca)
- Caddy reverse proxy (mTLS + Let's Encrypt)
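The multi-homed networking lesson can be sketched with plain Docker commands. This is an illustrative reconstruction rather than the build's actual compose file; the network names, the 172.28.x.0/24 subnets, the static IPs, and the image tag are all assumptions.

```shell
# An internal network: containers attached to it get no route to the
# internet, which is exactly what you want for the CA database.
docker network create --internal \
  --subnet 172.28.1.0/24 pki-db-net

# A second, non-internal network that the reverse proxy can reach.
docker network create \
  --subnet 172.28.2.0/24 pki-edge-net

# The CA node is multi-homed: one static IP per network, so Caddy reaches
# it on the edge network while the database side stays unroutable.
docker run -d --name ejbca-node1 \
  --network pki-db-net --ip 172.28.1.10 keyfactor/ejbca-ce
docker network connect --ip 172.28.2.10 pki-edge-net ejbca-node1
```

Static IPs matter here because the reverse proxy's upstream config has to name a stable address on the edge network; Docker's default dynamic assignment would break it on every restart.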
The HA Proof
The moment that justified the entire exercise — killing a CA node mid-operation:
$ docker stop ejbca-node1
ejbca-node1
$ curl https://pki.mykeypair.be/ejbca/publicweb/healthcheck/ejbcahealth
ALLOK
→ Zero downtime. Caddy detected the failure and routed to Node 2.
→ Database replication: still streaming, zero lag.
Now I know exactly what HA looks like in practice — not in a vendor whitepaper. I know the recovery time (2 minutes). I know the failure modes. I know what to put in the DR test plan.
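A failover drill like the one above is easy to script. Here is a minimal sketch of the classification logic behind a watchdog, assuming the healthcheck URL from the transcript; the alert hook (the Telegram call) is elided as a comment because its exact shape wasn't part of this build's transcript.

```shell
#!/bin/sh
# EJBCA's healthcheck answers with the literal string ALLOK when every
# subsystem (database, crypto tokens, publishers) is healthy.
HEALTH_URL="https://pki.mykeypair.be/ejbca/publicweb/healthcheck/ejbcahealth"

classify_health() {
  # Turn a raw healthcheck body into a one-line status.
  if [ "$1" = "ALLOK" ]; then
    echo "OK"
  else
    echo "DEGRADED: ${1:-no response}"
  fi
}

# In a real drill, feed curl output into the classifier and alert on failure:
# status=$(classify_health "$(curl -fsS --max-time 5 "$HEALTH_URL")")
# case "$status" in DEGRADED*) send_telegram_alert "$status" ;; esac
```

Treating "no response" and "wrong response" as the same DEGRADED state keeps the alerting simple: anything other than ALLOK pages you.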
mTLS: Seeing Zero Trust Work
$ curl https://test.mykeypair.be/ # no cert
curl: (56) Connection reset
$ curl --cert client.p12 --cert-type P12 https://test.mykeypair.be/
mTLS verified! You are authenticated by MKP PKI.
Subject: CN=Ismail Zemouri,O=MYKEYPAIR,C=BE
Not a 403. Not a login page. Connection reset. The server won't even talk to you without a valid certificate. That's what Zero Trust looks like when you actually implement it.
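If you want to reproduce that second test yourself, note that curl needs the keystore type spelled out for PKCS#12 bundles. A self-contained sketch with a throwaway client identity (placeholder password and file names, not the real MKP certificate):

```shell
# Throwaway client identity; prime256v1 is the OpenSSL name for NIST P-256.
openssl req -x509 -newkey ec -pkeyopt ec_paramgen_curve:prime256v1 \
  -sha256 -days 1 -nodes \
  -keyout client.key -out client.crt \
  -subj "/CN=Ismail Zemouri/O=MYKEYPAIR/C=BE"

# Bundle it the way a real CA enrollment would hand it to you.
openssl pkcs12 -export -in client.crt -inkey client.key \
  -out client.p12 -passout pass:changeit

# Inspect the subject the server will see during the mTLS handshake.
openssl pkcs12 -in client.p12 -passin pass:changeit -nokeys -clcerts |
  openssl x509 -noout -subject

# curl must be told the keystore type for PKCS#12 bundles:
# curl --cert client.p12:changeit --cert-type P12 https://test.mykeypair.be/
```

Without --cert-type P12, curl assumes the file is PEM and the handshake never even starts, which is easy to misread as a server-side rejection.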
The AI Factor
I need to be honest about the role AI played here. I used Claude as my engineering co-pilot throughout the build. It wrote Docker configs, debugged Caddy routing issues, generated compliance documentation, and handled the repetitive parts — while I made every architecture decision, chose algorithms, designed the network topology, and decided what to build and why.
The result: what would normally take a solo engineer 2-3 weeks of evenings took 48 hours of intense, focused work. The AI didn't replace expertise — it amplified it. I still needed to know what "right" looks like. But I didn't need to remember every OpenSSL flag or Docker Compose syntax.
This is the future of security engineering. Not AI replacing architects, but architects moving faster because the implementation friction is gone. The thinking still matters. The typing doesn't.
What This Means for the RFP
My RFP is now grounded in reality, not theory:
- ✓ I know which EJBCA features require Enterprise vs Community edition
- ✓ I know what HA actually requires in terms of database replication
- ✓ I know the operational procedures that must exist before go-live
- ✓ I know what Certificate Lifecycle Management needs to cover beyond basic issuance
- ✓ I know the security hardening baseline a PKI server must meet (Lynis 77+)
- ✓ I can challenge vendor claims because I've seen what works and what breaks
The best technical requirements come from people who've built the thing. Not from people who've read about it.
The Takeaway
If you're responsible for specifying PKI infrastructure for your organization — build a proof of concept first. Not to ship it. To understand it. Every edge case you discover in the POC is a requirement you'd otherwise miss in the RFP. Every failure mode you simulate is an SOP you need to write.
And if you don't have 48 hours to spare, I've already done it. I can help you map your requirements to reality — because I've lived the gap between vendor promises and production deployment.