Homelabs, am I right?

You might have heard that a homelab is about the gear. It can be. I love buying server hardware, finding deals on refurbished equipment, and frankensteining a chassis together from parts until it POSTs. That part is fun. But the reason the lab stays interesting, years in, is everything that happens after the hardware works.

A homelab is a playground where the stakes are low enough to experiment and high enough to learn. You can install a new operating system, break it, blow it away, and start over. With a hypervisor like Proxmox you can spin up a server in minutes, test something, tear it down, and try a different approach. You can bash your head against a problem for hours, solve it, and walk away knowing that most people would have called support.

My lab also does useful work. I self-host audiobooks so I can listen on my commute to titles I own, on infrastructure I control. I run Plex so I can tell my kids “go check the server” when they want a movie, and they have some idea that their dad made that possible. My father is a long-distance truck driver, and he calls me every time the server goes down. The books give us something to talk about on his long hauls, and that connection rides on a service I built and maintain in a closet.

I run a custom router and my own DNS because I like having control of my network. I keep my services off the public internet. I block ads and trackers at the DNS level. I have not seen an ad on my phone in years.

And yes, the lab is where I practice what I preach. I segment my network with VLANs, I maintain a ports-and-protocols matrix, I write decision records and runbooks, and I keep diagrams that another engineer could follow. I built the website you are reading right now on this stack, domain and all, because I wanted a business site I could stand up in hours at no cost. That is the spirit of the lab: useful, documented, mine.

The DNS story

I was running my own recursive resolver. One morning, DNS across the entire home network failed. Every device, every service, nothing resolving. I suspect a configuration change I had made the night before looked syntactically valid but broke upstream forwarding; cached entries kept working until their TTLs expired, and then resolution stopped.
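When DNS dies like this, the fastest way to localize the fault is to ask the same question of the local resolver and a known-good upstream, bypassing the system resolver entirely. A minimal triage sketch, assuming the dnspython package and a hypothetical lab resolver address:

```python
import dns.resolver  # third-party: pip install dnspython

def check(name: str, server: str) -> str:
    """Query one specific DNS server, bypassing the system resolver config."""
    resolver = dns.resolver.Resolver(configure=False)
    resolver.nameservers = [server]
    resolver.lifetime = 3  # seconds before we call it a failure
    try:
        answer = resolver.resolve(name, "A")
        return ", ".join(record.address for record in answer)
    except Exception as exc:
        return f"FAILED ({type(exc).__name__})"

# 192.168.1.53 is a placeholder for the lab resolver; substitute your own.
for label, server in [("lab resolver", "192.168.1.53"), ("upstream", "1.1.1.1")]:
    print(f"{label:12} -> {check('example.com', server)}")
```

If the upstream answers and the lab resolver does not, the problem lives in your closet, not on the internet. That is exactly the split I needed that morning.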

This happened to land on the day of a job interview.

I spent a few hours diagnosing the failure, tracing the resolution path, restarting the service, validating upstream connectivity, and getting the household back online. I made the interview on time. The interviewer asked how my day was going. I said “DNS issues. It is always DNS issues.” The interviewer laughed, and the rest of the hour turned into a walkthrough of the incident: what broke, how I triaged it, what I fixed, and what I would do differently. Not a single prepared question. A live outage and recovery told the interviewer more than any whiteboard exercise could.

I got that job. It was a platform engineering role, and I spent a couple of years getting paid to troubleshoot computers, scripts, automation frameworks, and cloud tenants, often at a depth that led to conversations with product teams or upstream maintainers because I had found a legitimate bug. Kernel-level troubleshooting, sometimes. I loved it.

The lab did not get me that job by itself. But it gave me a place to build the instincts and the stories that made the interview feel like a conversation between two people who fix things for a living.

What the lab does

The design starts with what the system has to support and who has to live with it. For my lab, the main jobs are:

  1. Run household services that fade into the background.
  2. Host experiments in a place where experiments stay contained.
  3. Publish selected services through a controlled edge.
  4. Keep a management path alive when the routed network changes.
  5. Produce diagrams, tables, and runbooks that another engineer could follow.

That last job matters. A lab documented only in muscle memory teaches improvisation. A lab with diagrams, decision records, and validation notes teaches design.

The first diagram is trust

Before I care about hardware, switch ports, or firewall syntax, I care about trust. The first drawing I want is a map of where trust changes and where repair access lives.

Trust boundaries and data flows

Dashed borders are trust boundaries. Each zone carries a different trust level. The repair actor operates out-of-band, outside all routed zones.

This diagram is a trust model. Each dashed border is a trust boundary; data crossing one of those borders changes privilege level and deserves a firewall rule or proxy decision. The reconnaissance actor shows what the internet already knows about your lab from certificate transparency logs and DNS records. The repair actor sits outside every routed zone because the moment you lose the management plane, you lose the ability to fix everything else.

Router-on-a-stick, with a lifeline

The lab did not start here. The first routing design was a dedicated pfSense box with a dual-port 10 Gbps NIC, one side for WAN and the other for LAN. It worked well, and I liked how it separated routing from the rest of the infrastructure. Then one of the two ports failed.

It did not fail all at once. Over the course of a few months, the connection would drop intermittently, in an almost imperceptible way. I chased the problem through cables, switch ports, and driver settings before discovering the NIC itself was dying. Replacing the fancy dual-port, autonegotiating NIC would not have been cheap, so I decided that was the end of the experiment. I regretfully retired the pfSense box and switched back to my regular old Google Nest WiFi router for a while. I noodled with the idea of building a new router, or buying a “real” one, and eventually landed on a more interesting solution:

I moved routing into the hypervisor instead of buying another card. The firewall runs as a virtual machine on Proxmox, and the physical network reduces to a single trunk over another 10 Gbps link on the main server. All VLANs ride that trunk, and the firewall VM handles inter-VLAN routing. Less hardware to break. One fewer box in the closet.

The trade-off is real: routing now depends on the hypervisor. If Proxmox is down, the network is down. I knew that and chose it anyway. The mitigation is the lifeline: a separate management path that does not depend on the routed network. It might be a console server, a dedicated admin VLAN, a VPN that terminates on the hypervisor, or a documented recovery procedure taped to the inside of the closet door. The form changes. The principle stays the same: the repair path belongs beside the routed path, and the design should name both.
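The lifeline only counts if it works when the routed network does not, which means testing it on a good day. A sketch of the kind of check I mean, with hypothetical addresses and Linux ping flags:

```python
import subprocess

# Hypothetical addresses: the routed gateway and a management interface
# reachable over the out-of-band path. Substitute your own.
PATHS = {
    "routed gateway": "192.168.1.1",
    "out-of-band mgmt": "10.99.0.10",
}

def alive(host: str) -> bool:
    """One ping, two-second timeout (Linux flags). True if the host answered."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", "2", host],
        capture_output=True,
    )
    return result.returncode == 0

for label, host in PATHS.items():
    print(f"{label:18} {'up' if alive(host) else 'DOWN'}")
```

Run it from the admin workstation. If the second line ever says DOWN while the first says up, the lifeline has quietly rotted, and it is better to learn that now.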

The matrix is the explanation layer

Firewall rules are implementation. A ports, protocols, and services management (PPSM) matrix is the explanation layer.

The matrix should tell a reader why traffic exists before they look at a firewall GUI. This is the kind of shareable version I would put in a design package:

| Source role | Destination role | Protocol | Port | Purpose | Review trigger |
|---|---|---|---|---|---|
| Internet | Public edge | TCP | 80, 443 | Publish selected web services and renew certificates | Any new public DNS name |
| Public edge | Service app | TCP | App port | Send approved requests to the service | New app, new service path, or changed authentication model |
| Admin workstation | Management plane | TCP | SSH, admin UI | Maintain hypervisor, firewall, and switch | New admin user or new management host |
| Service zone | DNS resolver | UDP/TCP | 53 | Resolve internal and external names | Resolver change or new zone |
| Service zone | Internet | TCP | 443 | Package updates, APIs, and outbound integrations | New vendor dependency |
| Household devices | Media service | TCP | Media app port | Local playback and casting support | New media platform or new device class |
| IoT devices | Management controller | TCP | Controller ports | Device adoption and status checks | New controller or wireless redesign |

The working version would have real interfaces, addresses, aliases, and rule IDs. This version shows the discipline: every path has a source, destination, protocol, port, purpose, and review trigger.

That review trigger is the part people skip. It is also the part that keeps a firewall from turning into sediment. If a rule has no trigger, nobody revisits it. If nobody revisits it, the firewall accumulates rules the way a garage accumulates boxes: each one made sense at the time.

A blank row for your own lab or facility:

| Source role | Destination role | Protocol | Port | Purpose | Review trigger |
|---|---|---|---|---|---|
| (your zone) | (target zone) | TCP/UDP | (port) | (why this traffic exists) | (what change would make you revisit this rule) |

Start with one row per firewall rule you can explain. If you find a rule you cannot explain, that is the row that matters most.
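That check is easy to mechanize. A minimal sketch that lints a hypothetical ppsm.csv export, flagging any row missing a purpose or a review trigger:

```python
import csv

# Hypothetical export: one row per firewall rule, columns matching the matrix.
REQUIRED = ["source", "destination", "protocol", "port", "purpose", "review_trigger"]

with open("ppsm.csv", newline="") as f:
    for line_no, row in enumerate(csv.DictReader(f), start=2):  # header is line 1
        missing = [col for col in REQUIRED if not (row.get(col) or "").strip()]
        if missing:
            print(f"line {line_no}: rule is unexplained, missing {', '.join(missing)}")
```

Anything this prints is a rule nobody can explain, which is the row that matters most.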

Look at that first row again: Internet → Public edge, port 443. If you put a reverse proxy in the public edge, that row is the only inbound WAN rule you need. One port, one destination, and the firewall can default deny everything else inbound. Your Minecraft server, your file shares, your management interfaces: none of them need a hole punched through the firewall because none of them face the internet directly. The proxy is the only thing that does.

That keeps the firewall simple. It also shrinks the vulnerability management problem, because patching one proxy is a different job than patching every service that used to have its own forwarded port. Adding a new service means adding a proxy route, not a new firewall rule.

The reverse proxy is the front porch

The proxy earns its place by being the decision point that no request skips. It terminates TLS with Let’s Encrypt certificates, matches each request’s hostname against the names on the cert, and makes a routing choice: forward to the right internal service, or refuse and log the attempt. Access logs live in one place.

Public request through the edge

A public DNS name is discoverable. The proxy decision should be simple to explain: route the approved request, refuse the unknown one, and leave evidence.
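The decision at the edge is small enough to sketch. This toy is not a proxy, but it is the shape of the choice any reverse proxy makes on every request: allowlist lookup, default refusal, and a log line either way. Hostnames and backends are hypothetical:

```python
import logging

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(message)s")
log = logging.getLogger("edge")

# The allowlist is the design: only named services get a route.
ROUTES = {
    "books.example.com": "http://10.0.20.11:8080",
    "media.example.com": "http://10.0.20.12:32400",
}

def decide(hostname: str) -> str | None:
    """Route the approved request, refuse the unknown one, leave evidence."""
    backend = ROUTES.get(hostname)
    if backend is None:
        log.info("REFUSED %s (no route)", hostname)
        return None
    log.info("ROUTED  %s -> %s", hostname, backend)
    return backend

decide("books.example.com")    # routed
decide("proxmox.example.com")  # refused: management stays off the edge
```

Everything not in the map falls through to a refusal, which is the firewall’s default-deny posture carried up into the proxy.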

During a routine audit of my own lab, I found that the reverse proxy was routing public traffic to management interfaces, including the login page. Every subdomain with a certificate appeared in logs, and tools could list them all. I had set up the proxy for convenience, and convenience is what the internet got too.

That audit changed the design:

  • Management interfaces moved to internal-only access paths.
  • Public services carry their own authentication and logging.
  • The proxy routes a narrow set of named service paths.
  • Edge logs show what crossed the boundary.
  • The PPSM gained a review trigger for every new public DNS name.

If a public DNS name has a certificate, certificate transparency makes it discoverable. Security architecture starts when the design names what crosses the edge, who can use it, and how anyone will know.
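You can check what the internet knows without waiting for an audit. A sketch that asks crt.sh, a public certificate transparency search service, for every name certified under a domain; it assumes the requests package, and the JSON shape belongs to crt.sh, not to me:

```python
import requests  # third-party: pip install requests

domain = "example.com"  # placeholder: put your own domain here
resp = requests.get(
    "https://crt.sh/",
    params={"q": f"%.{domain}", "output": "json"},
    timeout=30,
)
resp.raise_for_status()

# Each CT entry lists one or more names, newline-separated.
names = set()
for entry in resp.json():
    names.update(entry["name_value"].splitlines())

for name in sorted(names):
    print(name)
```

If a hostname you thought was private shows up in that list, so can anyone else.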

Storage is where trade-offs get real

You may have heard the 3-2-1 backup rule: three copies, two media types, one offsite. It is good advice. I follow it for the data that matters. And I break it, on purpose, for the data that does not.

The media array in this lab is a striped multi-disk pool with no redundancy. No mirror, no parity, no second copy. A drive failure takes the pool with it. I chose that layout because hard drives are expensive and the only thing on that array is audiobooks and movies for my family. Paw Patrol is important to my six-year-old daughter, but it is not critical data. Every title can be re-ripped or re-downloaded. More disk space, accepted risk, and a clear-eyed decision about what “loss” means for this workload.

That is the kind of trade-off that separates a design from a shopping list. The interesting question is not which RAID level to pick. It is what happens when a drive fails, and whether the answer is something you can live with.
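Putting a number on the accepted risk makes the trade-off honest. Back-of-the-envelope arithmetic, with an illustrative per-drive annualized failure rate rather than a real reliability model:

```python
# Assume each drive fails independently in a given year with probability afr.
afr = 0.05   # illustrative annualized failure rate; your drives will differ
disks = 4

# A striped pool is lost if ANY single drive is lost.
p_stripe_loss = 1 - (1 - afr) ** disks
print(f"{disks}-disk stripe: {p_stripe_loss:.1%} chance of losing the pool this year")
# 4 disks at 5% AFR -> about 18.5% per year.
```

An 18.5% annual chance of re-downloading movies is a nuisance. The same number against configurations and keys would be negligence, which is why the pools are built differently.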

The irreplaceable data gets treated differently. Configurations, service data, databases, keys, diagrams, and runbooks are the things that make the system mine. Those live on a separate pool with redundancy, automated snapshots, and offsite backups. The decision record names the risk I accepted and the mitigations that make it tolerable:

  • Media pool: striped, no redundancy, re-downloadable content. Accepted risk.
  • System pool: redundant, snapshotted, backed up. Recovery path documented.
  • Monitor disk health and capacity on both.
  • Revisit the design when the hardware budget or workload changes.

That is the kind of decision record I want in a client handoff. A storage decision should say more than what to buy. It should name the choice, the failure mode, and what happens when something breaks.

What this says about how I design

The lab has been through a dead NIC, a DNS outage on interview day, a striped array I chose to let fail, and a reverse proxy that was handing management interfaces to the internet. Every one of those stories started with a decision I made and ended with a diagram, a table, or a runbook that made the next decision easier.

That is the practice. Draw the trust boundaries before you pick the hardware. Write the PPSM row before the firewall rule. Keep a repair path that works when the thing you are repairing is the network itself. And write down the trade-off while the reasons are still fresh, because six months from now you will be a different engineer staring at your own work.

I built this website on the same stack, domain and all, because I wanted proof that the lab produces real things. The next piece will walk through how that works: the build, the deployment, the monitoring, and the part where my six-year-old’s Paw Patrol library and my dad’s audiobooks ride the same infrastructure as the site you are reading now.

