Learn Zig Series (#99) - Mini Project: DNS-over-HTTPS Proxy

What will I learn?

Why the operating system still asks for names in plaintext UDP long after the rest of your traffic went encrypted -- and how a small local proxy quietly fixes that for every app on the machine;
The one fact that makes this whole project almost embarrassingly small: a DNS-over-HTTPS body is the exact same bytes as a classic DNS packet (episode 82), just carried inside an HTTP POST (episode 84) over TLS (episode 86);
How to stand up a UDP listener on port 53 that speaks the legacy protocol (episode 81) and forwards each query upstream without parsing a single DNS field;
How Zig's error sets and a small Resolver struct keep the forwarding path honest about every way an upstream request can fail;
Where we do have to crack open the DNS wire format after all -- a TTL-aware cache -- and why the transaction ID is the one byte-pair you must never cache;
How this "translate one transport into another" shape compares to what you'd build in C, Rust, or Go.

Requirements

A working modern computer running macOS, Windows or Ubuntu;
An installed Zig 0.14+ distribution (download from ziglang.org);
The ambition to learn Zig programming.

Difficulty

Intermediate

Curriculum (of the `Learn Zig Series`):

Learn Zig Series (#99) - Mini Project: DNS-over-HTTPS Proxy

Here we go ;-) Time to point the networking toolkit at a different shape of problem, exactly as I promised at the end of the chat arc. The chat server was a thing humans sit in front of and talk through. This one is a thing that sits quietly between two protocols and translates, with no human in the loop at all -- a proxy. And it scratches a real itch: the moment your browser loads this page, almost every byte it sends is encrypted, but the very first thing it did -- asking "what is the IP for hive.blog?" -- went out as plaintext UDP that anyone on your network path can read, and worse, tamper with. That is DNS, and it has been leaking your browsing history in the clear since 1987.

DNS-over-HTTPS (DoH, defined in RFC 8484) closes that gap by wrapping the DNS query in an ordinary HTTPS request to a resolver like Cloudflare or Google. Some apps speak DoH natively now, but most of your machine -- system utilities, older programs, that one weird binary from 2014 -- still only knows how to ask the old way. So we build the bridge: a tiny local proxy that listens in the legacy plaintext protocol and forwards over the encrypted one. Every app on the box keeps doing exactly what it always did, and their name lookups silently become private. That's the kind of leverage I like -- fix one small thing in one place, and the whole system benefits without knowing it.

The one fact that makes this project small

Before a line of code, let me hand you the insight the entire project rests on, because it's the reason a "DNS-over-HTTPS proxy" is a weekend afternoon and not a month. Here it is: the body of a DoH request is the identical byte-for-byte DNS message we already built in episode 82. RFC 8484 did not invent a new query format. It took the classic RFC 1035 wire format -- the header, the question, the length-prefixed labels, all of it -- and said "put those exact bytes in the body of an HTTP POST, set Content-Type: application/dns-message, and you're done." The response comes back the same way: the HTTP body is a standard DNS response packet.

To make it concrete, here is the same message living in both worlds -- classic UDP on the left, DoH on the right -- and the DNS message bytes are identical in both:

  Classic DNS over UDP           DNS-over-HTTPS (RFC 8484)
  --------------------           -------------------------
  [ UDP header        ]          POST /dns-query HTTP/1.1
  [ DNS message bytes ]  <---->   Host: cloudflare-dns.com
    id | flags | quest.          Content-Type: application/dns-message
    | answer | authority         Accept: application/dns-message
    | additional                 Content-Length: 33
                                  <blank line>
                                 [ DNS message bytes ]  <-- same bytes
                                   id | flags | quest.
                                   | answer | ...
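
To put a number on that payload, here is a minimal sketch that builds a plain A-record query by hand. buildQuery is a hypothetical stand-in (not the episode 82 builder), and it assumes a single question with no EDNS record attached. A query for www.example.com comes to 12 header bytes + 17 name bytes + 4 type/class bytes, which happens to be the same 33 shown as Content-Length in the diagram.

```zig
const std = @import("std");

// Hypothetical helper: encode a single A/IN question for `name` into `buf`
// and return the slice that was used. A sketch, not the episode 82 builder.
fn buildQuery(buf: []u8, id: u16, name: []const u8) ![]u8 {
    if (buf.len < 12) return error.NoSpaceLeft;
    std.mem.writeInt(u16, buf[0..2], id, .big); // transaction ID
    std.mem.writeInt(u16, buf[2..4], 0x0100, .big); // flags: recursion desired
    std.mem.writeInt(u16, buf[4..6], 1, .big); // QDCOUNT = 1
    @memset(buf[6..12], 0); // ANCOUNT, NSCOUNT, ARCOUNT = 0

    var pos: usize = 12;
    var labels = std.mem.splitScalar(u8, name, '.');
    while (labels.next()) |label| {
        if (label.len == 0 or label.len > 63) return error.InvalidLabel;
        if (pos + 1 + label.len > buf.len) return error.NoSpaceLeft;
        buf[pos] = @intCast(label.len); // length prefix
        @memcpy(buf[pos + 1 ..][0..label.len], label);
        pos += 1 + label.len;
    }

    if (pos + 5 > buf.len) return error.NoSpaceLeft;
    buf[pos] = 0; // root label terminates the name
    std.mem.writeInt(u16, buf[pos + 1 ..][0..2], 1, .big); // QTYPE = A
    std.mem.writeInt(u16, buf[pos + 3 ..][0..2], 1, .big); // QCLASS = IN
    return buf[0 .. pos + 5];
}
```

Whatever slice this returns is what would travel as the UDP payload on the left and as the POST body on the right, unchanged.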

Chew on what that means for our proxy. A legacy app hands us a DNS query as a blob of UDP bytes. To forward it over DoH, we do not parse it, we do not understand it, we do not care whether it's asking for an A record or a TXT record. We take the blob, POST it, and get a blob back. Then we send that blob to the app over UDP. The app parses it, none the wiser that it made a round trip through TLS and HTTP on the way. For the core proxy, DNS is an opaque payload. All that networking machinery from episodes 82 through 86 -- the DNS format, HTTP/1.1, TLS -- we get to reuse the shape of without re-deriving any of it. Having said that, let's stand up the two ends.

The listening end: a UDP relay on port 53

The front door is a UDP socket bound to the DNS port, straight out of episode 81. Real DNS lives on port 53, which is privileged (below 1024), so a production build needs CAP_NET_BIND_SERVICE (episode 71) or a run as root; for development I bind 5353 so no special rights are needed and you can point a test client at it with dig @127.0.0.1 -p 5353.

const std = @import("std");
const posix = std.posix;

pub fn main() !void {
    var gpa = std.heap.GeneralPurposeAllocator(.{}){};
    defer _ = gpa.deinit();
    const alloc = gpa.allocator();

    // Bind a UDP socket. 5353 in dev (unprivileged); 53 needs CAP_NET_BIND_SERVICE.
    const addr = try std.net.Address.parseIp4("127.0.0.1", 5353);
    const sock = try posix.socket(posix.AF.INET, posix.SOCK.DGRAM, posix.IPPROTO.UDP);
    defer posix.close(sock);
    try posix.bind(sock, &addr.any, addr.getOsSockLen());

    var resolver = Resolver.init(alloc, .cloudflare);
    defer resolver.deinit();

    var buf: [512]u8 = undefined; // classic DNS/UDP caps a message at 512 bytes
    while (true) {
        // Who asked, and what did they ask? recvfrom gives us both.
        var src: posix.sockaddr = undefined;
        var src_len: posix.socklen_t = @sizeOf(posix.sockaddr);
        const n = posix.recvfrom(sock, &buf, 0, &src, &src_len) catch continue;

        // buf[0..n] is an opaque DNS query. Forward it, get an opaque answer.
        const answer = resolver.resolve(buf[0..n]) catch |err| {
            std.log.warn("upstream failed: {s}", .{@errorName(err)});
            continue; // drop this query; the client will retry
        };
        defer alloc.free(answer);

        _ = posix.sendto(sock, answer, 0, &src, src_len) catch {};
    }
}

The whole loop is receive, forward, reply, and the shape is deliberately dumb: one datagram in, one datagram out, to the exact source address recvfrom reported. Note the two spots where a failure is a shrug, not a crash. If forwarding fails we continue -- DNS clients already expect packet loss over UDP and will retry, so a dropped query is the mildest possible degradation. And the classic 512-byte cap on buf is not me being stingy: the original DNS-over-UDP spec really does limit a message to 512 bytes (the EDNS0 extension raises it, but 512 is the floor every client honours). One fixed stack buffer, no allocation on the hot receive path. The only heap work per query happens downstream, in the resolver.

The forwarding end: the DoH POST

Now the interesting half. resolver.resolve takes the opaque query bytes and has to turn them into an HTTPS POST, read the body back, and return it. This is where episodes 84 (HTTP/1.1) and 86 (TLS) come in. We built both by hand back then, but the standard library already stitches them together in std.http.Client, so I'll use that here; under the hood it is the same TLS handshake and HTTP framing. Reusing the polished version is not cheating; knowing what it's made of is the point.

// Send the opaque DNS query as an RFC 8484 DoH POST and return the response body,
// which is itself a valid DNS response packet. Caller owns the returned slice.
// Written against the Zig 0.14 std.http.Client API; later releases may differ.
fn forwardDoH(alloc: std.mem.Allocator, url: []const u8, query: []const u8) ![]u8 {
    var client = std.http.Client{ .allocator = alloc };
    defer client.deinit();

    const uri = try std.Uri.parse(url);
    var hdr_buf: [4096]u8 = undefined;

    var req = try client.open(.POST, uri, .{
        .server_header_buffer = &hdr_buf,
        .extra_headers = &.{
            .{ .name = "content-type", .value = "application/dns-message" },
            .{ .name = "accept", .value = "application/dns-message" },
        },
    });
    defer req.deinit();

    req.transfer_encoding = .{ .content_length = query.len };
    try req.send();
    try req.writeAll(query); // the DNS query bytes ARE the request body
    try req.finish();
    try req.wait();

    if (req.response.status != .ok) return error.UpstreamFailed;

    // The response body IS the DNS answer. No parsing, just hand it back.
    return req.reader().readAllAlloc(alloc, 64 * 1024);
}

Look how little DNS knowledge lives in here: none. We set two headers that say "this body is a DNS message and I want one back", we write the query bytes as the body, and we read the body of the reply. The two content types are the entire DoH contract -- get them right and any compliant resolver answers; get them wrong and you get a 415 and a bad afternoon. The readAllAlloc cap of 64 KiB is a deliberate ceiling: a DNS answer over HTTPS can exceed the old 512-byte UDP limit (that's part of why DoH is nice), but it will never be megabytes, so we bound it and refuse anything absurd rather than letting a hostile or broken upstream balloon our memory.

Type system and error handling: the `Resolver` struct

The main loop referenced a Resolver with an init, a resolve, and a deinit. Wrapping the forwarding logic in a small struct is not ceremony -- it gives us a home for configuration (which upstream, cached or not) and, more importantly, it lets Zig's error sets (episode 4) enumerate every way a lookup can go wrong so a caller can never forget a case.

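// Every distinct failure mode of a lookup, named. A caller switching on this
// can't silently ignore one, because Zig makes exhaustive handling checkable.
// Note: resolve() below still returns an inferred error set (`![]u8`). To make
// this list binding, declare its return type as ResolveError![]u8 and map the
// errors coming out of std.http.Client onto these four names.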
const ResolveError = error{
    UpstreamFailed,   // non-200 from the DoH server
    ResponseTooLarge, // body exceeded our 64 KiB ceiling
    NetworkDown,      // TLS/TCP couldn't even connect
    OutOfMemory,
};

// A DoH endpoint, chosen at init by an enum so the call site reads cleanly.
const Upstream = enum {
    cloudflare,
    google,

    fn url(self: Upstream) []const u8 {
        return switch (self) {
            .cloudflare => "https://cloudflare-dns.com/dns-query",
            .google => "https://dns.google/dns-query",
        };
    }
};

const Resolver = struct {
    alloc: std.mem.Allocator,
    upstream: Upstream,
    cache: Cache,

    fn init(alloc: std.mem.Allocator, upstream: Upstream) Resolver {
        return .{ .alloc = alloc, .upstream = upstream, .cache = Cache.init(alloc) };
    }

    fn deinit(self: *Resolver) void {
        self.cache.deinit();
    }

    // Look up an opaque DNS query: cache first, upstream on a miss.
    fn resolve(self: *Resolver, query: []const u8) ![]u8 {
        if (try self.cache.get(query)) |hit| return hit; // owned copy, ID patched
        const answer = try forwardDoH(self.alloc, self.upstream.url(), query);
        errdefer self.alloc.free(answer); // don't leak it if caching fails
        try self.cache.put(query, answer); // the cache stores its own copy
        return answer; // already an owned slice; the caller frees it
    }
};

The Upstream enum is a small thing that pays off in readability: the call site wrote Resolver.init(alloc, .cloudflare), which reads like English and makes switching providers a one-word edit rather than a URL floating loose in the code. And because url() is an exhaustive switch over the enum, the day I add a .quad9 variant the compiler refuses to build until I give it a URL -- the missing case is a compile error, not a runtime surprise at 3am. That is episode 6's enum discipline earning its keep: the type system holds the list of valid choices, and Zig checks I handled all of them.
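
The same exhaustiveness argument applies to the error set, provided something actually funnels errors into it. Here is a minimal sketch of that funnel. narrow and describe are hypothetical helpers, and which bucket each lower-level error lands in is my assumption, not a taxonomy the standard library defines.

```zig
const std = @import("std");

const ResolveError = error{
    UpstreamFailed,
    ResponseTooLarge,
    NetworkDown,
    OutOfMemory,
};

// Hypothetical adapter: collapse whatever the lower layers report into the
// four named cases. The bucket choices are assumptions for illustration.
fn narrow(err: anyerror) ResolveError {
    return switch (err) {
        error.OutOfMemory => error.OutOfMemory,
        error.StreamTooLong => error.ResponseTooLarge, // readAllAlloc hit its cap
        error.UpstreamFailed => error.UpstreamFailed,
        else => error.NetworkDown, // everything else: treat as unreachable upstream
    };
}

// No `else` branch here: add a fifth member to ResolveError and this switch
// stops compiling until the new case is handled.
fn describe(err: ResolveError) []const u8 {
    return switch (err) {
        error.UpstreamFailed => "upstream answered with a non-200 status",
        error.ResponseTooLarge => "upstream body exceeded the size ceiling",
        error.NetworkDown => "could not reach the upstream",
        error.OutOfMemory => "allocator is out of memory",
    };
}
```

With this in place, a resolve declared as returning ResolveError![]u8 could call forwardDoH(...) catch |e| return narrow(e), and the main loop could log describe(err) in place of the raw error name.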

Where we finally parse DNS: a TTL-aware cache

Here's the twist that makes this more than a dumb relay. Ask any busy machine and you'll find it looks up the same handful of names over and over -- your OS resolver, your browser, background services all hammering the same domains. Forwarding every one of those over a fresh TLS handshake is wasteful and slow. So we cache. And caching is the one place we can no longer treat the DNS packet as opaque, because a cache entry has to expire, and DNS tells you exactly how long an answer is good for: the TTL (time-to-live) field on each record. To read it, we finally crack the packet open -- reusing the very parser we wrote in episode 82.

There is one subtlety that will bite you if you miss it, and it's worth stating loudly. The first two bytes of every DNS message are the transaction ID, a number the client picks to match a reply to its question. Two apps asking for the same name will use different IDs. So the cache key must be the query with the ID stripped off (everything after the first two bytes: flags, counts, and the question), and when we serve a cached answer we must patch the stored response's ID to match whatever the current asker used -- otherwise the app throws our reply away as unsolicited. Cache the question, not the ID; return the answer, with the asker's ID stamped on.

const Entry = struct {
    packet: []u8,  // the full DNS response, owned
    expires: i64,  // unix seconds; from the smallest TTL in the answer
};

const Cache = struct {
    alloc: std.mem.Allocator,
    map: std.StringHashMapUnmanaged(Entry) = .{},

    fn init(alloc: std.mem.Allocator) Cache {
        return .{ .alloc = alloc };
    }

    // The cache key is the query WITHOUT its first two ID bytes: the question,
    // not the client's per-request nonce. Same question -> same cache slot.
    fn keyOf(query: []const u8) []const u8 {
        return if (query.len > 2) query[2..] else query;
    }

    fn get(self: *Cache, query: []const u8) !?[]u8 {
        const entry = self.map.get(keyOf(query)) orelse return null;
        if (std.time.timestamp() >= entry.expires) return null; // stale -> miss
        // Hand back an owned copy with the CURRENT query's ID stamped in.
        const copy = try self.alloc.dupe(u8, entry.packet);
        copy[0] = query[0];
        copy[1] = query[1]; // graft the asker's transaction ID onto our answer
        return copy;
    }
};

The keyOf slice trick is the whole idea in one line: query[2..] drops the two ID bytes and leaves the rest of the query -- flags, counts, and question -- which is the part that actually identifies what was asked. When two different apps asking for hive.blog set the same flags and attach the same (or no) EDNS options, they produce identical bytes from offset 2 onward and collide on the same cache slot exactly as we want; queries that differ in those extras land in separate slots, which costs a cache hit but stays correct. The only thing we have to fix up on the way out is grafting each asker's own ID back onto our stored answer (copy[0] and copy[1]). Store the question, patch the ID -- get that pair right and the cache is transparent; get it wrong and half your lookups mysteriously "fail".
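
If differing flags or EDNS records turn out to defeat the cache in practice, a stricter key is possible: slice out only the first question. The sketch below is a hypothetical alternative to keyOf, not what the proxy above uses. It assumes one question per query and an uncompressed name, and it deliberately ignores flags, so two queries that want different handling would share one stored answer; whether that trade is acceptable is a design call.

```zig
const std = @import("std");

// Hypothetical alternative to keyOf: return only the first question
// (QNAME + QTYPE + QCLASS), ignoring the ID, the flags, and any records
// that follow the question section.
fn questionKey(query: []const u8) ![]const u8 {
    if (query.len < 12) return error.Truncated; // no room for a DNS header
    var pos: usize = 12;
    while (true) {
        if (pos >= query.len) return error.Truncated;
        const len = query[pos];
        if (len == 0) break; // root label: end of the name
        // Top two bits set means a pointer or reserved form; this sketch
        // only accepts plain labels in a query name.
        if ((len & 0xC0) != 0) return error.UnsupportedLabel;
        pos += 1 + @as(usize, len);
    }
    const end = pos + 1 + 4; // root label, then QTYPE and QCLASS
    if (end > query.len) return error.Truncated;
    return query[12..end];
}
```

The returned slice borrows from the query buffer, so it would still need the same dupe before being stored as a map key.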

Populating the cache is where episode 82's parser earns its second life. We walk the answer's records just far enough to find the smallest TTL among them (an answer is only fully valid until its shortest-lived record expires) and turn that into an absolute expiry timestamp:

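fn put(self: *Cache, query: []const u8, answer: []const u8) !void {
    // Reuse ep82's parser to read the minimum TTL across all answer records.
    const ttl = dns.minTtl(answer) catch 60; // default 60s if we can't parse
    const stored = try self.alloc.dupe(u8, answer);
    errdefer self.alloc.free(stored);
    const key = try self.alloc.dupe(u8, keyOf(query));
    errdefer self.alloc.free(key);

    // Refreshing a stale slot: free the old key and packet first. A plain
    // map.put would keep the old key and silently leak both allocations.
    if (self.map.fetchRemove(key)) |old| {
        self.alloc.free(old.key);
        self.alloc.free(old.value.packet);
    }
    try self.map.put(self.alloc, key, .{
        .packet = stored,
        .expires = std.time.timestamp() + @as(i64, ttl),
    });
}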

The dns.minTtl helper is the only DNS-aware code in the whole proxy, and it's a thin skim over episode 82's parser -- walk the header to find how many answers there are, skip the question, then read each record's 4-byte TTL and keep the smallest:

// Return the smallest TTL (seconds) across all answer records, reusing ep82's
// reader. This is the ONLY place the proxy actually understands DNS.
fn minTtl(packet: []const u8) !u32 {
    var r = dns.Reader.init(packet); // ep82: cursor over the message bytes
    const header = try r.readHeader();
    try r.skipQuestions(header.qdcount); // jump past the question section

    var smallest: u32 = std.math.maxInt(u32);
    var i: usize = 0;
    while (i < header.ancount) : (i += 1) {
        const rr = try r.readRecord(); // name, type, class, TTL, rdata
        smallest = @min(smallest, rr.ttl);
    }
    // No answers (NXDOMAIN etc.) -> nothing to cache long; keep it brief.
    return if (header.ancount == 0) 30 else smallest;
}

We lean on header.ancount -- the answer count the DNS header hands us -- to know how many records to read, and take the running @min of their TTLs, because an answer set is only trustworthy until its shortest-lived member goes stale. An NXDOMAIN (no such name) carries zero answer records, so there's no TTL to read and we cache the negative result for a short, conservative 30 seconds instead of forever -- names come into existence, and you don't want a typo'd lookup poisoning the cache for an hour.

Notice the catch 60: if the response is something our parser chokes on -- a record type from the future, a malformed answer -- we do not crash and we do not refuse to cache, we fall back to a conservative 60-second lifetime. A cache that occasionally holds an entry slightly too long or too short is fine; a proxy that crashes on an odd packet is not. Degrade the nicety, never the job -- the same rule that governed the chat server's history push last episode.
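
Since episode 82's dns.Reader is not reproduced here, the following is a stand-in sketch of the same walk over raw bytes, so the offsets are visible. skipName and minTtlRaw are hypothetical names, and the sketch assumes well-formed records where a compression pointer, if present, ends the name.

```zig
const std = @import("std");

// Return the offset just past the encoded name that starts at `start`.
// Handles plain length-prefixed labels and a terminating 2-byte pointer.
fn skipName(pkt: []const u8, start: usize) !usize {
    var pos = start;
    while (true) {
        if (pos >= pkt.len) return error.Truncated;
        const len = pkt[pos];
        if (len == 0) return pos + 1; // root label ends the name
        if ((len & 0xC0) == 0xC0) {
            if (pos + 2 > pkt.len) return error.Truncated;
            return pos + 2; // compression pointer ends the name
        }
        if ((len & 0xC0) != 0) return error.BadLabel; // reserved label forms
        pos += 1 + @as(usize, len);
    }
}

// Smallest TTL across the answer section; 30 when there are no answers,
// mirroring the fallback used in the listing above.
fn minTtlRaw(pkt: []const u8) !u32 {
    if (pkt.len < 12) return error.Truncated;
    const qdcount = std.mem.readInt(u16, pkt[4..6], .big);
    const ancount = std.mem.readInt(u16, pkt[6..8], .big);

    var pos: usize = 12;
    var q: usize = 0;
    while (q < qdcount) : (q += 1) {
        pos = try skipName(pkt, pos);
        pos += 4; // QTYPE + QCLASS
    }

    var smallest: u32 = std.math.maxInt(u32);
    var i: usize = 0;
    while (i < ancount) : (i += 1) {
        pos = try skipName(pkt, pos);
        // Fixed part of a record: type(2) class(2) ttl(4) rdlength(2).
        if (pos + 10 > pkt.len) return error.Truncated;
        const ttl = std.mem.readInt(u32, pkt[pos + 4 ..][0..4], .big);
        const rdlen = std.mem.readInt(u16, pkt[pos + 8 ..][0..2], .big);
        smallest = @min(smallest, ttl);
        pos += 10 + @as(usize, rdlen); // hop over the rdata
    }
    return if (ancount == 0) 30 else smallest;
}
```

Any error this returns is what the catch 60 in put would absorb, so a truncated or odd packet still gets cached for a short while and never takes the proxy down.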

Testing the part that can be wrong

Same closing instinct as every episode since number 12: don't test the network, test the logic that can silently drift. The forwarding path we verify by running the proxy against a real resolver -- there's nothing deterministic to unit-test in "did TLS work". But the ID-patching in the cache is pure byte manipulation with a nasty failure mode (return the wrong ID and every reply gets silently dropped by clients), so that is precisely what gets pinned down with no socket in sight.

test "cache patches the current query's ID onto the stored answer" {
    const alloc = std.testing.allocator;
    var cache = Cache.init(alloc);
    defer cache.deinit();

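    // The first asker used ID 0xAAAA; the stored answer echoes it (flags 0x8180).
    const first_query = [_]u8{ 0xAA, 0xAA, 0x01, 0x00, 'x' };
    const stored = [_]u8{ 0xAA, 0xAA, 0x81, 0x80, 'x' };
    // Helper: store under the QUERY's key with an explicit TTL. Keying on the
    // answer itself would miss, because its flag bytes differ from a query's.
    try cache.putRaw(&first_query, &stored, 300);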

    // A NEW query for the same question, but with a different ID 0x1234.
    const query = [_]u8{ 0x12, 0x34, 0x01, 0x00, 'x' };
    const hit = (try cache.get(&query)).?;
    defer alloc.free(hit);

    // Same body as stored, but the ID must be the asker's, not 0xAAAA.
    try std.testing.expectEqual(@as(u8, 0x12), hit[0]);
    try std.testing.expectEqual(@as(u8, 0x34), hit[1]);
    try std.testing.expectEqualStrings("x", hit[4..]);
}

Running under std.testing.allocator (episode 26's leak detector) buys us a second guarantee for free: every dupe we hand out gets freed, or the test fails on a detected leak rather than passing quietly and rotting in production. The assertion that matters is the ID bytes -- 0x12, 0x34, the asker's ID, not the stored 0xAA, 0xAA -- because that is the exact bug that would make the whole cache look "broken" in a way that's maddening to debug live: replies arrive, clients ignore them, everything times out, and nothing logs an error. Catch it in a short test instead of a two-hour production hunt.
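
The test calls two things the listings never define: Cache.deinit and the putRaw helper. Here is a minimal sketch of both inside a cut-down Cache, so the test has something to run against. It assumes putRaw is keyed on a query (so the answer's different flag bytes do not change the slot) and that deinit frees every owned key and packet; both are my guesses at the intended shape.

```zig
const std = @import("std");

const Entry = struct {
    packet: []u8, // owned copy of the DNS response
    expires: i64, // unix seconds
};

const Cache = struct {
    alloc: std.mem.Allocator,
    map: std.StringHashMapUnmanaged(Entry) = .empty,

    fn init(alloc: std.mem.Allocator) Cache {
        return .{ .alloc = alloc };
    }

    // Free every owned key and packet, then the table itself.
    fn deinit(self: *Cache) void {
        var it = self.map.iterator();
        while (it.next()) |kv| {
            self.alloc.free(kv.key_ptr.*);
            self.alloc.free(kv.value_ptr.packet);
        }
        self.map.deinit(self.alloc);
    }

    fn keyOf(query: []const u8) []const u8 {
        return if (query.len > 2) query[2..] else query;
    }

    // Assumed test helper: store `answer` under `query`'s key with an
    // explicit TTL, bypassing the TTL parser entirely.
    fn putRaw(self: *Cache, query: []const u8, answer: []const u8, ttl: u32) !void {
        const stored = try self.alloc.dupe(u8, answer);
        errdefer self.alloc.free(stored);
        const key = try self.alloc.dupe(u8, keyOf(query));
        errdefer self.alloc.free(key);

        // Replace an existing slot cleanly so nothing is leaked.
        if (self.map.fetchRemove(key)) |old| {
            self.alloc.free(old.key);
            self.alloc.free(old.value.packet);
        }
        try self.map.put(self.alloc, key, .{
            .packet = stored,
            .expires = std.time.timestamp() + @as(i64, ttl),
        });
    }

    // Same logic as the get shown earlier: owned copy, asker's ID patched in.
    fn get(self: *Cache, query: []const u8) !?[]u8 {
        const entry = self.map.get(keyOf(query)) orelse return null;
        if (std.time.timestamp() >= entry.expires) return null;
        const copy = try self.alloc.dupe(u8, entry.packet);
        copy[0] = query[0];
        copy[1] = query[1];
        return copy;
    }
};
```

Because deinit walks the map and frees both halves of each entry, the leak detector stays quiet even when a test stores entries and never reads them back.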

Performance and design considerations

The costs here divide cleanly into "the slow thing" and "everything else". The slow thing is the upstream DoH round trip: a TLS handshake plus an HTTP request over the public internet, tens of milliseconds if you're lucky. Everything the proxy itself does -- the UDP recv, the cache lookup, the ID patch -- is microseconds. So the single highest-leverage optimisation is the cache, and it's why we built it: a cache hit turns a 30ms network round trip into a hash-map lookup and a dupe, a thousandfold win on the hot path. Reusing one std.http.Client across queries matters too, because it keeps the TLS connection to the upstream alive (episode 84's keep-alive) instead of paying for a fresh handshake every single lookup -- that alone can halve the miss latency. Note that forwardDoH as listed builds and tears down its own client on every call, so to collect this win the client has to move into the Resolver and live as long as it does.

The one number a careless design could blow is memory. Each cache entry is one small DNS packet (capped at 64 KiB by our read ceiling), and stale entries stop being served the instant std.time.timestamp() passes their expiry, but as written they are only skipped on read, never freed, so the map itself has no upper bound. A production build would add a periodic sweep to actually free expired entries (a timer from episode 70 would do it), so a machine that looks up a million distinct names overnight doesn't accumulate a million dead slabs. I'm leaving that sweep as the obvious next stitch rather than building it today, but the seam is right there. The design I'd defend hardest is treating the DNS packet as opaque everywhere except the cache -- it means the proxy will faithfully forward query types it has never heard of, because it never claimed to understand them in the first place. Complexity you don't take on can't bite you.
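
For a sense of how small that sweep is, here is a minimal sketch over the same map shape. sweepExpired is a hypothetical name, it takes the current time as a parameter so it can be tested without a clock, and it collects keys before removing them so it never relies on mutating the map mid-iteration. Wiring it to a timer is left out.

```zig
const std = @import("std");

const Entry = struct {
    packet: []u8, // owned DNS response
    expires: i64, // unix seconds
};
const Map = std.StringHashMapUnmanaged(Entry);

// Hypothetical sweep: free every entry whose expiry is at or before `now`.
// Returns how many entries were dropped.
fn sweepExpired(alloc: std.mem.Allocator, map: *Map, now: i64) !usize {
    // Pass 1: remember which keys are dead (borrowed slices, not copies).
    var dead: std.ArrayListUnmanaged([]const u8) = .empty;
    defer dead.deinit(alloc);

    var it = map.iterator();
    while (it.next()) |kv| {
        if (now >= kv.value_ptr.expires) try dead.append(alloc, kv.key_ptr.*);
    }

    // Pass 2: remove each one and release the key and packet it owned.
    for (dead.items) |key| {
        if (map.fetchRemove(key)) |old| {
            alloc.free(old.key);
            alloc.free(old.value.packet);
        }
    }
    return dead.items.len;
}
```

Called every minute or so with std.time.timestamp(), this would keep the map's size proportional to the names looked up recently, not to every name ever seen.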

How this compares to C, Rust, and Go

In C, the UDP relay is the familiar recvfrom/sendto pair we wrote, near-identical since Zig's posix layer is a thin skin over the same syscalls. The pain is everything above the socket: you are not using libcurl for a hobby DoH client without a day of setup, and if you hand-roll the HTTPS you're wiring OpenSSL's SSL_read/SSL_write yourself and manually freeing every buffer, with the ID-patch and the cache-key slice being raw pointer arithmetic the compiler won't check. It works, and a lot of production resolvers are C, but every ownership question we named out loud is one C leaves to your diligence.

In Rust, you'd reach for tokio plus reqwest and the forwarding becomes a short async function, with the borrow checker guaranteeing you never hand out a cache entry that's been evicted -- the exact class of use-after-free we avoid by copying on the way out. The cache would likely be a HashMap<Vec<u8>, Entry> behind a Mutex or a purpose-built crate like moka, and Vec<u8> owning its bytes makes the key-lifetime question we handled with a manual dupe simply disappear. The cost is the async runtime and the dependency tree; the safety is real and mostly free at runtime.

In Go, this is almost unfair -- net.ListenPacket for the UDP side, http.Client for the DoH POST, a map[string][]byte behind a sync.RWMutex for the cache, and the whole thing is maybe eighty lines with the GC erasing every dupe and free we spent paragraphs on. It is the pragmatic choice if you just want the tool. What our version buys, and the reason we wrote it this way, is that you can point at the exact byte that gets copied, the exact slice that becomes the cache key, the exact two bytes that get patched. For learning, seeing the machinery is the value; for shipping, pick whichever of these lets you sleep.

Where this is heading

Step back at what we assembled. A UDP listener from episode 81 takes plaintext queries; a DoH POST built on episodes 84 and 86 forwards them encrypted; episode 82's parser reads a TTL so a cache from episode 22's hash maps can turn repeat lookups into microseconds; and episode 4's error sets make every failure mode a named, handled thing. Run it, point dig @127.0.0.1 -p 5353 hive.blog at it, and watch a plaintext question come out the far side as private HTTPS -- a real privacy tool, built from parts we understand down to the byte. Not one piece of it was new; the composition was the whole exercise.

And that composition is the actual lesson of this mini project, more than DNS or DoH specifically. A proxy is just "speak protocol A on one side, protocol B on the other, and translate in the middle", and that pattern is everywhere -- it's how load balancers, API gateways, and privacy tools all work under the hood. We built one that translates transport (UDP to HTTPS). The natural next move is a proxy that doesn't just relay bytes but inspects and routes them, and before that, tools that go out and actively probe the network rather than waiting to be asked. There's a whole run of those ahead -- programs that measure, that scan, that route under load -- and the kit we've been sharpening for a hundred episodes is exactly what they're built from. We'll aim it at the next shape of problem soon.

Thanks, and until next time!

@scipio