This is the 4th post in this series on Reverse Vibe-Coding or RVC. The previous posts can be found here:
In this post we look at seed-coding as an alternative to vibe-coding for bigger experiments and non-PO prototyping. Seed-coding is something I ran into by accident, independently of my RVC efforts. At first I thought its properties were not real, a pure fluke, but after about a work-week of experiments it turned out that seed-coding works. It is an alternative to vibe-coding for bigger experiments and prototypes that costs the same amount of human time. Its maturity is still low enough that, like many things in the RVC workflow, every shop should find its own way in. Remember that RVC isn't a hard workflow; it's an outline for building bespoke team-specific workflows, and as such seed-coding fits right in.
| Type of work | who | timeframe | coding practice |
|---|---|---|---|
| new business logic | developer | - | 100% manual |
| RVPP template/routing repo | developer | - | manual and/or Lode Coding |
| refactoring | developer | - | RVPP-based |
| scaffolding | developer | - | RVPP-based |
| boilerplate | developer | - | RVPP-based |
| living runnable specs | product owner | - | Vibe-coding |
| prototype | product owner | - | Vibe-coding |
| prototype | developer | - | Seed-coding |
| experiments | developer | < half a day | Vibe-coding |
| experiments | developer | > half a day | Seed-coding |
Seed-coding fills only two slots in the Reverse-Vibe-Coding workflow. Note that the half-a-day threshold above is just indicative for now; more on that in the section about experiment results. Because we don't expect product owners to code, prototypes made by the PO don't qualify, but prototypes made by a developer are an important slot where seed-coding should come into its own. The second slot is code experiments, but not across the board: my experiment results showed no benefit from seed-coding over vibe-coding for experiments that take significantly less than half a working day of human time.
Where with vibe-coding you express intent in natural language, asking the language model to translate your English into architecture, design and code, with seed-coding you write code by hand. That code isn't meant as implementation, though; it is meant to outline intent more accurately, in the form of code. It is essential that you don't waste time on anything that doesn't convey intent. Trained developers instinctively want to write good-quality code: do at least some error handling, and document with comments not just for intent but for future maintenance. This is an urge you need to get over. All code should be happy path. Worried about an off-by-one bug? Don't be! Code doesn't compile? Spend 10 minutes max on it; if that's not enough, tough luck.

It is useful to timebox the intent creation. One to two thirds of a workday worth of coding seems like a good sweet spot. Smaller, and the benefits of seed-coding over vibe-coding evaporate; bigger, and you will need to get back into the flow the next working day on an uncompleted seed, which is suboptimal from a human perspective. Tomorrow simply starts a new timebox on another part of the prototype or bigger experiment. It is really important to stress that we aren't trying to write prototype-plus-grade code here; the model(s) will help us get there. We are purely codifying intent.
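To make this concrete, here is a minimal sketch of what a seed could look like. It is a hypothetical illustration, not code from the experiments described in this post: the function names, the toy hash-chain scheme and the `top` parameter are all assumptions. Note what is deliberately absent: input validation, error handling and maintenance comments.

```python
import hashlib


def chain(value: bytes, steps: int) -> bytes:
    # intent: apply SHA-256 `steps` times in a row
    for _ in range(steps):
        value = hashlib.sha256(value).digest()
    return value


def sign_digit(secret: bytes, digit: int) -> bytes:
    # intent: the signature is the chain element at position `digit`
    return chain(secret, digit)


def verify_digit(signature: bytes, digit: int, public: bytes, top: int = 15) -> bool:
    # intent: finishing the chain from the signature must land on the public value
    return chain(signature, top - digit) == public
```

Negative digits, wrong key lengths and digits above `top` are all ignored here. In the seed-coding flow those are exactly the things left for the pipeline to raise.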
As we saw with the RVPP-driven workflow for refactoring, the seed-coding flow too is a triage-oriented workflow. But unlike the refactoring workflow, triage doesn't happen on variants; it happens on proposals. The flow is a multi-batch flow with triage interruptions for human interaction. Please note that the four batch phases below are just an example of a seed-coding agentic workflow. Think about and make your own bespoke phases. The only fundamentally RVC part of this example is the strong focus on triage as the main human input, and the absence of unbounded repair rounds.
In the first batch phase, a small model phase for low drift deterministic results, we route the seed code to language models with prompts and repair loops, prompting the model to:
After batch-phase 1, we have code that at least compiles and doesn't crash on minimum fuzzing.
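A bounded repair loop for the "at least compiles" part of this phase could be sketched as follows. This is an illustration under assumptions, not the pipeline I run: `repair` stands in for whatever call sends the source and the error message to your small model, and `max_rounds` is an arbitrary budget. Python's built-in `compile()` is used as a syntax check only; nothing is executed.

```python
from typing import Callable, Optional


def bounded_repair(
    source: str,
    repair: Callable[[str, str], str],  # hypothetical: (source, error) -> patched source
    max_rounds: int = 3,                # assumed budget, tune for your own shop
) -> Optional[str]:
    """Return source that compiles, or None once the repair budget is spent."""
    for round_no in range(max_rounds + 1):
        try:
            compile(source, "<seed>", "exec")  # syntax check only
            return source
        except SyntaxError as err:
            if round_no == max_rounds:
                return None  # give up instead of looping forever
            source = repair(source, f"{err.msg} (line {err.lineno})")
    return None
```

Returning `None` instead of retrying indefinitely is the point: a seed that cannot be repaired within the budget goes back to a human.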
In the second phase we move our routing from our regular 15b..32b models to 70b or bigger models, more on that later, and we set think=true to enable reasoning. Using cloud models here can be a valid option.
In this zero human batch we present the (potentially patched) seed code to two models and:
Like the previous batch, we again use two bigger models in this step. We are very explicit that we only want high-confidence code issues and bugs, and we ask both models for a code review with one line per issue.
So we again end up with two lists, which we treat similarly up to and including the deduplication, with a boolean denoting whether an item is a bug. We then ask model C to write a regression test for each bug to show that it is real, and unconfirmed bugs are removed from the list.
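The filtering step at the end can be sketched like this. The `Finding` type and the `reproduces` callback are hypothetical names; in a real pipeline `reproduces` would run the regression test that the model wrote and report whether it demonstrates the bug.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Finding:
    summary: str  # the one-line review comment
    is_bug: bool  # the boolean set alongside deduplication


def drop_unconfirmed(
    findings: List[Finding],
    reproduces: Callable[[Finding], bool],  # hypothetical: runs the regression test
) -> List[Finding]:
    # Non-bug issues pass through untouched; bugs stay only when the
    # regression test actually reproduces them.
    return [f for f in findings if not f.is_bug or reproduces(f)]
```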
At this point we bring a human into the loop and ask the human to go through the list as triage and remove the lines we don't need.
Then we:
Now for the last batch, here we actually run the fuzz test for a few hours and make a collection of inputs that made the fuzz test trigger. We present the results from the fuzz test to a model and ask it to group the triggers into likely related clusters.
So how do you choose between the usual local models in the 15b..32b range that handle most of the RVPP routing in RVC, and the 70b+ models and cloud frontier models common in vibe-coding? The trick is that larger models excel at finding things (bugs, solutions, and so on), while smaller models are much better at producing reliable and reproducible results from canned prompts and prompt templates. You could say small models obey better, but in my experience they won’t suggest an out-of-memory error flow or point out code that isn’t necessarily wrong ‘yet’ but is fragile. It is always advisable to pick the smallest model that can do the job. As a rule of thumb, when you want to use a 70b+ model, a frontier cloud model or a 300b+ local model, it will usually be just before you introduce a human triage step. If you find yourself choosing a 300b+ model and there is no human triage in sight after it has done its thing, it might be a good idea to experiment with a 70b+ or even a 15b..30b+ range model.
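The rule of thumb can be turned into a small sanity check on a pipeline definition. This is only a sketch: the `Step` type, the tier labels and the step names in the usage note are made up for illustration.

```python
from typing import List, NamedTuple


class Step(NamedTuple):
    name: str
    tier: str  # assumed labels: "small" (15b..32b), "large" (70b+), "human"


def large_steps_without_triage(steps: List[Step]) -> List[str]:
    """Names of large-model steps that have no human step anywhere after them."""
    flagged = []
    for i, step in enumerate(steps):
        later_human = any(s.tier == "human" for s in steps[i + 1:])
        if step.tier == "large" and not later_human:
            flagged.append(step.name)
    return flagged
```

For a pipeline of `repair` (small), `propose` (large), `triage` (human), `cluster` (large), the check flags only `cluster`, which is then a candidate for trying a smaller model.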
Because we are “only” making a prototype or an addition to our experiment library, the TDD bit of this part of the RVC workflow doesn’t need to be anywhere close to perfect. The focus is on relatively clean inspirational code that won’t make it to production, but that will convey the ideas we need to start with clean hand-written business logic later in the process.
As we discussed in earlier posts, unit tests and agentic loops don’t mix, and for production level code we need extensive property testing, but for prototypes and experiments we are going for the quick low-human-effort wins, and regression tests (for validating bugs were actually there and checking they might be solved), fuzz tests, and smoke-tests are all we logically need for the level of quality we are aiming for. It is important to note that we don’t actually make smoke tests. Instead we turn down the fuzz generator to a uselessly small burst from a fuzz testing perspective and call that the smoke test.
It is relatively easy for a language model to autonomously write a fuzz test and fuzz generator for your code, and computer time comes cheap, so let the fuzz test rip for an hour, or two if you like. We don’t pretend it will find all the bugs, but if there are glaring ones that the LLM reviews didn’t spot, chances are the fuzz tests will tease them out with zero need for human interaction.
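The idea that the smoke test is just the fuzz test turned down can be sketched as below. This is a stand-in under assumptions, not generated pipeline code: a real fuzz generator would be written by the model for your specific code, while this one simply throws random bytes at a `target` callable. The parameter names and the burst size of 20 are arbitrary.

```python
import random
import time
from typing import Callable, List


def fuzz(
    target: Callable[[bytes], object],
    max_cases: int,
    max_seconds: float = float("inf"),
    seed: int = 0,
) -> List[bytes]:
    """Run random inputs against target and collect the ones that raise."""
    rng = random.Random(seed)  # seeded, so a run can be repeated
    deadline = time.monotonic() + max_seconds
    triggers = []
    for _ in range(max_cases):
        if time.monotonic() >= deadline:
            break
        data = bytes(rng.getrandbits(8) for _ in range(rng.randint(0, 64)))
        try:
            target(data)
        except Exception:
            triggers.append(data)  # keep the input for later clustering
    return triggers


def smoke(target: Callable[[bytes], object]) -> List[bytes]:
    # Same generator, uselessly small burst: this is the whole smoke test.
    return fuzz(target, max_cases=20)
```

The real run is the same call with a large `max_cases` and `max_seconds` set to an hour or two; the returned list is what gets handed to a model for clustering.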
So far this post has asked you to take my word for it, and I must admit that the threshold in the table for when to use vibe-coding and when to use seed-coding is currently based on too little and too narrow data. I’m just a single person with a limited amount of time available for doing things in a scientifically sufficient way, but I am a data guy, so I make do with the hours I have. For this subject, it was a well-filled week in which I ran 10 coding sessions in pairs of two.
My first experiment was just a what-if thing. I had set up most of the experiment code improvement pipeline on my new homelab Forgejo server, and I wanted to test the idea: “Hey, what if I try to let an LLM write all the error flow code?” I did just that. I took a few hash-based signature ideas and coded them in Python, with no error handling whatsoever, all prototype level. It took me 3:30 to code. Add 30 minutes for the triage bits and manual tuning, and two hours of machine time, mostly spent on fuzzing, and I had what turned out to be a useful base.
The result: pretty good error handling code, a few bugs found by the LLM, one by the fuzz test, but what stood out most was that the error flow code didn't have the usual vibe-coded smell.
Then I got curious: what if I tried to vibe-code the same thing? I expected the vibe-coded version to win, because it would be tainted by the knowledge I had gained from running this pipeline, so I thought about a comparable setup. I decided to spend 4 hours vibe-coding, make sure the fuzz generator created by the first run would fit, and then plug the result directly into the bugfind/bugfix and fuzz part of the pipeline.
The result:
At this point I didn’t know what to think. My seed code was still relatively clean, so I wasn’t sure. I decided I needed to do a second and a third run. I stuck to Python and crypto-related things, and I picked two subjects that I had worked with earlier. But I decided to dial down the quality of my own seed code. For my second attempt this was easy, because I ran out of time on my self-imposed 3:30 timebox and ended up doing the run with non-compiling code (I adjusted the pipeline to handle that).
After the three runs, the 55% extra lines of code from the first run had dropped to an average of 45%, and the other findings pretty much centered around those of my first attempt.
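The 45% figure follows directly from the LOC column of the results table further down, taking the three 4-hour runs. As a quick check (the variable names are just for illustration):

```python
# LOC percentages of the three 4-hour runs, as listed in the results table
loc_percent = {
    "hash-based signatures WOTS": 155,
    "json-rpc & ecdsa": 130,
    "certificates & ecdsa": 150,
}

mean_percent = sum(loc_percent.values()) / len(loc_percent)
extra_percent = mean_percent - 100  # lines of code above the 100% baseline
print(mean_percent, extra_percent)  # 145.0 45.0
```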
Now I had three data points that all agreed, but they all used the same time window. So my next question to myself was: “What if the prompting/seed-coding time was significantly longer or shorter?” I thought of two other problems to experiment with. For the first I halved the prompting and the seed-coding time to 1:45 and 2:00 respectively, and for the other I doubled it to 7:00 and 8:00.
Results: both the vibe-coding and the seed-coding results got a little bit worse, but the scaled ‘relative’ numbers for the long version fit right in. It was a little above the mean on the important metric, but well below the best of the three base experiments. In conclusion, throwing more time at the comparison doesn’t reduce the advantage of seed-coding over vibe-coding in outcome quality at equal human hours invested.
The 1:45 and 2:00 results though weren’t as positive. In fact, the difference between the vibe-coding result and the seed-coding result was roughly zero, and seed-coding was even worse on one point.
The table below shows the vibe-coding version of each experiment against the seed-coding version:
| experiment | human time | LOC | LLM bugs found | fuzzing bugs found (vibe/seed) |
|---|---|---|---|---|
| hash-based signatures WOTS | 4h | 155% | Equal | 4/1 |
| json-rpc & ecdsa | 4h | 130% | Equal | 4/2 |
| certificates & ecdsa | 4h | 150% | Equal | 1/0 |
| hash-based signatures WOTS + Merkle-tree | 8h | 150% | Equal | 5/2 |
| hash based crypto, rumpeltree & aes | 2h | 105% | 50% | 1/1 |
It is important to note that the code in my benchmark was all code experiments, not full prototypes; it was all Python, and it was all code without any kind of GUI. So do with this what you must, and run your own experiments to find the right thresholds for choosing between vibe-coding and seed-coding for experiments and prototypes in your workflow.
As stated before, I started my experiment while playing with my new homelab Forgejo server. I’m configuring this server as separate from my RVPP git server (which I hope to move to Forgejo too at some point) because RVC values provenance very highly, and seed- and vibe-coding are very casual with generated code. Unless we limit ourselves to StarCoder alone, which isn’t a pleasant option, vibe-coded and seed-coded code contains relatively free-form LLM-generated code that is likely to include more or less regurgitated bits from restrictively licensed open source training data. While fitting strong due diligence into an RVC workflow is a valid choice, it brings a lot of technical and logistical overhead. For that reason RVC advocates defaulting to a so-called inspiration wall: a git server where the code from prototypes and experiments lives as a wall of relatively decent quality inspiration. You don’t copy from the wall; you look at the wall, get inspired, then go and hand-write the initial business logic.
One way to create an inspiration wall is to disable direct git interaction with the experiments and prototypes git server and make it web-only. How you make your inspiration wall is up to you, but consider that copyright and quality concerns make it a solid choice to forgo the need for due diligence by simply not allowing experiment and prototype code to find its way into production other than as inspiration for human developers.
In this post we discussed an RVC sub-workflow that I ran into by pure chance. I resisted my findings at first, out of a natural repulsion for the poor coding practices that seed-coding not only allows but actually demands. It feels quite unnatural to ignore good coding practices, but in the end you aren’t really coding; you are using code to express intent. I admit that more data is needed to determine when to use vibe-coding for experiments and prototypes, and when seed-coding gives better results as shown in my experiments. Nevertheless I think seed-coding deserves its place within the RVC framework, and maybe it has its use outside of RVC as well.