ClearFactr: On Clarity

"Well that's interesting"

Written by Dean Zarras | April 17, 2026

In prepping for the release of ClearFactr's MCP server, which will let you create from scratch, edit, and compute with models using natural-language prompts, I tripped up Claude Desktop on something. It's the kind of thing that might be impossible to find, or otherwise know about. Blog-post worthy for sure!

I asked Claude to build me a Black-Scholes options pricing model, straight into ClearFactr. That is, not an Excel file that I'd run through ClearFactr's importer, but a model that would simply appear when I hit the refresh button in ClearFactr's Model Browser.
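For context, here's a minimal Python sketch of the standard closed-form Black-Scholes call price such a model computes. The post doesn't show the generated formulas, so this is the textbook version, assuming a European call with no dividends:

```python
from math import log, sqrt, exp, erf

def norm_cdf(x):
    # Standard normal CDF via the error function
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def black_scholes_call(S, K, T, r, sigma):
    """European call price: S spot, K strike, T years to expiry,
    r risk-free rate, sigma annualized volatility."""
    d1 = (log(S / K) + (r + 0.5 * sigma**2) * T) / (sigma * sqrt(T))
    d2 = d1 - sigma * sqrt(T)
    return S * norm_cdf(d1) - K * exp(-r * T) * norm_cdf(d2)

# A classic check point: at-the-money, 1 year, r=5%, vol=20%
print(round(black_scholes_call(100, 100, 1.0, 0.05, 0.2), 4))  # → 10.4506
```

Even in this tiny form there's plenty of room for subtle formula-translation errors, which is what the rest of this post is about.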

Claude was very thorough, and did a reconciliation of its own work before writing the formulas to the new model. It highlighted the following to me:


Admittedly, I was quite alarmed!

Amongst the thousands of unit tests we've built over 13+ years for the "inner sanctum" of the product -- the valuation engine -- had we somehow missed this? Before I ran off to add yet another unit test to isolate the problem and fix it, I had this little exchange with Claude Desktop:

Worth repeating here: "...so you'd never spot it from internal checks alone... it masked the error completely."

*It's worth noting that when you run an Excel file through ClearFactr's importer, it also does a cell-by-cell reconciliation.

I first ran to Excel to recreate the situation and a few variations, and then coded things up on my side, without changing any valuation engine code. In cases like this, the process is always the same:

  1. Recreate the bug in a unit test and watch the test fail
  2. Find and fix the bug
  3. Re-run the test and see that it passes
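The cycle above can be sketched as a minimal regression test. Everything here is a hypothetical stand-in: `evaluate` plays the role of the real valuation-engine call, and the formula is just an illustrative expression of the kind that appears in a Black-Scholes model:

```python
from math import exp

def evaluate(formula, bindings):
    # Hypothetical stand-in for the valuation engine: evaluates a
    # small Python-syntax expression with the given variable bindings.
    return eval(formula, {"exp": exp, "__builtins__": {}}, dict(bindings))

def test_suspect_formula_matches_reference():
    # Step 1: recreate the suspect case. If the engine had the bug,
    # this assertion would fail; steps 2 and 3 would fix and re-run.
    x = 1.5
    got = evaluate("exp(-x**2 / 2)", {"x": x})
    want = exp(-(x**2) / 2)
    assert abs(got - want) < 1e-12

test_suspect_formula_matches_reference()
```

In the story here, the interesting part is what happens when step 1 refuses to cooperate: the test passes, and the bug turns out to live somewhere other than the engine.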

But my new unit tests passed on the first try. No failures.

I double-checked the formulas, scratched my head, and then went back to Claude (note the example formula is slightly different here):

Now I thought: what would Grok say?

Here's the output from that convo (blue highlighting is mine):

So now I went back to Claude, and had this quick exchange:

What to make of all this?

For all of the amazing progress in the ability of LLMs to build compute models with the Excel language, increasingly subtle fine-tuning work remains to be done. ClearFactr's unit tests give us the confidence that we're comparing the behavior of our code with a reference standard. In the case of LLM-generated models, it's a bit of a brain bender to wonder what that standard needs to be in any given situation, and how a user would implement the tests.

In this case involving a Black-Scholes model, someone would need to compare the results of the new LLM-generated model to the outputs and behaviors of a trusted alternative. Here, ClearFactr quickly identified 6 inputs and 30 outputs. It's an incredibly useful little model, but let's emphasize the word "little." What someone might need to do with 60 inputs and 300 outputs, or dramatically more than that, is the stuff of many more, and very tedious, conversations.
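One way to picture that comparison: sweep the model's inputs over a grid and reconcile every output of the LLM-generated model against a trusted reference, cell by cell. This is a sketch of the idea, not ClearFactr's actual mechanism; for brevity it reconciles a single-output Black-Scholes call price against itself, and the `reconcile` harness is hypothetical:

```python
from math import log, sqrt, exp, erf
from itertools import product

def norm_cdf(x):
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def bs_call(S, K, T, r, sigma):
    # Textbook European call price, used here as the trusted reference
    d1 = (log(S / K) + (r + 0.5 * sigma**2) * T) / (sigma * sqrt(T))
    d2 = d1 - sigma * sqrt(T)
    return S * norm_cdf(d1) - K * exp(-r * T) * norm_cdf(d2)

def reconcile(candidate, reference, input_grid, tol=1e-9):
    """Return (inputs, got, want) for every case where the two models disagree."""
    mismatches = []
    for inputs in input_grid:
        got, want = candidate(**inputs), reference(**inputs)
        if abs(got - want) > tol:
            mismatches.append((inputs, got, want))
    return mismatches

# A small sweep over spot, strike, and volatility
grid = [dict(S=S, K=K, T=1.0, r=0.05, sigma=v)
        for S, K, v in product([80, 100, 120], [90, 110], [0.1, 0.3])]
assert reconcile(bs_call, bs_call, grid) == []  # reference agrees with itself
```

With 6 inputs and 30 outputs the grid is already sizable; with 60 inputs and 300 outputs it explodes combinatorially, which is exactly why this kind of testing gets tedious fast.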

We'd love to hear your thoughts!