Hacker News

Latest

Show HN: Detail, a Bug Finder

2025-12-09 @ 17:35:35 · Points: 18 · Comments: 6

Long story below.

--------------------------

We originally set out to work on technical debt. We had all seen codebases with a lot of debt, so we had personal grudges about the problem, and AI seemed to be making it a lot worse.

Tech debt also seemed like a great problem for AI because: 1) a small portion of the work is thinky and strategic, and then the bulk of the execution is pretty mechanical, and 2) when you're solving technical debt, you're usually trying to preserve existing behavior, just change the implementation. That means you can treat it as a closed-loop problem if you figure out good ways to detect unintended behavior changes due to a code change. And we know how to do that – that's what tests are for!

So we started with writing tests. Tests create the guardrails that make future code changes safer. Our thinking was: if we can test well enough, we can automate a lot of other tech debt work at very high quality.

We built an agent that could write thousands of new tests for a typical codebase, most of them "merge-quality". Some early users merged hundreds of PRs generated this way, but intuitively the tool always felt "good but not great". We used it sporadically ourselves, and it usually felt like a chore.

Around this point we realized: while we had set out to write good tests, we had built a system that, with a few tweaks, might be very good at finding bugs. When we tried it on some friends' codebases, we discovered that almost every repo has tons of bugs lurking in it that we were able to flag. Serious bugs, interesting enough that people dropped what they were doing to fix them. Sitting right there in people's codebases, already merged, running in prod.

We also found a lot of vulns, even in mature codebases, and sometimes even right after someone had gotten a pentest.

Under the hood (a rough sketch of the loop follows this list):

- We check out a codebase and figure out how to build it for local dev and exercise it with tests.

- We take snapshots of the built local dev state. (We use Runloop for this and are big fans.)

- We spin up hundreds of copies of the local dev environment to exercise the codebase in thousands of ways and flag behaviors that seem wrong.

- We pick the most salient, scary examples and deliver them as Linear tickets, GitHub issues, or emails.
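
For readers who want a mental model, here is a minimal, self-contained Python sketch of that loop. Every function in it is a trivial stand-in that mirrors one bullet above; none of these names come from Detail's actual code or API.

  # Hypothetical sketch of the scan pipeline described above; illustration only.
  def build_local_dev(repo_url):          # check out the repo and build it for local dev
      return {"repo": repo_url}

  def snapshot_dev_env(env):              # freeze the built state (they use Runloop for this)
      return dict(env)

  def spawn_copies(snapshot, n):          # hundreds of copies of the local dev environment
      return [dict(snapshot) for _ in range(n)]

  def exercise(copy):                     # exercise the codebase, flag behaviors that seem wrong
      return []

  def triage(findings, keep=20):          # keep the most salient, scary examples
      return findings[:keep]

  def scan(repo_url, n_envs=200):
      snapshot = snapshot_dev_env(build_local_dev(repo_url))
      findings = [f for copy in spawn_copies(snapshot, n_envs) for f in exercise(copy)]
      return triage(findings)             # delivered as Linear tickets, GitHub issues, or emails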

In practice, it's working pretty well. We've been able to find bugs in everything from compilers to trading platforms (even in Rust code), but the sweet spot is app backends.

Our approach trades compute for quality. Our codebase scans take hours, far beyond what would be practical for a code review bot. But the result is that we can make more judicious use of engineers’ attention, and we think that’s going to be the most important variable.

Longer term, we think compute is cheap, engineer attention is expensive. Wielded properly, the newest models can execute complicated changes, even in large codebases. That means the limiting reagent in building software is human attention. It still takes time and focus for an engineer to ingest information, e.g. existing code, organizational context, and product requirements. These are all necessary before an engineer can articulate what they want in precise terms and do a competent job reviewing the resulting diff.

For now we're finding bugs, but the techniques we're developing extend to a lot of other background, semi-proactive work to improve codebases.

Try it out and tell us what you think. Free first scan, no credit card required: https://detail.dev/

We're also scanning OSS repos, if you have any requests. The system is pretty high signal-to-noise, but we don't want to risk annoying maintainers by automatically opening issues, so if you request a scan for an OSS repo the results will go to you personally. https://detail.dev/oss

How Private Equity Is Changing Housing

2025-12-09 @ 17:26:33 · Points: 25 · Comments: 40

If You're Going to Vibe Code, Why Not Do It in C?

2025-12-09 @ 17:11:09 · Points: 73 · Comments: 80

PeerTube is recognized as a digital public good by Digital Public Goods Alliance

2025-12-09 @ 17:08:37 · Points: 110 · Comments: 14

Clearspace (YC W23) Is Hiring a Founding Designer

2025-12-09 @ 17:01:11 · Points: 1

Launch HN: Mentat (YC F24) – Controlling LLMs with Runtime Intervention

2025-12-09 @ 16:37:55 · Points: 17 · Comments: 9

We're launching Mentat (https://docs.ctgt.ai/api-reference/endpoint/chat-completions), an API that gives developers deterministic control over LLM behavior, steering reasoning and removing bias on the fly, without the compute cost of fine-tuning or the brittleness of prompt engineering. We use feature-level intervention and graph-based verification to fix hallucinations and enforce policies.

This resonates in highly regulated industries or otherwise risky applications of AI where the fallout from incorrect or underperforming output can be significant. In financial services, using GenAI to scan for noncompliant communications can be arduous without an easy way to embed complex policies into the model. Similarly, a media outlet might want to scale AI-generated summaries of their content, but reliability and accuracy are paramount. These are both applications where Fortune 500 companies have used our technology to improve subpar performance from existing models, and we want to bring this capability to more people.

Here’s a quick 2-minute demo video showing the process: https://video.ctgt.ai/video/ctgt-ai-compliance-playground-cf...

Standard "guardrails" like RAG and system prompts are fundamentally probabilistic: you are essentially asking the model nicely to behave. This often fails in two ways. First, RAG solves knowledge availability but not integration. In our benchmarks, a model given context that "Lerwick is 228 miles SE of Tórshavn" failed to answer "What is 228 miles NW of Lerwick?" because it couldn't perform the spatial inversion.

Second, prompt engineering is brittle because it fights against the model's pre-training priors. For example, on the TruthfulQA benchmark, base models fail ~80% of the time because they mimic common misconceptions found on the internet (e.g. "chameleons change color for camouflage"). We found that we could literally turn up the feature for "skeptical reasoning" to make the model ignore the popular myth and output the scientific fact. This matters because for high-stakes use cases (like Finance or Pharma), "mostly safe" isn't acceptable—companies need audit-grade reliability.

Our work stems from the CS dungeon at UCSD, with years spent researching efficient and interpretable AI, trying to "open the black box" of neural networks. We realized that the industry was trying to patch model behavior from the outside (prompts/filters) when the problem was on the inside (feature activations). We knew this was important when we saw enterprises struggling to deploy basic models despite having unlimited compute, simply because they couldn't guarantee the output wouldn't violate compliance rules. I ended up leaving my research at Stanford to focus on this.

Our breakthrough came while researching the DeepSeek-R1 model. We identified the "censorship" feature vector in its latent space. Amplifying it guaranteed refusal; subtracting it instantly unlocked answers to sensitive questions. This proved the model had the knowledge but was suppressing it. We realized we could apply this same logic to hallucinations, suppressing "confabulation" features to reveal the grounded truth. While some hallucinations stem from the inherent randomness of generative models, many can be identified with the concerted activation of a feature or group of features.

Instead of filtering outputs, we intervene at the activation level during the forward pass. We identify latent feature vectors (v) associated with specific behaviors (bias, misconception) and mathematically modify the hidden state (h):

  h_prime = h - alpha * (h @ v) * v

This arithmetic operation lets us "edit" behavior deterministically with negligible overhead (<10ms on R1). For factual claims, we combine this with a graph verification pipeline (which works on closed-weight models). We check semantic entropy (is the model babbling?) and cross-reference claims against a dynamic knowledge graph to catch subtle relational hallucinations that vector search misses.
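
To make the operation above concrete, here is a tiny NumPy sketch of the same projection edit. It is an illustration only (it assumes v is, or is normalized to, a unit vector), not Mentat's implementation.

  import numpy as np

  def intervene(h, v, alpha=1.0):
      """Dampen the component of hidden state h along feature direction v.

      Illustrative sketch of h_prime = h - alpha * (h @ v) * v;
      alpha = 1.0 removes the feature's projection entirely.
      """
      v = v / np.linalg.norm(v)              # unit-normalize so (h @ v) * v is a true projection
      return h - alpha * (h @ v) * v

  h = np.array([0.5, -1.2, 0.3, 0.9])        # toy hidden state
  v = np.array([1.0, 0.0, 0.0, 0.0])         # toy "feature" direction
  h_prime = intervene(h, v)                  # the component of h along v is now zero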

On GPT-OSS-120b, this approach improved TruthfulQA accuracy from 21% to 70% by suppressing misconception features. We also improved the performance of this model to frontier levels on HaluEval-QA, where we reached 96.5% accuracy, solving the spatial reasoning failures where the baseline failed. It also handles noisy inputs, inferring "David Icke" from the typo "David Of me" where base models gave up. Full benchmarks at https://ctgt.ai/benchmarks.

Most startups in this space are observability tools that tell you only after the model failed. Or they are RAG pipelines that stuff context into the window. Mentat is an infrastructure layer that modifies the model's processing during inference. We fix the reasoning, not just the context. For example, that’s how our system was able to enforce that if A is SE of B, then B is NW of A.

We believe that our policy engine is a superior control mechanism to RAG or prompting. If you’re frustrated with current guardrails, we’d love it if you would stress-test our API!

API: Our endpoint is drop-in compatible with OpenAI’s /v1/chat/completions: https://docs.ctgt.ai/api-reference/endpoint/chat-completions
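
Not an official snippet, but given the claimed drop-in compatibility, calling it with the standard OpenAI Python client should look roughly like this; the base URL, model name, and API key placeholder are assumptions on our part, so check the docs linked above for the real values:

  from openai import OpenAI

  # Base URL, model name, and key below are placeholders/assumptions -- see the docs linked above.
  client = OpenAI(base_url="https://api.ctgt.ai/v1", api_key="YOUR_CTGT_API_KEY")

  resp = client.chat.completions.create(
      model="your-governed-model",  # placeholder model name
      messages=[{"role": "user", "content": "What is 228 miles NW of Lerwick?"}],
  )
  print(resp.choices[0].message.content)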

Playground: We’ve built an "Arena" view to run side-by-side comparisons of an Ungoverned vs. Governed model to visualize the intervention delta in real-time. No signup is required: https://playground.ctgt.ai/

We’d love to hear your feedback on the approach and see what edge cases you can find that break standard models. We will be in the comments all day. All feedback welcome!

Hands down one of the coolest 3D websites

2025-12-09 @ 16:06:58 · Points: 163 · Comments: 41

Ask HN: Should "I asked $AI, and it said" replies be forbidden in HN guidelines?

2025-12-09 @ 16:02:37 · Points: 342 · Comments: 189

While the guidelines were written (and iterated on) during a different time, it seems like it might be time to have a discussion about whether those sorts of comments should be welcome on HN or not.

Some examples:

- https://news.ycombinator.com/item?id=46164360

- https://news.ycombinator.com/item?id=46200460

- https://news.ycombinator.com/item?id=46080064

Personally, I'm on HN for the human conversation, and large LLM-generated texts just get in the way of reading real text from real humans (assumed, at least).

What do you think? Should responses that basically boil down to "I asked $LLM about $X, and here is what $LLM said:" be allowed on HN, with the guidelines updated to state that people shouldn't critique them (similar to other current guidelines)? Or should a new guideline be added asking people to refrain from copy-pasting large LLM responses into the comments, or something else completely?

AI needs more power than the grid can deliver – supersonic tech can fix that

2025-12-09 @ 15:51:32 · Points: 26 · Comments: 32

New Pebble Device

2025-12-09 @ 15:03:09 · Points: 213 · Comments: 230

Show HN: Gemini Pro 3 hallucinates the HN front page 10 years from now

2025-12-09 @ 15:00:38 · Points: 608 · Comments: 272

Kaiju – General purpose 3D/2D game engine in Go and Vulkan with built in editor

2025-12-09 @ 14:51:46 · Points: 101 · Comments: 41

My favourite small hash table

2025-12-09 @ 14:47:20 · Points: 61 · Comments: 12

Mistral Releases Devstral 2 (72.2% SWE-Bench Verified) and Vibe CLI

2025-12-09 @ 14:45:01 · Points: 252 · Comments: 89

Oliver Sacks Put Himself into His Case Studies. What Was the Cost?

2025-12-09 @ 13:45:01 · Points: 30 · Comments: 11

Brent's Encapsulated C Programming Rules (2020)

2025-12-09 @ 11:16:10 · Points: 55 · Comments: 27

Show HN: AlgoDrill – Interactive drills to stop forgetting LeetCode patterns

2025-12-09 @ 11:09:06 · Points: 120 · Comments: 78

AlgoDrill turns NeetCode 150 and more into pattern-based drills: you rebuild the solution line by line with active recall, get first-principles editorials that explain why each step exists, and everything is tagged by patterns like sliding window, two pointers, and DP so you can hammer the ones you keep forgetting. The goal is simple: turn familiar patterns into code you can write quickly and confidently in a real interview.

https://algodrill.io

Would love feedback on whether this drill-style approach feels like a real upgrade over just solving problems once, and what’s most confusing or missing when you first land on the site.

ZX Spectrum Next on the Internet: Xberry Pi ESP01 and Pi Zero Upgrades

2025-12-09 @ 10:52:58 · Points: 48 · Comments: 0

The Joy of Playing Grandia, on Sega Saturn

2025-12-09 @ 09:48:55 · Points: 147 · Comments: 85

30 Year Anniversary of WarCraft II: Tides of Darkness

2025-12-09 @ 09:13:56 · Points: 94 · Comments: 68

Kroger acknowledges that its bet on robotics went too far

2025-12-08 @ 23:53:52 · Points: 247 · Comments: 275

Icons in Menus Everywhere – Send Help

2025-12-08 @ 19:44:00 · Points: 767 · Comments: 306

The Gamma Language

2025-12-06 @ 08:42:38 · Points: 26 · Comments: 4

AWS Trainium3 Deep Dive – A Potential Challenger Approaching

2025-12-04 @ 19:19:05 · Points: 40 · Comments: 12

Transformers know more than they can tell: Learning the Collatz sequence

2025-12-03 @ 17:50:17 · Points: 81 · Comments: 30

Animalcules and Their Motors

2025-12-03 @ 02:12:17 · Points: 6 · Comments: 1

LLM from scratch, part 28 – training a base model from scratch on an RTX 3090

2025-12-02 @ 18:17:48 · Points: 375 · Comments: 86

A deep dive into QEMU: The Tiny Code Generator (TCG), part 1 (2021)

2025-12-02 @ 13:42:43 · Points: 65 · Comments: 2

Constructing the World's First JPEG XL MD5 Hash Quine

2025-12-01 @ 22:44:15 · Points: 74 · Comments: 16

Epsilon: A WASM virtual machine written in Go

2025-12-01 @ 01:12:03 · Points: 126 · Comments: 30
