Hacker News

Latest

Claude for Small Business

2026-05-14 @ 03:59:35 Points: 192 Comments: 118

Arena AI Model ELO History

2026-05-14 @ 03:19:05 Points: 51 Comments: 24

I built a live tracker to visualize the lifecycle and performance changes of flagship AI models.

We've all experienced the phenomenon where a flagship model feels amazing at launch, but weeks later, it suddenly feels a bit off. I wanted to see if this was just a feeling or a measurable reality, so I built a dashboard to track historical ELO ratings from Arena AI.

Instead of a massive spaghetti chart of every single model variant, the logic plots exactly ONE continuous curve per major AI lab. It dynamically tracks their highest-rated flagship model over time, which makes both the sudden generational jumps and the slow performance decays much easier to see. It took quite a lot of iterations to get the chart to look nice on mobile as well. Optional dark mode included.
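
A minimal sketch of the one-curve-per-lab logic described above, assuming a flat list of rating snapshots. The row layout, lab names, model names, and ratings are all invented for illustration and are not taken from the project's repo.

```python
from collections import defaultdict

# Hypothetical snapshot rows: (date, lab, model, elo). All values are invented.
snapshots = [
    ("2026-01-01", "LabA", "a-1", 1300),
    ("2026-01-01", "LabA", "a-1-mini", 1250),
    ("2026-01-01", "LabB", "b-1", 1280),
    ("2026-02-01", "LabA", "a-1", 1290),
    ("2026-02-01", "LabA", "a-2", 1350),
    ("2026-02-01", "LabB", "b-1", 1270),
]

def flagship_curves(rows):
    """Return {lab: [(date, model, elo), ...]} with only the top-rated model per lab per date."""
    best = defaultdict(dict)  # lab -> date -> (model, elo)
    for date, lab, model, elo in rows:
        current = best[lab].get(date)
        if current is None or elo > current[1]:
            best[lab][date] = (model, elo)
    # ISO dates sort correctly as plain strings.
    return {
        lab: [(date, *best[lab][date]) for date in sorted(best[lab])]
        for lab in best
    }

curves = flagship_curves(snapshots)
```

With this data, LabA's curve jumps when a-2 overtakes a-1 (a generational jump), while LabB's curve follows b-1 as its rating drifts down.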

However, I have a specific data blind spot that I'm hoping this community might have insights on.

Arena AI largely relies on testing API endpoints. But as we know, consumer chat UIs often layer on heavy system prompts, safety wrappers, or silently switch to heavily quantized models under high load to save compute. API benchmarks don't fully capture this "nerfing" that everyday web users experience.

Does anyone know of any historical ELO or evaluation datasets that specifically scrape or test outputs from the consumer web UIs rather than raw APIs?

I'd love to integrate that data for a more accurate picture of the consumer experience. The project is open-source (repo link in the footer), so I'd appreciate any feedback or pointers to datasets!

Microsoft BitLocker – YellowKey zero-day exploit

2026-05-14 @ 02:45:08 Points: 138 Comments: 79

Show HN: Nibble

2026-05-14 @ 01:46:45 Points: 49 Comments: 4

An attempt at a single-pass LLVM frontend in ~3000 lines of C without external dependencies, malloc, or an AST. Some graphical examples are included. The IR isn't perfect, and the README touches on one particular shortcoming.

Cisco workforce reductions

2026-05-14 @ 01:38:05 Points: 183 Comments: 155

delta time

2026-05-14 @ 01:15:24 Points: 39 Comments: 18

Avoiding and reducing microplastic false positives from dry glove contact

2026-05-14 @ 01:13:02 Points: 39 Comments: 1

Scorched Earth 2000 – Web

2026-05-14 @ 00:41:29 Points: 226 Comments: 91

The other half of AI safety

2026-05-14 @ 00:27:20 Points: 77 Comments: 96

Princeton mandates proctoring for in-person exams, upending 133 year precedent

2026-05-13 @ 20:12:16 Points: 320 Comments: 466

MacBook Neo Deep Dive: Benchmarks, Wafer Economics, and the 8GB Gamble

2026-05-13 @ 18:30:51 Points: 199 Comments: 208

Launch HN: Ardent (YC P26) – Postgres sandboxes in seconds with zero migration

2026-05-13 @ 16:54:03 Points: 85 Comments: 34

Ardent (https://tryardent.com): we're building database sandboxes for you and your coding agents.

In the last two years coding agents have gotten dramatically more capable at handling complex engineering tasks. But without access to a realistic sandbox at the DB layer for testing, they ship garbage that can take down production databases. I spent over a year building an AI Data Engineer that failed for this exact reason. Evan spent the last 12 years in data engineering and hit this wall building agents at his last company.

Ardent was built to give coding agents near-instant access to production-like sandboxes so they can test their work. To do this, we write a replication stream out of the target DB, scaled with Kafka, onto a read replica with copy-on-write enabled and autoscaling compute (we currently prefer Neon as our primary branching engine because of how it implements these properties).

Our replication stream uses logical replication plus DDL triggers so that it works on any hosted Postgres DB, since most platforms do not allow physical replication, which is the traditional way to create replicas.

This provides a few primary benefits:

1. Does not require a platform migration to a DB provider like Neon, allowing strong separation of production and development concerns.
2. Minimal impact on the production database while allowing clones to spin up in <6s, even at TB scale, with copy-on-write.
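
To illustrate why copy-on-write lets clones appear quickly regardless of data size, here is a toy sketch of the general idea. It is not Ardent's or Neon's implementation; the `Branch` class and the page keys are hypothetical.

```python
class Branch:
    """Toy copy-on-write view over a shared base snapshot."""

    def __init__(self, base):
        self._base = base    # shared pages; a branch never writes to these
        self._overlay = {}   # only the pages this branch has modified

    def read(self, key):
        # Prefer the branch's own copy, otherwise fall through to the base.
        if key in self._overlay:
            return self._overlay[key]
        return self._base[key]

    def write(self, key, value):
        # Writes land in the overlay, so the base and sibling branches are untouched.
        self._overlay[key] = value

base = {"page1": "a", "page2": "b"}   # stand-in for production data
b1 = Branch(base)                     # creating a branch copies nothing
b2 = Branch(base)
b1.write("page1", "x")
```

Creating a branch only stores a reference to the base, so its cost does not grow with the amount of data; storage grows only with the pages a branch actually changes.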

Security matters a lot when cloning production, so we run a proxy layer that generates custom Postgres URLs and routes all connections. This allows more granular access control to clones and prevents credential leaks. We also follow a split-plane architecture that allows full data residency on your cloud through BYOC.

We also support anonymization: you can register SQL that runs on branches before they are returned. This has been used for PII redaction and branch modification.
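
A rough sketch of the "register SQL that runs before the branch is returned" idea, using Python's built-in sqlite3 as a stand-in for a Postgres branch. The `BRANCH_HOOKS` list, the `make_branch` function, and the `users` table are hypothetical and do not reflect Ardent's actual API.

```python
import sqlite3

# Hypothetical registry of SQL statements applied to every new branch.
BRANCH_HOOKS = [
    "UPDATE users SET email = 'user' || id || '@example.invalid'",
    "UPDATE users SET full_name = 'REDACTED'",
]

def make_branch(source, hooks=BRANCH_HOOKS):
    """Copy `source` into a fresh in-memory database, run the hooks, then return it."""
    branch = sqlite3.connect(":memory:")
    source.backup(branch)            # stand-in for creating a clone
    for statement in hooks:
        branch.execute(statement)    # redaction happens before anyone sees the branch
    branch.commit()
    return branch

# Stand-in "production" database with one row of fake PII.
prod = sqlite3.connect(":memory:")
prod.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT, full_name TEXT)")
prod.execute("INSERT INTO users VALUES (1, 'ada@corp.test', 'Ada Example')")
prod.commit()

branch = make_branch(prod)
rows = branch.execute("SELECT email, full_name FROM users").fetchall()
```

The point of the ordering is that the caller never receives a connection to an unredacted branch, while the source database is left unchanged.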

Our goal is to make every data infrastructure platform “cloneable” in one place so agents can fully test the impact of their changes on production-like data environments without risk.

Here's a demo of it: https://youtu.be/5S1kwPtiRU0

We’d love to understand how you work with coding agents on the DB and, if you try Ardent (it's free to get started), what worked, what broke, and what’s missing.

S-100 Virtual Workbench

2026-05-13 @ 15:52:00 Points: 117 Comments: 25

Setting up a free *.city.state.us locality domain (2025)

2026-05-13 @ 14:45:18 Points: 552 Comments: 172

The US is winning the AI race where it matters most: commercialization

2026-05-13 @ 13:53:53 Points: 196 Comments: 524

The Emacsification of Software

2026-05-13 @ 07:06:14 Points: 280 Comments: 185

Twin brothers wipe 96 government databases minutes after being fired

2026-05-12 @ 22:28:07 Points: 405 Comments: 305

Show HN: Needle: We Distilled Gemini Tool Calling into a 26M Model

2026-05-12 @ 18:03:11 Points: 667 Comments: 188

We were always frustrated by how little effort goes into building agentic models that run on budget phones, so we ran some investigations that led to an observation: agentic experiences are built upon tool calling, and massive models are overkill for it. Tool calling is fundamentally retrieval-and-assembly (match query to tool name, extract argument values, emit JSON), not reasoning. Cross-attention is the right primitive for this, and FFN parameters are wasted at this scale.
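
To make the "retrieval-and-assembly" framing concrete, here is a hand-written toy version of the three steps (match query to tool name, extract argument values, emit JSON). It only illustrates the shape of the task; it is not how Needle works, and the tool catalog, keywords, and regex patterns are invented.

```python
import json
import re

# Hypothetical tool catalog; names, keywords, and argument patterns are invented.
TOOLS = {
    "set_timer": {
        "keywords": {"timer", "countdown"},
        "args": {"minutes": r"(\d+)\s*min"},
    },
    "send_message": {
        "keywords": {"text", "message", "tell"},
        "args": {"recipient": r"(?:to|tell)\s+(\w+)"},
    },
}

def call_tool(query):
    """Pick the tool with the most keyword overlap, extract its arguments, return JSON."""
    words = set(re.findall(r"\w+", query.lower()))
    # Step 1: retrieval, match the query to a tool name.
    name = max(TOOLS, key=lambda t: len(TOOLS[t]["keywords"] & words))
    # Step 2: extraction, copy argument values out of the query text.
    args = {}
    for arg, pattern in TOOLS[name]["args"].items():
        match = re.search(pattern, query, flags=re.IGNORECASE)
        if match:
            args[arg] = match.group(1)
    # Step 3: assembly, emit the call as JSON.
    return json.dumps({"tool": name, "arguments": args}, sort_keys=True)

timer_call = call_tool("Set a timer for 10 minutes")
message_call = call_tool("Send a message to Sam")
```

Every value in the output is either a tool name from the catalog or a span copied from the query, which is the sense in which the task is assembly rather than reasoning. A learned model replaces the keyword overlap and regexes with attention over the query and the tool descriptions.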

Simple Attention Networks: the entire model is just attention and gating, no MLPs anywhere. Needle is an experimental run for single-shot function calling for consumer devices (phones, watches, glasses...).

Training:
- Pretrained on 200B tokens across 16 TPU v6e (27 hours)
- Post-trained on 2B tokens of synthesized function-calling data (45 minutes)
- Dataset synthesized via Gemini with 15 tool categories (timers, messaging, navigation, smart home, etc.)
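
For a rough sense of scale, the training figures above imply the following throughput. This is back-of-the-envelope arithmetic that assumes the quoted durations are wall-clock time; the post does not say how many chips were used for post-training, so only the aggregate rate is computed there.

```python
# Figures quoted in the post.
pretrain_tokens = 200e9
pretrain_hours = 27
pretrain_chips = 16

posttrain_tokens = 2e9
posttrain_minutes = 45

# Aggregate and per-chip pretraining throughput (assumes wall-clock hours).
pretrain_tps = pretrain_tokens / (pretrain_hours * 3600)
pretrain_tps_per_chip = pretrain_tps / pretrain_chips

# Aggregate post-training throughput (chip count not stated in the post).
posttrain_tps = posttrain_tokens / (posttrain_minutes * 60)

print(f"pretraining:   {pretrain_tps:,.0f} tokens/s total, "
      f"{pretrain_tps_per_chip:,.0f} tokens/s per chip")
print(f"post-training: {posttrain_tps:,.0f} tokens/s total")
```

That works out to roughly 2.06M tokens/s in aggregate for pretraining, or about 129K tokens/s per chip, and about 741K tokens/s for post-training.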

You can test it right now and finetune on your Mac/PC: https://github.com/cactus-compute/needle

The full writeup on the architecture is here: https://github.com/cactus-compute/needle/blob/main/docs/simp...

We found that the "no FFN" finding generalizes beyond function calling to any task where the model has access to external structured knowledge (retrieval-augmented generation, tool use). The model doesn't need to memorize facts in FFN weights if the facts are provided in the input. Experimental results to be published.

While it beats FunctionGemma-270M, Qwen-0.6B, Granite-350M, LFM2.5-350M on single-shot function calling, those models have more scope/capacity and excel in conversational settings. We encourage you to test on your own tools via the playground and finetune accordingly.

This is part of our broader work on Cactus (https://github.com/cactus-compute/cactus), an inference engine built from scratch for mobile, wearables and custom hardware. We wrote about Cactus here previously: https://news.ycombinator.com/item?id=44524544

Everything is MIT licensed. Weights: https://huggingface.co/Cactus-Compute/needle GitHub: https://github.com/cactus-compute/needle

Heritability of human life span is ~50% when heritability is redefined

2026-05-12 @ 15:30:05 Points: 95 Comments: 52

AEPs: API Enhancement Proposals

2026-05-12 @ 12:58:39 Points: 14 Comments: 5

Can a Language Model Paint?

2026-05-12 @ 07:44:19 Points: 24 Comments: 8

Golden Testing a CAD Library

2026-05-11 @ 19:12:34 Points: 19 Comments: 3

Chess puzzle I found in my dad's old book

2026-05-11 @ 16:00:19 Points: 151 Comments: 40

Marco Polo: Finding a friend with only distance and motion

2026-05-11 @ 14:52:25 Points: 58 Comments: 8

How can Apple deal with the memory shortage?

2026-05-11 @ 13:49:23 Points: 93 Comments: 93

Linux gaming is faster because Windows APIs are becoming Linux kernel features

2026-05-10 @ 20:54:06 Points: 690 Comments: 446

Xs of Y – roguelike that names itself every run. Written in 4kLoC

2026-05-10 @ 03:32:59 Points: 181 Comments: 75

Reverting the incremental GC in Python 3.14 and 3.15

2026-05-09 @ 20:25:55 Points: 221 Comments: 88

'A Four-Eyed World' Review: The Story of Spectacles

2026-05-09 @ 18:02:23 Points: 8 Comments: 0

A History of IDEs at Google

2026-05-09 @ 11:14:18 Points: 358 Comments: 240