Hacker News

Latest

Alphabet Announces First Quarter 2026 Results

2026-04-29 @ 20:20:58 | Points: 92 | Comments: 100

Pentagon spending on drones jumps from $225M to $55B in one year

2026-04-29 @ 20:03:51 | Points: 62 | Comments: 68

OpenTrafficMap

2026-04-29 @ 19:49:30 | Points: 97 | Comments: 21

Kyoto cherry blossoms now bloom earlier than at any point in 1,200 years

2026-04-29 @ 19:32:36 | Points: 160 | Comments: 46

HERMES.md: Anthropic bug causes $200 extra charge, refuses refund

2026-04-29 @ 18:54:31 | Points: 878 | Comments: 342

Copy Fail – CVE-2026-31431

2026-04-29 @ 18:13:53 | Points: 384 | Comments: 186

Ramp's Sheets AI Exfiltrates Financials

2026-04-29 @ 17:44:21 | Points: 84 | Comments: 27

Laws of UX

2026-04-29 @ 16:58:13 | Points: 133 | Comments: 24

Virtualisation on Apple Silicon Macs is different

2026-04-29 @ 16:51:38 | Points: 51 | Comments: 15

Maryland becomes first state to ban surveillance pricing in grocery stores

2026-04-29 @ 16:50:01 | Points: 200 | Comments: 143

At Protocol: Building the Social Internet

2026-04-29 @ 16:20:56 | Points: 50 | Comments: 25

FastCGI: 30 years old and still the better protocol for reverse proxies

2026-04-29 @ 16:16:39 | Points: 196 | Comments: 46

Show HN: A new benchmark for testing LLMs for deterministic outputs

2026-04-29 @ 16:01:51 | Points: 45 | Comments: 19

A model may return exactly the schema you asked for, but with hallucinated values, such as an `invoice_date` that is off by two months or a transcript array in the wrong order. The JSON is valid, but the values are not.

Structured output today is a big part of using LLMs, especially when building deterministic workflows.

Current structured output benchmarks (e.g., JSONSchemaBench) only measure how often the output conforms to the JSON Schema and its types, not whether the values inside the produced JSON are correct.
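To make that gap concrete, here is a minimal Python sketch. The schema, field names, and invoice values are all made up for illustration, and the hand-rolled `conforms` helper is a stand-in for a real JSON Schema validator. It is not taken from JSONSchemaBench or SOB.

```python
# Hypothetical, simplified schema: field name -> expected Python type.
# A real benchmark would use a full JSON Schema validator instead.
SCHEMA = {"invoice_id": str, "invoice_date": str, "total": float}


def conforms(output: dict, schema: dict) -> bool:
    """Return True if output has exactly the schema's keys with matching types."""
    return output.keys() == schema.keys() and all(
        isinstance(output[key], expected_type)
        for key, expected_type in schema.items()
    )


# Invented example data: the model's date is two months later than the truth.
ground_truth = {"invoice_id": "INV-001", "invoice_date": "2026-01-15", "total": 120.0}
model_output = {"invoice_id": "INV-001", "invoice_date": "2026-03-15", "total": 120.0}

schema_ok = conforms(model_output, SCHEMA)  # checks structure and types only
values_ok = model_output == ground_truth    # checks the actual content

print(schema_ok, values_ok)
```

The structural check passes while the value comparison fails, which is the case a schema-only pass rate cannot see.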

So we designed the Structured Output Benchmark (SOB), which addresses this by measuring the JSON Schema pass rate, type correctness, and value accuracy across three modalities: text, image, and audio.

In our test set, every record is paired with a JSON Schema and a ground-truth answer that was verified against the source context by a human reviewer and cross-checked by an LLM, so a missing or hallucinated value counts as wrong.
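One way such a value score could be computed is sketched below in Python. This is an assumption about the general approach, not SOB's actual scoring code: `flatten` and `value_accuracy` are hypothetical helpers, the record is invented, and exact equality is used where a real benchmark might normalize values first.

```python
_MISSING = object()  # sentinel for fields absent from the prediction


def flatten(value, path=""):
    """Yield (path, leaf) pairs for nested dicts and lists."""
    if isinstance(value, dict):
        for key, child in value.items():
            yield from flatten(child, f"{path}.{key}" if path else key)
    elif isinstance(value, list):
        for index, child in enumerate(value):
            yield from flatten(child, f"{path}[{index}]")
    else:
        yield path, value


def value_accuracy(predicted, truth) -> float:
    """Fraction of ground-truth leaf fields reproduced exactly (type and value)."""
    truth_leaves = dict(flatten(truth))
    predicted_leaves = dict(flatten(predicted))
    if not truth_leaves:
        return 1.0
    correct = 0
    for path, expected in truth_leaves.items():
        got = predicted_leaves.get(path, _MISSING)
        # Missing fields and wrong values both count as incorrect.
        if type(got) is type(expected) and got == expected:
            correct += 1
    return correct / len(truth_leaves)


# Invented record: the transcript is reordered and `meta.lang` is missing.
truth = {"speaker": "A", "transcript": ["hello", "world"], "meta": {"lang": "en"}}
predicted = {"speaker": "A", "transcript": ["world", "hello"]}

print(value_accuracy(predicted, truth))  # 0.25
```

Because list elements are compared by index, a transcript array in the wrong order is scored as wrong. Extra fields in the prediction are not penalized in this sketch; whether they should be is a separate design choice.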

Open source is doing well: GLM-4.7 comes in at number 2, right after GPT-5.4.

We noticed that the rankings shift across modalities: GLM-4.7 leads on text, Gemma-4-31B leads on images, and Gemini-2.5-Flash leads on audio.

For example, GPT-5.4 ranks 3rd on text but 9th on images.

Model size is not a predictor, either: Qwen3.5-35B and GLM-4.7 beat GPT-5 and Claude-Sonnet-4.6 on Value Accuracy. Phi-4 (14B) beats GPT-5 and GPT-5-mini on text.

Structured hallucinations are the hardest bugs to catch. The values are type-correct, schema-valid, and plausible, so they slip through most guardrails. For example, in one audio record the ground truth is `"target_market_age": "15 to 35 years"`, and a model returns `"25 to 35"`. This is invisible without field-level checks.
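A field-level check for that audio example might look like the following Python sketch. The `field_mismatches` helper is hypothetical and uses strict string equality, which is an assumption; a real checker might normalize units or phrasing before comparing.

```python
def field_mismatches(predicted: dict, truth: dict) -> dict:
    """Map each ground-truth field that is missing or wrong to (expected, got)."""
    return {
        key: (expected, predicted.get(key))
        for key, expected in truth.items()
        if key not in predicted or predicted[key] != expected
    }


truth = {"target_market_age": "15 to 35 years"}
predicted = {"target_market_age": "25 to 35"}

# Both values are strings, so a type check alone reports no problem.
type_ok = isinstance(predicted["target_market_age"], str)

mismatches = field_mismatches(predicted, truth)
print(type_ok, mismatches)
```

The type check passes, while the per-field comparison reports `target_market_age` together with the expected and returned values.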

Our goal is to be the best general model for deterministic tasks, and a key aspect of determinism is a controllable and consistent output structure. The first step toward making structured output better is to measure it and compare ourselves against the best.

Online age verification is the hill to die on

2026-04-29 @ 15:49:57 | Points: 639 | Comments: 417

Third Editor Fired in Elsevier's Citation Cartel Crackdown

2026-04-29 @ 15:45:25 | Points: 200 | Comments: 61

Cursor Camp

2026-04-29 @ 15:39:43 | Points: 463 | Comments: 85

Mistral Medium 3.5

2026-04-29 @ 15:17:46 | Points: 381 | Comments: 180

An open-source stethoscope that costs between $2.5 and $5 to produce

2026-04-29 @ 14:47:31 | Points: 157 | Comments: 70

Zed 1.0

2026-04-29 @ 14:34:19 | Points: 1361 | Comments: 431

How to Build the Future: Demis Hassabis [video]

2026-04-29 @ 14:05:42 | Points: 62 | Comments: 35

We need a federation of forges

2026-04-29 @ 14:00:59 | Points: 498 | Comments: 313

Letting AI play my game – building an agentic test harness to help play-testing

2026-04-29 @ 12:43:12 | Points: 117 | Comments: 23

Stardex Is Hiring a Founding Customer Success Lead

2026-04-29 @ 12:00:17 | Points: 1

GitHub – DOS 1.0: Transcription of Tim Paterson's DOS Printouts

2026-04-29 @ 11:25:12 | Points: 113 | Comments: 6

Soft launch of open-source code platform for government

2026-04-29 @ 09:14:37 | Points: 502 | Comments: 115

Why I still reach for Lisp and Scheme instead of Haskell

2026-04-29 @ 08:43:17 | Points: 131 | Comments: 29

Bugs Rust won't catch

2026-04-29 @ 02:19:11 | Points: 620 | Comments: 333

Before GitHub

2026-04-28 @ 21:17:40 | Points: 646 | Comments: 215

Ghostty is leaving GitHub

2026-04-28 @ 19:44:52 | Points: 3310 | Comments: 975

Postgres's lateral joins allow for quite the good eDSL

2026-04-27 @ 14:07:25 | Points: 31 | Comments: 1
