Nvidia's new open-weights Nemotron 3 Super combines three different architectures to beat gpt-oss and Qwen in throughput

Multi-agent systems, designed to handle long-horizon tasks like software engineering or cybersecurity triage, can generate up to 15 times the token volume of standard chats, threatening their cost-effectiveness for enterprise tasks.

But today, Nvidia sought to help solve this problem with the release of Nemotron 3 Super, a 120-billion-parameter hybrid model, with weights posted on Hugging Face.

By merging disparate architectural philosophies (state-space models, transformers, and a novel "Latent" mixture-of-experts design), Nvidia is attempting to provide the specialized depth required for agentic workflows without the bloat typical of dense reasoning models. The weights are available for commercial use under a mostly open license.

Triple hybrid architecture

At the core of Nemotron 3 Super is a sophisticated architectural triad that balances memory efficiency with precision reasoning. The model utilizes a Hybrid Mamba-Transformer backbone, which interleaves Mamba-2 layers with strategic Transformer attention layers.

To understand the implications for enterprise production, consider the "needle in a haystack" problem. Mamba-2 layers act like a "fast-travel" highway system, handling the vast majority of sequence processing with linear-time complexity. This allows the model to maintain a massive 1-million-token context window without the memory footprint of the KV cache exploding. However, pure state-space models often struggle with associative recall.

To fix this, Nvidia strategically inserts Transformer attention layers as "global anchors," ensuring the model can precisely retrieve specific facts buried deep within a codebase or a stack of financial reports.
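The memory argument can be made concrete with a back-of-the-envelope sketch. The layer count, head sizes, and state size below are illustrative assumptions, not Nemotron 3 Super's published configuration; the point is only that a KV cache grows with every token while a state-space layer's recurrent state stays the same size.

```python
def kv_cache_bytes(seq_len, attn_layers, kv_heads, head_dim, bytes_per_value=2):
    """Memory to cache keys and values for every token in every attention layer."""
    # Factor of 2: one tensor for keys, one for values.
    return 2 * seq_len * attn_layers * kv_heads * head_dim * bytes_per_value


def ssm_state_bytes(ssm_layers, state_values_per_layer, bytes_per_value=2):
    """A state-space layer keeps a fixed-size state, independent of sequence length."""
    return ssm_layers * state_values_per_layer * bytes_per_value


# Hypothetical configuration, assumed purely for illustration.
TOTAL_LAYERS = 80
KV_HEADS = 8
HEAD_DIM = 128
STATE_VALUES = 1_000_000  # assumed recurrent-state size per Mamba-style layer


def footprint_gib(seq_len, attn_layers):
    """Total cache/state memory when `attn_layers` of the stack use attention."""
    ssm_layers = TOTAL_LAYERS - attn_layers
    total = kv_cache_bytes(seq_len, attn_layers, KV_HEADS, HEAD_DIM)
    total += ssm_state_bytes(ssm_layers, STATE_VALUES)
    return total / 2**30


for tokens in (8_000, 128_000, 1_000_000):
    all_attention = footprint_gib(tokens, attn_layers=TOTAL_LAYERS)
    hybrid = footprint_gib(tokens, attn_layers=8)  # a few attention "anchors"
    print(f"{tokens:>9,} tokens: all-attention {all_attention:8.1f} GiB, "
          f"hybrid {hybrid:8.1f} GiB")
```

Under these assumed numbers, the hybrid stack's footprint still grows with context length, but only through its handful of attention layers, which is the trade-off the interleaved design is described as making.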

Beyond the backbone, the model introduces Latent Mixture-of-Experts (LatentMoE). Traditional Mixture-of-Experts (MoE) designs route tokens to experts in their full hidden dimension, which creates a computational bottleneck as models scale. LatentMoE solves this by projecting tokens into a compressed space before routing them to specialists.

This "expert compression" allows the model to consult four times as many specialists for the exact same computational cost. This granularity is vital for agents that must switch between Python syntax, SQL logic, and conversational reasoning within a single turn.
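A rough cost model shows where a "four times as many specialists" figure could come from. This is a sketch under stated assumptions, not Nvidia's implementation: it assumes each expert is a simple two-matrix feed-forward block, and the sizes `D_MODEL`, `D_FF`, and the 4x compression ratio are hypothetical values chosen for illustration.

```python
def expert_flops(d_in, d_ff):
    """Approximate FLOPs for one token through one two-matrix feed-forward expert."""
    # Up-projection (d_in x d_ff) plus down-projection (d_ff x d_in),
    # counting each multiply-add as 2 FLOPs.
    return 2 * (d_in * d_ff + d_ff * d_in)


def standard_moe_flops(d_model, d_ff, active_experts):
    """Experts operate directly on the full hidden dimension."""
    return active_experts * expert_flops(d_model, d_ff)


def latent_moe_flops(d_model, d_latent, d_ff, active_experts):
    """Experts operate on a compressed latent vector."""
    # Shared projection into the latent space and back out, paid once per token.
    projection = 2 * (d_model * d_latent + d_latent * d_model)
    return projection + active_experts * expert_flops(d_latent, d_ff)


# Hypothetical sizes, assumed for illustration only.
D_MODEL = 4096
D_FF = 2048
D_LATENT = D_MODEL // 4  # assumed 4x compression

baseline = standard_moe_flops(D_MODEL, D_FF, active_experts=4)
latent = latent_moe_flops(D_MODEL, D_LATENT, D_FF, active_experts=16)

print(f"standard MoE,  4 experts: {baseline:,} FLOPs/token")
print(f"latent MoE,   16 experts: {latent:,} FLOPs/token")
print(f"ratio latent / standard : {latent / baseline:.3f}")
```

In this toy model, 16 latent experts cost the same as 4 full-width experts, and the shared projections add a fixed overhead whose relative size depends entirely on the assumed dimensions.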

Further accelerating the model is Multi-Token Prediction (MTP). While standard models predict a single next token, MTP predicts several future tokens simultaneously. This serves as a "built-in draft model," enabling native speculative decoding that can deliver up to 3x wall-clock speedups for structured generation tasks like code or tool calls.
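The draft-then-verify idea behind speculative decoding can be sketched with toy stand-ins. Nothing here reflects Nemotron's actual MTP head: `target_next` and `draft_tokens` are hypothetical functions invented for this example, and the "models" are simple arithmetic rules. The sketch shows only that accepting a run of correct guesses per verification pass yields the same output in fewer passes of the full model.

```python
VOCAB = 50  # toy vocabulary size (assumption)


def target_next(context):
    """Stand-in for the full model: a deterministic next token for a context."""
    return (sum(context) * 31 + len(context)) % VOCAB


def draft_tokens(context, k):
    """Stand-in for a multi-token head: guesses k future tokens, sometimes wrongly."""
    guesses, ctx = [], list(context)
    for _ in range(k):
        tok = target_next(ctx)
        if len(ctx) % 5 == 0:  # inject an occasional wrong guess
            tok = (tok + 1) % VOCAB
        guesses.append(tok)
        ctx.append(tok)
    return guesses


def plain_decode(prompt, num_new):
    """Baseline: one full-model pass per generated token."""
    out = list(prompt)
    for _ in range(num_new):
        out.append(target_next(out))
    return out[len(prompt):]


def speculative_decode(prompt, num_new, k):
    """Emit the target's tokens, checking k drafted guesses per verification pass."""
    out = list(prompt)
    passes = 0
    while len(out) - len(prompt) < num_new:
        guesses = draft_tokens(out, k)
        passes += 1  # a real system would score all k guesses in one batched pass
        for guess in guesses:
            correct = target_next(out)
            out.append(correct)  # always keep the target's own token
            if guess != correct or len(out) - len(prompt) >= num_new:
                break  # first mismatch ends this round
    return out[len(prompt):], passes


tokens, passes = speculative_decode([1, 2, 3], num_new=40, k=4)
assert tokens == plain_decode([1, 2, 3], 40)  # identical output
print(f"40 tokens in {passes} verification passes instead of 40")
```

How much wall-clock time this saves in practice depends on how often the drafted tokens are accepted, which is why the gains are described as largest for predictable, structured output such as code or tool calls.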

The Blackwell advantage

For enterprises, the most significant technical leap in Nemotron 3 Super is its optimization for the Nvidia Blackwell GPU platform. By pre-training natively in NVFP4 (4-bit floating point), Nvidia has achieved a breakthrough in production efficiency.

On Blackwell, the model delivers 4x faster inference than 8-bit models running on the previous Hopper architecture, with no loss in accuracy.
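One part of the appeal of 4-bit weights is simple arithmetic. The sketch below estimates weight storage for a 120-billion-parameter model at different precisions. It deliberately ignores scaling-factor overhead, activations, and the KV cache, and it is not the source of the 4x speed figure above.

```python
PARAMS = 120e9  # parameter count quoted in the article


def weight_gigabytes(num_params, bits_per_weight):
    """Raw weight storage in GB, ignoring quantization scales and other overhead."""
    return num_params * bits_per_weight / 8 / 1e9


for name, bits in (("16-bit", 16), ("8-bit", 8), ("4-bit", 4)):
    print(f"{name:>6}: {weight_gigabytes(PARAMS, bits):.0f} GB")
# 16-bit: 240 GB
#  8-bit: 120 GB
#  4-bit: 60 GB
```

Halving the bits per weight halves the bytes that must be moved from memory for each generated token, which is one reason lower-precision formats tend to help inference throughput.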

In practical performance, Nemotron 3 Super is a specialized tool for agentic reasoning.

It currently holds the No. 1 position on the DeepResearch Bench, a benchmark measuring an AI's ability to conduct thorough, multi-step research across large document sets.

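| Category | Benchmark | Nemotron 3 Super | Qwen3.5-122B-A10B | GPT-OSS-120B |
|---|---|---|---|---|
| General Knowledge | MMLU-Pro | 83.73 | 86.70 | 81.00 |
| Reasoning | AIME25 (no tools) | 90.21 | 90.36 | 92.50 |
| Reasoning | HMMT Feb25 (no tools) | 93.67 | 91.40 | 90.00 |
| Reasoning | HMMT Feb25 (with tools) | 94.73 | 89.55 | n/a |
| Reasoning | GPQA (no tools) | 79.23 | 86.60 | 80.10 |
| Reasoning | GPQA (with tools) | 82.70 | n/a | 80.09 |
| Reasoning | LiveCodeBench (v5 2024-07↔2024-12) | 81.19 | 78.93 | 88.00 |
| Reasoning | SciCode (subtask) | 42.05 | 42.00 | 39.00 |
| Reasoning | HLE (no tools) | 18.26 | 25.30 | 14.90 |
| Reasoning | HLE (with tools) | 22.82 | n/a | 19.0 |
| Agentic | Terminal Bench (hard subset) | 25.78 | 26.80 | 24.00 |
| Agentic | Terminal Bench Core 2.0 | 31.00 | 37.50 | 18.70 |
| Agentic | SWE-Bench (OpenHands) | 60.47 | 66.40 | 41.9 |
| Agentic | SWE-Bench (OpenCode) | 59.20 | 67.40 | n/a |
| Agentic | SWE-Bench (Codex) | 53.73 | 61.20 | n/a |
| Agentic | SWE-Bench Multilingual (OpenHands) | 45.78 | n/a | 30.80 |
| Agentic | TauBench V2 (Airline) | 56.25 | 66.0 | 49.2 |
| Agentic | TauBench V2 (Retail) | 62.83 | 62.6 | 67.80 |
| Agentic | TauBench V2 (Telecom) | 64.36 | 95.00 | 66.00 |
| Agentic | TauBench V2 (Average) | 61.15 | 74.53 | 61.0 |
| Agentic | BrowseComp with Search | 31.28 | n/a | 33.89 |
| Agentic | BIRD Bench | 41.80 | n/a | 38.25 |
| Chat & Instruction Following | IFBench (prompt) | 72.56 | 73.77 | 68.32 |
| Chat & Instruction Following | Scale AI Multi-Challenge | 55.23 | 61.50 | 58.29 |
| Chat & Instruction Following | Arena-Hard-V2 | 73.88 | 75.15 | 90.26 |
| Long Context | AA-LCR | 58.31 | 66.90 | 51.00 |
| Long Context | RULER @ 256k | 96.30 | 96.74 | 52.30 |
| Long Context | RULER @ 512k | 95.67 | 95.95 | 46.70 |
| Long Context | RULER @ 1M | 91.75 | 91.33 | 22.30 |
| Multilingual | MMLU-ProX (avg over langs) | 79.36 | 85.06 | 76.59 |
| Multilingual | WMT24++ (en→xx) | 86.67 | 87.84 | 88.89 |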

Nemotron 3 Super also demonstrates significant throughput advantages, achieving up to 2.2x higher throughput than gpt-oss-120B and 7.5x higher than Qwen3.5-122B in high-volume settings.

Custom ‘open’ license — commercial usage but with important caveats

The release of Nemotron 3 Super under the Nvidia Open Model License Agreement (updated October 2025) provides a permissive framework for enterprise adoption, though it carries distinct "safeguard" clauses that differentiate it from pure open-source licenses like MIT or Apache 2.0.

Key Provisions for Enterprise Users:

Commercial Usability: The license explicitly states that models are "commercially usable" and grants a perpetual, worldwide, royalty-free license to sell and distribute products built on the model.

Ownership of Output: Nvidia makes no claim to the outputs generated by the model; the responsibility for those outputs—and the ownership of them—rests entirely with the user.

Derivative Works: Enterprises are free to create and own "Derivative Models" (fine-tuned versions), provided they include the required attribution notice: "Licensed by Nvidia Corporation under the Nvidia Open Model License."

The "Red Lines":

The license includes two critical termination triggers that production teams must monitor:

Safety Guardrails: The license automatically terminates if a user bypasses or circumvents the model's "Guardrails" (technical limitations or safety hyperparameters) without implementing a "substantially similar" replacement appropriate for the use case.

Litigation Trigger: If a user institutes copyright or patent litigation against Nvidia alleging that the model infringes on their IP, their license to use the model terminates immediately.

This structure allows Nvidia to foster a commercial ecosystem while protecting itself from "IP trolling" and ensuring that the model isn't stripped of its safety features for malicious use.

‘The team really cooked’

The release has generated significant buzz within the developer community. Chris Alexiuk, a Senior Product Research Engineer at Nvidia, heralded the launch on X under his handle @llm_wizard as a "SUPER DAY," emphasizing the model's speed and transparency. "Model is: FAST. Model is: SMART. Model is: THE MOST OPEN MODEL WE'VE DONE YET," Alexiuk posted, highlighting the release of not just weights, but also 10 trillion tokens of training data and recipes.

The industry adoption reflects this enthusiasm:

Cloud and Hardware: The model is being deployed as an Nvidia NIM microservice, allowing it to run on-premises via the Dell AI Factory or HPE, as well as across Google Cloud, Oracle, and, shortly, AWS and Azure.

Production Agents: Companies like CodeRabbit (software development) and Greptile are integrating the model to handle large-scale codebase analysis, while industrial leaders like Siemens and Palantir are deploying it to automate complex workflows in manufacturing and cybersecurity.

As Kari Briski, Nvidia VP of AI Software, noted: "As companies move beyond chatbots and into multi-agent applications, they encounter… context explosion."

Nemotron 3 Super is Nvidia's answer to that explosion—a model that provides the "brainpower" of a 120B parameter system with the operational efficiency of a much smaller specialist. For the enterprise, the message is clear: the "thinking tax" is finally coming down.
