Nemotron 3 Super is now available on Amazon Bedrock as a fully managed, serverless model, joining the Nemotron Nano models that are already available in Amazon Bedrock.
With an open model like NVIDIA Nemotron on Amazon Bedrock, you can accelerate innovation and deliver solid business value without having to manage infrastructure complexities. You can power your generative AI applications with Nemotron, using its extensive features and tooling, through Amazon Bedrock's fully managed inference.
This post explores the technical features of the Nemotron 3 Super model and discusses potential application use cases. It also provides technical guidance to get started using this model for your generative AI applications within the Amazon Bedrock environment.
About Nemotron 3 Super
Nemotron 3 Super is a hybrid Mixture of Experts (MoE) model with leading compute efficiency and accuracy for multi-agent applications and specialized agentic AI systems. The model is released with open weights, datasets, and recipes so that developers can customize, improve, and deploy the model on their own infrastructure for better privacy and security.
Model overview:
- Architecture:
  - MoE with a hybrid Transformer-Mamba architecture.
  - Supports reasoning-token budgeting to provide better accuracy per generated token.
- Performance:
  - Highest throughput efficiency in its size category, up to 5 times that of previous Nemotron Super models.
  - Leading accuracy for reasoning and agentic tasks among leading open models, and 2x higher accuracy than the previous version.
  - Achieves high accuracy on key benchmarks, including AIME 2025, Terminal-Bench, SWE-bench Verified, multilingual evaluations, and RULER.
  - Multi-environment reinforcement learning (RL) training with NVIDIA NeMo gives the model leading accuracy across 10+ environments.
- Model size: 120B parameters, with 12B active parameters
- Context length: up to 256K tokens
- Model input: text
- Model output: text
- Languages: English, French, German, Italian, Japanese, Spanish and Chinese
Latent MoE
Nemotron 3 Super uses latent MoE, where experts operate on a shared latent representation before the output is projected back into token space. This approach allows the model to invoke up to 4 times more experts at the same inference cost, enabling deeper expertise around subtle semantic structures, domain abstractions, and multi-hop reasoning patterns.
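To make the idea concrete, here is a toy, illustrative sketch of latent-MoE routing. All dimensions, weights, and the routing rule below are made up for illustration; they are not the model's actual implementation.

```python
# Toy sketch of latent MoE: project tokens into a shared latent space,
# run the selected experts there, then project back to token space.
import numpy as np

rng = np.random.default_rng(0)
d_model, d_latent, n_experts, top_k = 16, 8, 8, 4

W_down = rng.standard_normal((d_model, d_latent)) * 0.1   # token -> latent
W_up = rng.standard_normal((d_latent, d_model)) * 0.1     # latent -> token
experts = [rng.standard_normal((d_latent, d_latent)) * 0.1 for _ in range(n_experts)]
router = rng.standard_normal((d_latent, n_experts)) * 0.1

def latent_moe(x):
    z = x @ W_down                        # shared latent representation
    logits = z @ router
    top = np.argsort(logits)[-top_k:]     # pick top-k experts for this token
    weights = np.exp(logits[top]) / np.exp(logits[top]).sum()
    out = sum(w * (z @ experts[i]) for w, i in zip(weights, top))
    return out @ W_up                     # project back to token space

y = latent_moe(rng.standard_normal(d_model))
print(y.shape)  # (16,)
```

Because the experts operate in the smaller latent space, each expert call is cheaper, which is what lets the model afford more expert activations per token.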
Multi-Token Prediction (MTP)
MTP enables the model to predict multiple future tokens in a single forward pass, significantly increasing throughput for long reasoning sequences and structured outputs. For planning, trajectory generation, extended chain of thought, or code generation, MTP reduces latency and improves agent responsiveness.
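The throughput argument can be sketched with simple arithmetic: if each forward pass emits k tokens instead of 1, generating n tokens takes roughly n/k passes. This is an idealized model that ignores verification and acceptance-rate overheads in real MTP decoding.

```python
# Idealized cost model for multi-token prediction (MTP).
import math

def passes_needed(n_tokens, tokens_per_pass):
    """Forward passes required to emit n_tokens at tokens_per_pass each."""
    return math.ceil(n_tokens / tokens_per_pass)

print(passes_needed(1024, 1))  # 1024 passes with next-token prediction
print(passes_needed(1024, 4))  # 256 passes if MTP emits 4 tokens per pass
```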
To learn more about the architecture of Nemotron 3 Super and how it is trained, see Introducing Nemotron 3 Super: An Open Hybrid Mamba Transformer MoE for Agentic Reasoning.
NVIDIA Nemotron 3 Super Use Cases
Nemotron 3 Super helps power a variety of use cases across different industries, including:
- Software development: Assisting with tasks such as code summarization.
- Finance: Speed up loan processing by extracting data, analyzing income patterns and detecting fraudulent operations, which can help reduce cycle time and risk.
- Cybersecurity: Triage issues, conduct in-depth malware analysis, and proactively hunt security threats.
- Search: Helps understand user intent to route requests to the right agents.
- Retail: Can help optimize inventory management and enhance in-store service with real-time, personalized product recommendations and support.
- Multi-agent workflows: Orchestrates task-specific agents (planning, tool use, validation, and domain execution) to automate complex, end-to-end business processes.
Get started with NVIDIA Nemotron 3 Super in Amazon Bedrock
Complete the following steps to test NVIDIA Nemotron 3 Super in Amazon Bedrock:
- On the Amazon Bedrock console, choose Chat/Text playground in the navigation pane (under the Test section).
- Choose Select model in the upper-left corner of the playground.
- Choose NVIDIA from the category list, then select Nemotron 3 Super.
- Choose Apply to load the model.
After completing the previous steps, you can immediately test the model. To truly showcase Nemotron 3 Super's capabilities, we will move beyond simple syntax and pose a complex engineering challenge. High-reasoning models excel at systems-level thinking, where they must balance architectural trade-offs, concurrency, and distributed state management.
Let's use the following prompt to design a globally distributed service:
"Design a distributed rate-limiting service in Python that must support 100,000 requests per second across multiple geographic regions.
1. Provide a high-level architectural strategy (e.g., Token Bucket vs. Fixed Window) and justify your choice for a global scale. 2. Write a thread-safe implementation using Redis as the backing store. 3. Address the 'race condition' problem when multiple instances update the same counter. 4. Include a pytest suite that simulates network latency between the app and Redis."
This prompt requires the model to work as a senior distributed-systems engineer would—reasoning about trade-offs, producing thread-safe code, anticipating failure modes, and validating everything with realistic tests, all in a single coherent response.
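As a hint of what a good answer covers, here is a minimal, in-process sketch of the Token Bucket strategy the prompt asks about. This is illustrative only, not the model's output; a production service would keep this state in Redis and use an atomic server-side script to avoid cross-instance race conditions.

```python
# Minimal single-process token bucket; thread-safe via a lock.
import threading
import time

class TokenBucket:
    def __init__(self, rate, capacity):
        self.rate = rate            # tokens replenished per second
        self.capacity = capacity    # maximum burst size
        self.tokens = capacity
        self.updated = time.monotonic()
        self.lock = threading.Lock()

    def allow(self):
        with self.lock:             # thread safety within one process
            now = time.monotonic()
            elapsed = now - self.updated
            self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
            self.updated = now
            if self.tokens >= 1:
                self.tokens -= 1
                return True
            return False

bucket = TokenBucket(rate=5, capacity=10)
print(sum(bucket.allow() for _ in range(20)))  # only ~the first 10 pass at once
```

In the distributed version the prompt asks for, the refill-and-decrement step must be atomic across many application instances, which is exactly the race condition the model is expected to address.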

Using AWS CLI and SDK
You can access the model programmatically using the model ID nvidia.nemotron-super-3-120b. The model supports both the InvokeModel and Converse APIs via the AWS Command Line Interface (AWS CLI) and the AWS SDKs. Additionally, it supports Amazon Bedrock's OpenAI-compatible APIs.
Run the following command to invoke the model directly from your terminal using the AWS CLI and the InvokeModel API:
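A minimal sketch of such a command follows. The chat-style request-body schema is an assumption; check the model documentation for the exact fields Nemotron 3 Super accepts.

```shell
# Write a sample request body. The field names here are assumed, not verified.
cat > request.json <<'EOF'
{
  "messages": [
    {"role": "user", "content": "Summarize the Token Bucket algorithm in two sentences."}
  ],
  "max_tokens": 1024,
  "temperature": 0.6
}
EOF

# Invoke only if the AWS CLI is present and credentials are configured;
# the model reply is written to output.json.
if aws sts get-caller-identity >/dev/null 2>&1; then
  aws bedrock-runtime invoke-model \
    --model-id nvidia.nemotron-super-3-120b \
    --body fileb://request.json \
    --content-type application/json \
    output.json
fi
```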
If you want to invoke the model through the AWS SDK for Python (Boto3), use the following script to send prompts to the model, in this case using the Converse API:
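A minimal sketch of such a script, assuming configured AWS credentials and a Region where the model is available; the helper function names are our own.

```python
# Sketch: calling Nemotron 3 Super via the Amazon Bedrock Converse API.
MODEL_ID = "nvidia.nemotron-super-3-120b"

def build_converse_request(prompt, max_tokens=2048, temperature=0.6):
    """Build keyword arguments for bedrock_runtime.converse()."""
    return {
        "modelId": MODEL_ID,
        "messages": [{"role": "user", "content": [{"text": prompt}]}],
        "inferenceConfig": {"maxTokens": max_tokens, "temperature": temperature},
    }

def invoke(prompt, region="us-east-1"):
    import boto3  # imported here so the payload helper stays dependency-free
    client = boto3.client("bedrock-runtime", region_name=region)
    response = client.converse(**build_converse_request(prompt))
    return response["output"]["message"]["content"][0]["text"]

request = build_converse_request("Design a distributed rate-limiting service in Python.")
print(request["modelId"])  # nvidia.nemotron-super-3-120b
```

Calling `invoke(...)` sends the prompt to the model and returns the generated text.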
To invoke the model through an Amazon Bedrock OpenAI-compatible Chat Completions endpoint, you can proceed as follows using the OpenAI SDK:
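A minimal sketch with the OpenAI SDK follows. The base URL pattern and the Bedrock API key environment variable are assumptions to verify against the Amazon Bedrock documentation for your Region.

```python
# Sketch: calling Amazon Bedrock's OpenAI-compatible Chat Completions
# endpoint with the OpenAI SDK.
import os

MODEL_ID = "nvidia.nemotron-super-3-120b"
REGION = "us-east-1"

def build_chat_request(prompt, max_tokens=2048):
    """Build an OpenAI-style ChatCompletion request payload."""
    return {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

def invoke(prompt):
    from openai import OpenAI  # imported lazily; requires the openai package
    client = OpenAI(
        # Assumed endpoint shape for Bedrock's OpenAI-compatible API.
        base_url=f"https://bedrock-runtime.{REGION}.amazonaws.com/openai/v1",
        # Assumed: an Amazon Bedrock API key exported in this variable.
        api_key=os.environ["AWS_BEARER_TOKEN_BEDROCK"],
    )
    resp = client.chat.completions.create(**build_chat_request(prompt))
    return resp.choices[0].message.content

print(build_chat_request("Hello")["model"])  # nvidia.nemotron-super-3-120b
```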
Conclusion
In this post, we showed you how to get started with NVIDIA Nemotron 3 Super on Amazon Bedrock for building next-generation agentic AI applications. By combining the model's advanced hybrid Transformer-Mamba architecture and latent MoE with Amazon Bedrock's fully managed, serverless infrastructure, organizations can now deploy intelligent, efficient applications at scale without the heavy burden of backend management. Are you ready to see what this model can do for your specific workflow?
- Try it now: Visit the Amazon Bedrock console to experiment with NVIDIA Nemotron 3 Super in the model playground.
- Build: Explore the AWS SDKs to integrate Nemotron 3 Super into your existing generative AI pipelines.
About the authors