Crypto Tag News
© Crypto Tag NEWS. All Rights Reserved.

NVIDIA Unveils Nemotron-CC: A Trillion-Token Dataset for Enhanced LLM Training

Published May 9, 2025 by snifferius. Last updated: 2025/05/09 at 2:03 AM


Contents
  • Advancements in Data Curation
  • Innovative Pipeline Features
  • Impact on LLM Training
  • Getting Started with Nemotron-CC


Joerg Hiller
May 07, 2025 15:38

NVIDIA introduces Nemotron-CC, a trillion-token dataset for large language models, integrated with NeMo Curator. This innovative pipeline optimizes data quality and quantity for superior AI model training.



NVIDIA has integrated its Nemotron-CC pipeline into NeMo Curator, its toolkit for curating high-quality datasets for large language models (LLMs). The Nemotron-CC dataset is a 6.3-trillion-token English-language collection built from Common Crawl and, according to NVIDIA, is intended to significantly improve LLM accuracy.

Advancements in Data Curation

The Nemotron-CC pipeline addresses a limitation of traditional data curation methods, whose heuristic filtering often discards potentially useful data. By combining classifier ensembling with synthetic data rephrasing, the pipeline generates 2 trillion tokens of high-quality synthetic data and recovers up to 90% of the content lost to filtering.

Innovative Pipeline Features

The pipeline’s data curation process begins with HTML-to-text extraction using tools such as jusText, followed by language identification with FastText. It then applies deduplication to remove redundant data, using NVIDIA RAPIDS libraries for efficient processing. Next come 28 heuristic filters to ensure data quality, along with a PerplexityFilter module for further refinement.
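To make the deduplication and heuristic filtering stages more concrete, here is a minimal sketch in plain Python. It does not use the NeMo Curator API or RAPIDS; the function names, the two example filters, and their thresholds are illustrative assumptions, standing in for the 28 filters the pipeline is said to apply.

```python
import hashlib
import re
from typing import Callable, Iterable


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivially different copies hash alike."""
    return re.sub(r"\s+", " ", text).strip().lower()


def exact_dedup(docs: Iterable[str]) -> list[str]:
    """Keep the first occurrence of each document, compared by content hash."""
    seen: set[str] = set()
    kept: list[str] = []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept


# Hypothetical heuristic filters: each returns True when the document passes.
def min_word_count(doc: str, minimum: int = 5) -> bool:
    return len(doc.split()) >= minimum


def max_symbol_ratio(doc: str, limit: float = 0.3) -> bool:
    if not doc:
        return False
    symbols = sum(1 for ch in doc if not (ch.isalnum() or ch.isspace()))
    return symbols / len(doc) <= limit


# Assumed filter set for illustration only.
HEURISTIC_FILTERS: list[Callable[[str], bool]] = [min_word_count, max_symbol_ratio]


def curate(docs: Iterable[str]) -> list[str]:
    """Deduplicate first, then keep only documents that pass every filter."""
    return [d for d in exact_dedup(docs) if all(f(d) for f in HEURISTIC_FILTERS)]
```

Calling `curate` on a list of raw strings returns the surviving documents in their original order. A production pipeline would typically add fuzzy deduplication and run on GPUs, which this sketch leaves out.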

Quality labeling is achieved through an ensemble of classifiers that assess and categorize documents into quality levels, facilitating targeted synthetic data generation. This approach enables the creation of diverse QA pairs, distilled content, and organized knowledge lists from the text.
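The idea of ensemble-based quality labeling can be sketched as follows. This is an illustration under stated assumptions, not NVIDIA's implementation: combining classifier scores by taking the maximum, the five-level bucketing, the threshold, and the task names in `TASKS_BY_QUALITY` are all hypothetical choices made for the example.

```python
from typing import Callable, Sequence

# A scorer is any callable that maps a document to a quality score in [0, 1].
Scorer = Callable[[str], float]


def ensemble_score(doc: str, scorers: Sequence[Scorer]) -> float:
    """Combine classifier scores by taking the maximum (an assumed rule)."""
    if not scorers:
        raise ValueError("at least one scorer is required")
    return max(scorer(doc) for scorer in scorers)


def quality_bucket(score: float, levels: int = 5) -> int:
    """Map a score in [0, 1] to an integer bucket in 0..levels-1."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must lie in [0, 1]")
    return min(int(score * levels), levels - 1)


# Hypothetical routing of quality levels to synthetic data generation tasks.
TASKS_BY_QUALITY = {
    "high": ["qa_pairs", "distill", "knowledge_list"],
    "low": ["rephrase"],
}


def plan_generation(
    doc: str, scorers: Sequence[Scorer], high_threshold: int = 3
) -> list[str]:
    """Choose generation tasks for a document based on its quality bucket."""
    bucket = quality_bucket(ensemble_score(doc, scorers))
    key = "high" if bucket >= high_threshold else "low"
    return TASKS_BY_QUALITY[key]
```

In practice each scorer would wrap a trained classifier. The sketch only shows how several scores could be reduced to one label that decides which kind of synthetic data to generate from a document.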

Impact on LLM Training

Training LLMs on the Nemotron-CC dataset yields significant improvements, according to NVIDIA. For instance, a Llama 3.1 model trained on a 1-trillion-token subset of Nemotron-CC scored 5.6 points higher on MMLU than models trained on traditional datasets. Furthermore, models trained over a long token horizon on data that included Nemotron-CC saw a 5-point boost in benchmark scores.

Getting Started with Nemotron-CC

The Nemotron-CC pipeline is available for developers aiming to pretrain foundation models or perform domain-adaptive pretraining across various fields. NVIDIA provides a step-by-step tutorial and APIs for customization, enabling users to optimize the pipeline for specific needs. The integration into NeMo Curator allows for seamless development of both pretraining and fine-tuning datasets.

For more information, visit the NVIDIA blog.
