Datasets

8+ datasets

Explore high-quality datasets for AI research and development

Showing 8 of 8 datasets

500K+ downloads this month

Text | Featured | Trending | 1.8TB

Nexus-Gen-Training-Dataset

by Nexus AI

Comprehensive training dataset for next-generation language models with diverse text sources

Tags: NLP, Language Model, Training
12.1K | 45.7K | 289
License: Apache 2.0
2025.07.28

Text | Featured | Trending | 340GB

CodeGen-Multilang-Dataset

by CodeAI

Large-scale code generation dataset covering 50+ programming languages with documentation

Tags: Code Generation, Programming, Multilingual
11.2K | 38.9K | 245
License: MIT
2025.07.25

Text | Featured | Trending | 120GB

MathX-5M

by MathAI

5 million mathematical problems and solutions for AI training with step-by-step explanations

Tags: Mathematics, Problem Solving, Education
9.8K | 31.2K | 203
License: MIT
2025.07.07

Vision | Featured | Trending | 2.3TB

agibot_world_beta

by AgiBot Team

Large-scale robotics dataset for world understanding and manipulation tasks with comprehensive sensor data

Tags: Robotics, Computer Vision, Multimodal
8.5K | 23.1K | 156
License: MIT
2025.07.21

Audio | Featured | Trending | 2.1TB

AudioSet-Extended-2025

by Google Research

Extended version of AudioSet with additional categories and improved annotations for audio classification

Tags: Audio Processing, Classification, Sound Events
7.3K | 28.5K | 187
License: CC BY 4.0
2025.07.20

Text | Featured | 890GB

Chinese-Qwen3-235B-2507-Distill

by Alibaba

Distilled Chinese language dataset optimized for Qwen3 model training with high-quality annotations

Tags: Chinese, Language Model, Distillation
6.7K | 18.9K | 134
License: Custom
2025.07.27

Multimodal | 650GB

VisionQA-Multilingual

by Vision Team

Multilingual visual question answering dataset with rich annotations and cultural diversity

Tags: Vision-Language, QA, Multilingual
4.9K | 16.8K | 112
License: CC BY-SA 4.0
2025.07.15

Benchmark | 45GB

kontext-bench

by Research Lab

Benchmark dataset for contextual understanding and reasoning tasks across multiple domains

Tags: Benchmark, Reasoning, Evaluation
3.2K | 12.4K | 78
License: CC BY 4.0
2025.07.04
