SmolLM2 and SmolVLM Models by Hugging Face: A Comprehensive Guide
As AI models grow in size and complexity, deploying them in resource-constrained environments can be challenging. Hugging Face’s SmolLM2 family of models aims to address this by offering efficient, lightweight language models that run locally without requiring high-end hardware or internet access. This guide will introduce you to the SmolLM2 ecosystem, explain its capabilities, and help you get started with these powerful compact models.
What is SmolLM2?
SmolLM2 is a family of compact language models available in three sizes: 135M, 360M, and 1.7B parameters. Developed by Hugging Face, these models are designed to run efficiently on-device while delivering impressive performance across a variety of natural language processing tasks. The SmolLM2 models represent significant advancements over their predecessors, particularly in areas like instruction-following, knowledge retrieval, reasoning, and mathematics.
Key Features of SmolLM2
- Compact yet Powerful: Despite their small sizes, SmolLM2 models demonstrate remarkable capabilities, with the 1.7B variant outperforming other models with fewer than 2B parameters.
- Efficient On-Device Operation: Specifically optimized for deployment on devices with limited computational resources, such as smartphones (an iPhone 15 with 6GB RAM can run these models). For more on this topic, check out our guide on edge AI computing.
- Multiple Size Options:
  - SmolLM2-135M: Ultra-lightweight model for basic text tasks
  - SmolLM2-360M: Balanced model for general use
  - SmolLM2-1.7B: Most capable variant with advanced reasoning abilities
- Advanced Training: The models were trained on an impressive 11 trillion tokens using diverse, high-quality datasets including FineWeb-Edu, DCLM, The Stack, and specialized mathematics and coding datasets. Effective data cleaning techniques were essential to achieving this quality.
- Instruction-tuned Variants: All three sizes have instruction-tuned versions optimized for assistant-like interactions, with the 1.7B version supporting tasks like text rewriting, summarization, and function calling.
For more information, check the official SmolLM2 models collection and the technical paper.
The SmolLM2 Ecosystem
The SmolLM2 ecosystem has grown well beyond the core language models. It now encompasses:
SmolLM2 Models
The core language models available in multiple sizes and variants:
- Base models: Fundamental versions trained on general text data
- Instruct models: Fine-tuned versions optimized for following instructions and chat
- Quantized versions: Further optimized models for even more efficient deployment
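If you prefer to quantize at load time rather than download separate files, one option is the Transformers bitsandbytes integration. This is a minimal sketch of that approach, not the official pre-quantized checkpoints (those ship as GGUF and ONNX files, covered later in this guide); it assumes a CUDA GPU and the bitsandbytes package:
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
checkpoint = "HuggingFaceTB/SmolLM2-1.7B-Instruct"
# Quantize the weights to 4-bit as they are loaded (needs bitsandbytes + CUDA);
# the official GGUF/ONNX releases are a separate, file-based alternative.
quant_config = BitsAndBytesConfig(load_in_4bit=True)
model = AutoModelForCausalLM.from_pretrained(checkpoint, quantization_config=quant_config)
tokenizer = AutoTokenizer.from_pretrained(checkpoint)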
SmolVLM (Vision-Language Model)
A new addition to the Smol family is SmolVLM, a compact multimodal model that can:
- Process both images and text
- Perform visual question answering
- Generate image descriptions
- Create visual stories
- Handle multiple images in a single conversation
These capabilities make SmolVLM particularly valuable for computer vision applications, such as image recognition and classification systems, in resource-constrained environments.
High-Quality Datasets
The ecosystem includes several datasets developed specifically for training small but powerful models:
- SmolTalk: An instruction-tuning dataset for creating conversational capabilities
- FineMath: A specialized mathematics pretraining dataset
- FineWeb-Edu: Educational content for pretraining
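All three datasets are published on the Hugging Face Hub, so you can inspect them with the datasets library. A minimal sketch; note that the config names below ("all", "finemath-4plus") are assumptions to verify against each dataset card:
from datasets import load_dataset
# Config names and field layout should be checked on the dataset cards
smoltalk = load_dataset("HuggingFaceTB/smoltalk", "all", split="train")
finemath = load_dataset("HuggingFaceTB/finemath", "finemath-4plus", split="train")
print(smoltalk[0]["messages"])  # one chat-formatted training conversation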
Local Inference Tools
The repository provides tools for running inference locally across different platforms:
- smollm_local_inference: For text-based models
- smolvlm_local_inference: For vision-language models
What are Smol-tools?
Smol-tools is a collection of lightweight, AI-powered tools that enhance the utility of SmolLM2 and other small language models. Built on llama.cpp, smol-tools enables a range of NLP tasks without requiring internet access or GPUs, making it ideal for local, offline applications.
Key Features of Smol-tools
The smol-tools suite includes:
- SmolSummarizer: Quickly generates concise summaries of text while retaining essential points, and can answer follow-up questions based on the summarized content.
- SmolRewriter: Enhances text readability by rephrasing content to appear more professional while preserving its original intent, ideal for email or message drafting.
- SmolAgent: An AI agent designed to perform tasks by integrating external tools. It includes:
  - Weather Lookup: Provides weather updates for specified locations.
  - Random Number Generation: Offers random numbers for quick testing or interactive applications.
  - Current Time: Returns the current time.
  - Web Browser Control: Supports basic browser control for web-based tasks.
  - Extensible Tool System: Developers can integrate additional tools into SmolAgent for custom functionality (a hypothetical sketch of this pattern follows below).
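The repository defines its own tool interface, so treat the following as a hypothetical illustration of the extensible-tool pattern rather than smol-tools' actual API: tools register themselves under a name, and the agent dispatches a model-emitted tool name to the matching function.
from typing import Callable, Dict
from datetime import datetime
# Hypothetical tool registry -- smol-tools' real interface may differ
TOOLS: Dict[str, Callable[..., str]] = {}
def register_tool(name: str):
    """Register a function under a name the agent can invoke."""
    def wrapper(fn: Callable[..., str]) -> Callable[..., str]:
        TOOLS[name] = fn
        return fn
    return wrapper
@register_tool("current_time")
def current_time() -> str:
    return datetime.now().isoformat(timespec="seconds")
# The agent maps a tool name produced by the model to the registered function
print(TOOLS["current_time"]())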
Getting Started with SmolLM2
The SmolLM2 models are easily accessible through the Hugging Face Transformers library. Here's how to get started with these powerful compact models:
System Requirements
SmolLM2 models are designed to run on modest hardware:
- CPU Usage: All models can run on standard CPUs
- Memory Requirements:
  - SmolLM2-135M: ~500MB RAM
  - SmolLM2-360M: ~1GB RAM
  - SmolLM2-1.7B: ~4GB RAM
- Storage: Disk space scales with model size; fp16 weights take about 2 bytes per parameter (roughly 3.4GB for the 1.7B model), with quantized versions requiring less.
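You can verify these figures on your own hardware: Transformers models expose get_memory_footprint(), which reports the in-memory size of the loaded weights. A quick sketch:
from transformers import AutoModelForCausalLM
# Load the smallest variant and report its weight footprint in megabytes
# (activations during generation add further overhead on top of this)
model = AutoModelForCausalLM.from_pretrained("HuggingFaceTB/SmolLM2-135M")
print(f"{model.get_memory_footprint() / 1e6:.0f} MB")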
Installation and Basic Usage
Using Transformers Library
pip install transformers
Then in Python:
from transformers import AutoModelForCausalLM, AutoTokenizer
# Choose your preferred model size
checkpoint = "HuggingFaceTB/SmolLM2-1.7B-Instruct" # or other variants
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)
# For a simple chat interaction
messages = [{"role": "user", "content": "Write a short summary of the benefits of small language models."}]
input_text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(input_text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=500)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
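The generate() call above uses greedy decoding by default. For more natural-sounding replies you can enable sampling; the values below follow the spirit of the model card's example, though the best settings depend on your task:
# Sampled decoding often reads more naturally than greedy decoding
outputs = model.generate(
    **inputs,
    max_new_tokens=500,
    do_sample=True,
    temperature=0.2,
    top_p=0.9,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))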
Using SmolVLM for Vision Tasks
For multimodal tasks with SmolVLM:
from transformers import AutoProcessor, AutoModelForVision2Seq
from PIL import Image
import requests
# Load the model and processor
processor = AutoProcessor.from_pretrained("HuggingFaceTB/SmolVLM-Instruct")
model = AutoModelForVision2Seq.from_pretrained("HuggingFaceTB/SmolVLM-Instruct")
# Load and process an image
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/cat.jpg"
image = Image.open(requests.get(url, stream=True).raw)
# Prepare inputs
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "What's in this image?"}
        ]
    }
]
# Build the prompt with the chat template, then process text and image together
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=200)
print(processor.batch_decode(outputs, skip_special_tokens=True)[0])
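SmolVLM can also take several images in one conversation: add one {"type": "image"} placeholder per image and pass the images in matching order. A sketch, where image2 is assumed to be a second PIL image loaded the same way as above:
# One {"type": "image"} entry per image, images passed in the same order
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "image"},
            {"type": "text", "text": "What differs between these two images?"}
        ]
    }
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image, image2], return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=200)
print(processor.batch_decode(outputs, skip_special_tokens=True)[0])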
Local Deployment Options
The SmolLM2 repository provides tools for efficient local deployment:
- Web Demos: Try the models in your browser with the WebGPU demos.
- Optimized Formats (see the llama.cpp sketch after this list):
  - ONNX checkpoints for faster inference
  - GGUF versions compatible with llama.cpp
- GitHub Repository: The SmolLM GitHub repository contains code for:
  - Pre-training
  - Post-training optimization
  - Evaluation
  - Local inference
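To run a GGUF file from Python, one option is the llama-cpp-python bindings (pip install llama-cpp-python). A minimal sketch; the filename below is illustrative, so substitute whichever quantization you download from the Hub:
from llama_cpp import Llama
# Path to a downloaded GGUF file (illustrative filename)
llm = Llama(model_path="SmolLM2-1.7B-Instruct-Q4_K_M.gguf", n_ctx=2048)
out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Explain small language models in one paragraph."}],
    max_tokens=256,
)
print(out["choices"][0]["message"]["content"])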
Performance and Limitations of SmolLM2
SmolLM2 models show impressive performance for their size, outperforming other small language models (SLMs) with similar parameter counts. However, they do have some limitations to consider:
Performance Benchmarks
According to official evaluations, SmolLM2 models demonstrate strong capabilities:
- SmolLM2-135M outperforms other models with fewer than 200M parameters
- SmolLM2-360M surpasses all models with fewer than 500M parameters
- SmolLM2-1.7B leads performance among models with fewer than 2B parameters, including Phi-1.5 and MobileLLM-1.5B
On benchmarks like HellaSwag and ARC, the models show strong reasoning and common-knowledge capabilities, with the 1.7B model scoring 68.7 on HellaSwag and 60.5 on ARC.
Limitations
Despite their strengths, users should be aware of certain limitations:
- Language Support: SmolLM2 models primarily understand and generate content in English.
- Factual Accuracy: As with all language models, the generated content may not always be factually accurate, logically consistent, or free from biases present in the training data. These considerations are part of broader discussions on AI ethics and responsible development.
- Context Length: Base models have a 2048-token context window, which may be limiting for some applications (though this can be extended with long-context fine-tuning).
- Task Complexity: While capable of many tasks, very complex reasoning or specialized domain knowledge may still require larger models.
- Computational Ceiling: For extremely demanding enterprise applications, these models may eventually hit performance ceilings that larger models would not.
Applications of SmolLM2
The efficiency and performance of SmolLM2 models make them suitable for numerous practical applications:
Edge Computing
- Mobile Applications: Run AI capabilities directly on smartphones without cloud dependencies
- IoT Devices: Enable natural language interfaces on memory-constrained IoT devices
- Smart Home Systems: Power voice assistants and smart home controllers with local processing
As detailed in our edge AI computing guide, these on-device models are revolutionizing what’s possible with local processing.
Privacy-Focused Solutions
- Healthcare Applications: Process sensitive patient data locally without transmission to external servers
- Personal AI Assistants: Keep personal conversations and data on-device
- Enterprise Security: Enable NLP in high-security environments where data cannot leave local systems
Educational Tools
- Offline Learning Applications: Provide AI tutoring in areas with limited internet connectivity
- Language Learning Tools: Create interactive language exercises that run locally
- Coding Assistants: Offer programming help on lightweight development environments
Creative Applications
- Writing Assistance: Provide on-device text generation, summarization, and rewriting
- Content Creation: Support creative workflows with local AI tools
- Multimodal Experiences: With SmolVLM, enable vision-language applications locally
Deployment Examples
- Raspberry Pi Applications: Run inference on Raspberry Pi 4 with 4GB RAM
- Browser-Based Tools: Leverage WebGPU demos for client-side AI processing
- Offline Documentation Systems: Create smart documentation browsers that work without connectivity
Conclusion
SmolLM2 represents a significant advancement in making powerful AI capabilities accessible in resource-constrained environments. By offering multiple model sizes (135M, 360M, and 1.7B parameters) that deliver impressive performance while maintaining a small footprint, Hugging Face has created a solution that addresses the growing need for on-device AI.
The SmolLM2 ecosystem has expanded beyond just language models to include vision-language models, specialized datasets, and tools for local deployment. This comprehensive approach enables developers to implement sophisticated AI features in applications running on modest hardware, from smartphones to IoT devices.
What makes the SmolLM2 family particularly valuable is its balance of efficiency and capability. The models outperform others in their respective size categories across various benchmarks, while maintaining reasonable memory and processing requirements that make them suitable for local execution.
As edge AI continues to grow in importance—driven by privacy concerns, the need for offline functionality, and the desire to reduce cloud computing costs—compact yet powerful models like SmolLM2 will play an increasingly crucial role in democratizing access to AI technology.
Whether you’re building mobile applications, privacy-focused tools, educational resources, or creative assistants, SmolLM2 provides a practical foundation for implementing AI capabilities that run locally, respond quickly, and respect user privacy.