Foundation Models

Foundation Models are large-scale machine learning models trained on vast amounts of data (often internet-scale) that can be adapted to a wide range of downstream tasks. The term was popularized by the Stanford Institute for Human-Centered AI (HAI) to describe a paradigm shift where a single model serves as the base (foundation) for many different applications, rather than building bespoke models for each specific task.

History: The Path to Generalization

  • The Task-Specific Era (Pre-2017): In the early days of NLP and computer vision, models were built for a single task. A translation model could not write poetry; an image classifier could not detect objects. Architectures like RNNs and LSTMs were powerful but difficult to parallelize and struggled with long-range dependencies.
  • The Transformer Revolution (2017): Google researchers published “Attention Is All You Need,” introducing the Transformer architecture. This allowed for massive parallelization during training, enabling models to ingest significantly more data.
  • The Pre-training Paradigm (2018-2020): Models like BERT (Google, 2018) and GPT (OpenAI, 2018) proved that pre-training on generic text and then “fine-tuning” on specific tasks yielded state-of-the-art results. GPT-3 (2020) demonstrated “few-shot learning,” showing that at sufficient scale, models could perform tasks they were never explicitly trained for, simply by being shown a few examples in the prompt (see the few-shot prompt sketch after this list).
  • The Chat Era (2022): The introduction of ChatGPT (GPT-3.5 tuned with reinforcement learning from human feedback, RLHF) solved the usability problem, aligning raw model capability with human intent and making these models accessible to the general public.
  • Native Multimodality: We are moving away from “bolting on” vision or audio modules to text models. New state-of-the-art models (like GPT-4o and Gemini 1.5) are trained natively on text, image, audio, and video, allowing for seamless reasoning across different media types.
  • Reasoning & “System 2” Thinking: Generative AI is moving beyond simple pattern matching (predicting the next word) to performing complex chain-of-thought reasoning. Models like OpenAI’s o1 series are designed to “think” and plan before generating an output, significantly improving performance on math, coding, and logical puzzles.
  • The Rise of Small Language Models (SLMs): While the frontier models grow larger, there is a counter-trend towards efficiency. High-quality small models (like Llama 3 8B, Phi-3, Gemma) are delivering GPT-3.5 level performance on consumer hardware, enabling privacy-preserving, on-device AI.
  • Massive Context Windows: The ability to process vast amounts of information in a single prompt has exploded, growing from 4k tokens to 1 million+ tokens (Gemini 1.5 Pro). For mid-sized datasets this reduces the need for complex RAG architectures, since entire codebases or books can fit directly into the model’s working memory (see the token-budget sketch after this list).
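
To make the few-shot idea concrete, here is a minimal sketch that assembles a prompt from a handful of worked examples followed by a new input. The task (sentiment labeling), the example reviews, and the labels are illustrative assumptions; the resulting string could be sent to any chat or completion endpoint.

```python
# Minimal few-shot prompt sketch: the model is never fine-tuned on this task;
# it is expected to infer the pattern (sentiment labeling) from the examples
# alone. The examples and labels below are made up for illustration.
examples = [
    ("The battery lasts all day and the screen is gorgeous.", "positive"),
    ("It stopped charging after two weeks.", "negative"),
    ("Does exactly what the box says, nothing more.", "neutral"),
]

new_input = "Setup took five minutes and it has worked flawlessly since."

prompt_lines = [
    "Classify the sentiment of each review as positive, negative, or neutral.",
    "",
]
for text, label in examples:
    prompt_lines.append(f"Review: {text}")
    prompt_lines.append(f"Sentiment: {label}")
    prompt_lines.append("")

prompt_lines.append(f"Review: {new_input}")
prompt_lines.append("Sentiment:")  # the model completes this final line

prompt = "\n".join(prompt_lines)
print(prompt)
```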
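
As a rough sanity check on the long-context claim, the sketch below estimates token counts with the common ~4 characters-per-token heuristic for English text. The corpus sizes and window sizes are illustrative assumptions; real tokenizers and page/line lengths will vary.

```python
# Back-of-the-envelope check: does an entire corpus fit in one prompt?
# Uses a rough heuristic of ~4 characters per token for English text;
# all corpus and window sizes below are illustrative assumptions.
CHARS_PER_TOKEN = 4

corpora = {
    "300-page book (~2,000 chars/page)": 300 * 2_000,
    "mid-sized codebase (50k lines, ~40 chars/line)": 50_000 * 40,
}

context_windows = {
    "4k-token window": 4_000,
    "128k-token window": 128_000,
    "1M-token window": 1_000_000,
}

for name, chars in corpora.items():
    tokens = chars // CHARS_PER_TOKEN
    fits = [w for w, limit in context_windows.items() if tokens <= limit]
    print(f"{name}: ~{tokens:,} tokens; fits in: {', '.join(fits) or 'none'}")
```

Both example corpora land in the hundreds of thousands of tokens, which is why they only fit once windows reach the 1M range.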

Competitive Vendors

  • OpenAI: The creator of the GPT series (GPT-4o, o1) and DALL-E, currently leading the proprietary model market.
  • Google DeepMind: Developer of the Gemini series, leveraging their vast TPU infrastructure and data ecosystem.
  • Anthropic: Founded by former OpenAI employees with a focus on AI safety, creators of the Claude ecosystem (Claude 3.5 Sonnet).
  • Meta AI: The champion of open innovation, releasing the Llama series of open-weights models that power the open-source ecosystem.
  • Mistral AI: A French lab known for efficient, high-performance open-weight models (Mistral 7B, Mixtral 8x7B) as well as proprietary enterprise models.
  • Cohere: Focused heavily on enterprise use cases, specifically RAG (retrieval-augmented generation) and embeddings.
  • Microsoft: While a major investor in OpenAI, Microsoft Research produces the Phi series, leading the market in efficient Small Language Models.
