A (very) brief history of LMMs

Large Multimodal Model (LMM): A large-scale model consisting of multiple modalities, typically a mixture of image processing and generation models, combined with a Large Language Model (LLM). See Chip Huyen’s post for a great dive into the concepts: Multimodality and Large Multimodal Models (LMMs).

Update 10th July 2025

This is a quick addition to my article to call out that David Knott (CTO to the UK Government) has published an article in a similar vein, taking a slightly different and altogether very informative angle. It's well worth a read and you can access it here: AI – a catch up guide to early episodes | LinkedIn

Background

…not a great position to be in really, especially when the technological developments driving this chaos seem to be … best described as ‘incredibly powerful unreliable magic’

I’ve been publicly writing and discussing for some time now on the current general feeling of overwhelm faced by engineers and other tech-based roles. On one side we have a bunch of enraged people shouting that engineers are all going to lose their jobs. On another side we have a bunch of people shouting there’s nothing to worry about as the current wave of innovation is built on piles of hyped-up… erm… šŸ’©. And then amongst the baying crowds we have incredibly well respected people saying that this is indeed a technology revolution that we all need to invest substantial amounts of time and emotional energy in or we’ll all be left behind. Not a great position to be in really, especially when the technological developments driving this chaos seem to be – at least on the surface level – best described as ‘incredibly powerful unreliable magic’. Of course we can only be talking about AI.

It's at times like this that I feel more than ever that I really need an anchor… something to grab onto that is sure, steady, and unmoving. More often than not I tend to find just such a rock in the understanding that can only come from learning the history of how something has come to be. To that end I set about spending evenings wrapping my head around the current state of AI and the key developments that have led to this point. I wrote on this subject for similar reasons about 7 years agošŸ”—, but things have moved on so significantly since then that it warrants going back to basics again… you know… for the warm, comforting certainty it brings.

So here I present to you a very brief history of…

How we came to arrive in this world of Large Multimodal Models (LMMs)

Method of Least Squares (~1675)

Isaac Newton's work around this time on interpolation (finding values between known data points) and the method of finite differences, developed in support of astronomy, laid the groundwork for fitting curves to observations. The name 'method of least squares' itself came much later, coined by Adrien-Marie Legendre in 1805.

Linear Regression (~1805)

Again developed for astronomy, and first published by Adrien-Marie Legendre in 1805 (with Carl Friedrich Gauss claiming to have used it earlier), this is the same tool we use today in Excel's FORECAST.LINEAR. The now-famous 'slope-intercept' formula underpinning the technique, y = mx + b, was put to work fitting planetary orbits to observational data. (I have myself used this formula in implementing a Power Query M Linear Regression algorithm šŸ”—)
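To make the technique concrete, here is a minimal sketch of the closed-form least-squares fit in plain Python (an illustration of the maths, not my Power Query M implementation):

```python
# Ordinary least-squares fit for y = mx + b (illustrative sketch).
def linear_regression(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # The slope m minimises the sum of squared residuals.
    m = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
        sum((x - mean_x) ** 2 for x in xs)
    b = mean_y - m * mean_x
    return m, b

# Points lying exactly on y = 2x + 1 recover m = 2, b = 1.
m, b = linear_regression([0, 1, 2, 3], [1, 3, 5, 7])
```

Excel's FORECAST.LINEAR does exactly this fit under the hood before extrapolating.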

Neural Networks (1950s-1980s)

Early capability for pattern recognition in simple classification tasks. Think of a more flexible linear regression tool. This period included the practical development of backpropagation for improved training/learning. Unlike previous rule-based systems, this meant that NNs could learn patterns from data rather than requiring explicit programming.
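The 'learn patterns from data' idea can be shown with the smallest possible case: a single sigmoid neuron learning logical AND by gradient descent (the one-neuron special case of backpropagation; the weights and learning rate below are illustrative choices, not from any particular paper):

```python
import math

# A single sigmoid neuron learning the logical AND function from
# examples -- the pattern is learned, not explicitly programmed.
def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w1 = w2 = b = 0.0
lr = 1.0  # learning rate (illustrative)

for _ in range(5000):
    for (x1, x2), target in data:
        out = sigmoid(w1 * x1 + w2 * x2 + b)
        # Gradient of the squared error pushed back through the
        # sigmoid: the one-neuron case of backpropagation.
        grad = (out - target) * out * (1 - out)
        w1 -= lr * grad * x1
        w2 -= lr * grad * x2
        b -= lr * grad

predictions = [round(sigmoid(w1 * x1 + w2 * x2 + b)) for (x1, x2), _ in data]
```

After training, `predictions` matches the AND truth table, with no AND rule ever written down.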

Deep Neural Networks (DNNs) (2000s)

Enabled by better hardware and backpropagation improvements. Breakthrough use case: image classification (AlexNet 2012). Unlike shallow networks, multiple hidden layers allowed learning hierarchical representations of complex features.

Convolutional Neural Networks (CNNs) (1990s-2010s)

Specialised for spatial data processing. Main use: computer vision tasks. Unlike fully-connected networks, they used local connectivity and weight sharing, drastically reducing parameters while preserving spatial structure.
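Weight sharing is easiest to see in one dimension: the same tiny kernel slides across the whole input, so the layer has a handful of parameters rather than one per input position (a toy sketch, not a full CNN):

```python
# A 1-D convolution: one small kernel (shared weights) slides over
# the input, so 2 parameters cover every position in the signal.
def conv1d(signal, kernel):
    k = len(kernel)
    return [sum(kernel[j] * signal[i + j] for j in range(k))
            for i in range(len(signal) - k + 1)]

# A difference kernel responds only where the signal changes --
# a 1-D analogue of the edge detectors CNNs learn for images.
edges = conv1d([0, 0, 0, 5, 5, 5], [-1, 1])
```

The output spikes only at the step in the signal, and spatial structure (where the step is) is preserved.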

Recurrent Neural Networks/Long Short-Term Memory (RNNs/LSTMs) (1990s-2010s)

Designed for sequential data processing. Primary use: language modelling, speech recognition. Unlike feedforward networks, they maintained hidden states across time steps, enabling memory of previous inputs in sequences.
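The 'memory' amounts to a hidden state carried forward step by step. A single-unit toy recurrence (weights chosen purely for illustration) shows it:

```python
import math

# A minimal recurrent step: hidden state h carries a summary of
# everything seen so far through the sequence.
w_in, w_rec, bias = 1.0, 0.5, 0.0  # toy weights, illustrative only

def rnn_step(h, x):
    return math.tanh(w_in * x + w_rec * h + bias)

h = 0.0
for x in [1.0, 0.0, 0.0]:
    h = rnn_step(h, x)
# Even though the later inputs are zero, h stays non-zero:
# the network 'remembers' the earlier 1.0.
```

Note how the memory also decays each step; it was exactly this fading over long sequences that LSTMs were designed to fix.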

Attention Mechanism (2014-2015)

Allowed models to focus on relevant parts of input sequences. Unlike RNNs that processed sequences step-by-step, attention could directly access any position, solving long-range dependency issues.
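The core operation is small enough to sketch in full: score a query against every key, softmax the scores into weights, and take a weighted mix of the values (a sketch of scaled dot-product attention, not any paper's complete architecture):

```python
import math

# Scaled dot-product attention over a toy sequence.
def attention(query, keys, values):
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    # Softmax turns scores into weights that sum to 1.
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # The output is a weighted mix of the values -- any position can
    # be attended to directly, regardless of distance.
    return [sum(w * v[i] for w, v in zip(weights, values))
            for i in range(len(values[0]))]

keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
out = attention([1.0, 0.0], keys, values)
```

Because every position is scored in one shot, there is no step-by-step chain for information to fade along, which is what resolves the long-range dependency problem.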

Generative Adversarial Networks (GANs) (2014)

Adversarial training for generative tasks. Key use: high-quality image synthesis. Unlike previous generative models that maximised likelihood directly, they used a minimax game between generator and discriminator networks.

Transformers (2017)

“Attention is All You Need” paper revolutionised Natural Language Processing (NLP). Main use: language understanding and generation. Unlike RNNs, they removed recurrence entirely, using only attention for both encoding and decoding, enabling massive parallelisation.

Large Language Models (LLMs) (2018-2022)

Scaled transformers (GPT-3, PaLM) showed emergent capabilities. Unlike earlier transformers, massive scale (billions of parameters) and training data led to qualitative leaps in reasoning and few-shot learning abilities.

Large Multimodal Models (LMMs) (2021-present)

Combined vision and language understanding (Claude, CLIP, DALL-E, Flamingo). Current LMMs process multiple modalities simultaneously. Unlike previous models limited to single modalities, they learn unified representations across text, images, and other data types, enabling cross-modal understanding and generation.
