The Fundamental AI Research (FAIR) team at Meta recently released Chameleon, a mixed-modal AI model that can understand and generate mixed text and image content. In experiments rated by human judges, Chameleon's generated output was preferred over GPT-4 in 51.6% of trials, and over Gemini Pro in 60.4%.
Chameleon differs from many mixed image-text AI models, which use separate encoders and decoders for the two modalities. Instead, Chameleon uses a single token-based representation of both text and image, and is trained end-to-end on mixed sequences of both image and text. Meta trained two sizes of the model: Chameleon-7B, with seven billion parameters, and Chameleon-34B with 34 billion. Both models were pre-trained on over four trillion tokens of mixed text and image data, and then fine-tuned for alignment and safety on smaller datasets. In addition to human judges comparing the model's output to that of baseline models, Meta evaluated Chameleon-34B on several benchmarks, noting that Chameleon achieved state-of-the-art performance on visual question answering and image captioning tasks. According to Meta,
By quantizing images into discrete tokens and training on mixed-modal data from scratch, Chameleon learns to jointly reason over image and text in a way that is impossible with...models that maintain separate encoders for each modality. At the same time, Chameleon introduces novel techniques for stable and scalable training of early-fusion models, addressing key optimization and architectural design challenges that have previously limited the scale of such approaches...Chameleon also unlocks entirely new possibilities for multimodal interaction, as demonstrated by its strong performance on our new benchmark for mixed-modal open-ended QA.
The Meta team noted that it was "challenging" to train Chameleon when scaling above 8B model parameters or more than 1T dataset tokens, due to instabilities. The researchers had to make changes to the standard Transformer architecture to resolve these challenges; in particular, they found that because the model weights were shared across both input modalities, "each modality will try to compete with the other" by increasing its vector norms, eventually moving the norms outside the range of the 16-bit floating point representation used in the model. The solution was to apply additional normalization operations into the model architecture.
Chameleon Training and Inference. Image Source: Meta Research Paper
Autoregressive output generation with Chameleon also presented "unique" performance challenges. First, generating mixed-mode output requires a different decoding strategy for each mode, so output tokens must be copied from GPU to CPU for program control flow. Also, since the model can be asked to generate single-mode output (for example, text-only), this requires the model to have the ability to mask or ignore tokens of other modalities. To solve these problems, Meta implemented a custom inference pipeline for the model.
In a thread on X about the research, Chameleon co-author Armen Aghajanyan wrote:
One core learning we had with Chameleon is that the intended form of the modality is a modality in itself. Visual Perception and Visual Generation are two separate modalities and must be treated as such; hence, using discretized tokens for perception is wrong.
In another thread, AI researcher Nando de Freitas noted that Chameleon's architecture is similar to that of DeepMind's Gato model and wondered, "Is this going to be the ultimate approach for MIMO (multimodal input multimodal output models) or is there something else we should be trying?"
While Meta did not publicly release Chameleon, citing safety concerns, they released a modified version of the models that support mixed-mode inputs but cannot generate image output. These models are available on request for use under a "research-only license."