by: Mitchell Wortsman, 24 Jan, 2023

We have trained a new ViT-G/14 CLIP model with OpenCLIP which achieves 80.1% zero-shot accuracy on ImageNet and 74.9% zero-shot image retrieval (Recall@5) on MS COCO. As of January 2023, this is the best open source CLIP model.

We believe this is interesting because:

  • CLIP models are useful for zero-shot classification, retrieval, and for guidance/conditioning in generative models (OpenCLIP is used in Stable Diffusion V2, and the third most downloaded model on HuggingFace is currently a CLIP model). The approach underlying CLIP, self-supervised learning on a large, heterogeneous dataset, has been shown to produce models which are more robust and fair.
  • Our new ViT-G model achieves the highest zero-shot ImageNet accuracy among models that use only naturally occurring image-text pairs as training data, and without explicit labels, pseudo-labels, or any pretrained image or text encoders.
  • Our training run utilized multiple new techniques, including FLIP to accelerate training and model soups to surpass 80% accuracy.

Main Results

The following results are with image resolution 224x224 except for CoCa which uses 576x576.

| Model name | Batch size | Samples seen | Text params | Image params | ImageNet top-1 | MS COCO image retrieval (Recall@5) | Flickr30k image retrieval (Recall@5) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| OpenAI CLIP L/14 | 32k | 13B | 123.65M | 303.97M | 75.4% | 61.0% | 87.0% |
| OpenCLIP H/14 | 79k | 32B (16 epochs of LAION-2B) | 354.0M | 632.08M | 78.0% | 73.4% | 94% |
| OpenCLIP G/14 | 160k | 32B + unmasked fine-tune (details below) | 694.7M | 1844.9M | 80.1%* | 74.9% | 94.9% |
| CoCa | 66k | 33B | 1100M | 1000M | 86.3%** | 74.2% | 95.7% |

* When using CuPL prompts instead of the standard prompts from OpenAI, the zero-shot accuracy is 80.3%. When evaluating at 280x280 and changing resize to squash, Ross Wightman found the model achieves 80.4%.

** In addition to natural language supervision, CoCa uses synthetic captions constructed with the labels from the JFT-3B dataset. 973 of the 1,000 ImageNet classes have a corresponding class in JFT (e.g., see here sec C.7.2).

Also see the figure below (figure code by Ross) and our analysis of scaling trends for OpenCLIP model here.

Released Checkpoints

We release the checkpoint through OpenCLIP and in the HuggingFace hub.

Notes on scaling up

To scale up model size while reducing compute we used Fast Language-Image Pre-training (FLIP) with patch dropout 0.5. Similar to Masked Autoencoders (MAE), FLIP drops out patches during training. FLIP also requires a short “unmasked tuning” phase, which we discuss in training notes below. In addition to reducing Giga multiply–accumulate operations (GMACs) for each forward/backward pass, FLIP allowed us to use a larger per-GPU batch size. Without FLIP, gradient accumulation was necessary to maintain a large batch size. Keeping batch size and number of GPUs consistent (at 160k and 512, respectively) but switching to unmasked fine-tuning resulted in a drop from 46.9 to 20.4 samples per second per GPU. For reference, OpenCLIP H/14 with global batch size 79k across 824 GPUs without patch dropout trained at 42 samples/s/GPU.
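The patch-dropout idea can be sketched in a few lines. This is a minimal illustration and not the OpenCLIP implementation: the function name `patch_dropout` and the string "tokens" are hypothetical, and a real vision transformer would apply this to embedding tensors (typically leaving any class token untouched). It assumes a 224x224 input with 14x14 patches, which gives a 16x16 grid of 256 patch tokens.

```python
import random

def patch_dropout(patch_tokens, drop_rate=0.5, rng=None):
    """Keep a random subset of patch tokens, preserving their original order.

    Hypothetical helper for illustration; tokens can be any Python objects.
    """
    rng = rng or random.Random()
    num_patches = len(patch_tokens)
    # Always keep at least one patch so the encoder has something to process.
    num_keep = max(1, int(num_patches * (1.0 - drop_rate)))
    keep_idx = sorted(rng.sample(range(num_patches), num_keep))
    return [patch_tokens[i] for i in keep_idx], keep_idx

# 224x224 image, 14x14 patches -> 16x16 grid = 256 patch tokens.
grid = 224 // 14
tokens = [f"patch_{i}" for i in range(grid * grid)]

kept, kept_idx = patch_dropout(tokens, drop_rate=0.5, rng=random.Random(0))
print(len(tokens), len(kept))  # 256 128
```

With a dropout of 0.5 the image encoder sees half as many tokens per image, which is where the savings in compute and activation memory come from.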

To scale up the batch size to 160k, we used gradient checkpointing and A100s with 80GB of VRAM. For the unmasked tuning portion, we also used gradient accumulation (see our implementation for the contrastive objective here). Finally, we used a 2x higher learning rate of 2e-3 compared to our experiments with batch size 80k. The combination of scaling up model, batch size, and learning rate resulted in training instability during the warmup phase. Accordingly, we increased warmup to 13k steps, trained with layer scale, and used AdamW beta2 0.95. All runs used AMP bfloat16, after previously switching from float16 in prior experiments with L/14 and H/14.

Training notes

Phase 1: Patch dropout

For phase 1 we trained ViT-G with patch dropout 0.5 on LAION-2B for 32B samples seen. We used batch size 160k, learning rate 2e-3, and a cosine decay schedule. After this phase the model reached 79.07 zero-shot top1 accuracy on ImageNet.
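To make the phase 1 schedule concrete, here is a rough sketch of a warmup-plus-cosine learning rate curve using the numbers reported in this post (peak LR 2e-3, 13k warmup steps, 32B samples at batch size 160k). Several details are assumptions: the warmup is taken to be linear, the decay is taken to end at zero, and the batch size is rounded to exactly 160,000, which gives 200,000 steps. The function name `lr_at_step` is hypothetical.

```python
import math

# Assumption: batch size rounded to exactly 160,000 samples per step.
TOTAL_STEPS = 32_000_000_000 // 160_000  # 200,000 optimizer steps

def lr_at_step(step, base_lr=2e-3, warmup_steps=13_000, total_steps=TOTAL_STEPS):
    """Linear warmup to base_lr, then cosine decay to zero (both assumed)."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))

for step in (0, 6_500, 13_000, 100_000, TOTAL_STEPS):
    print(f"step {step:>7}: lr = {lr_at_step(step):.2e}")
```

Under these assumptions the warmup covers 6.5% of the run, which is long compared with typical settings and reflects the instability described in the scaling notes.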

Training was mainly done on 512 to 760 A100s depending on availability. When changing the number of GPUs, local batch size was also modified so that the global batch size remained at 160k. When using 512 GPUs we set local batch size to 313 and observed roughly 24k samples per second or 46.9 samples/s/GPU. When using 760 GPUs we set local batch size 211 and observed roughly 33k samples per second or 43.4 samples/s/GPU.
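The arithmetic behind these configurations is easy to check. The short script below only recomputes the figures quoted above (GPU count, local batch size, and the rounded samples-per-second rates); the helper name `describe` is hypothetical.

```python
def describe(num_gpus, local_batch, samples_per_sec):
    """Return (global batch size, throughput per GPU) for one configuration."""
    global_batch = num_gpus * local_batch
    per_gpu = samples_per_sec / num_gpus
    return global_batch, per_gpu

# (GPUs, local batch size, approximate samples per second) from the text.
configs = [(512, 313, 24_000), (760, 211, 33_000)]

for num_gpus, local_batch, rate in configs:
    global_batch, per_gpu = describe(num_gpus, local_batch, rate)
    print(f"{num_gpus} GPUs x {local_batch} = {global_batch} global batch, "
          f"{per_gpu:.1f} samples/s/GPU")
# 512 GPUs x 313 = 160256 global batch, 46.9 samples/s/GPU
# 760 GPUs x 211 = 160360 global batch, 43.4 samples/s/GPU
```

Both settings land within about 0.3% of the nominal 160k global batch size.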

Phase 2: Unmasked tuning + Model soups

For phase 2 we followed FLIP in conducting a short unmasked tuning phase. We fell short of 80% in our first unmasked fine-tuning phase, reaching only 79.43%. So we tried twice more with different settings (described below) to obtain 79.45% and 79.2%, respectively. Next, we followed model soups and averaged the weights of three checkpoints produced by these runs to achieve our final accuracy of 80.1%. LIMoE and PaLI also used model soups for better pre-training.

For our first unmasked fine-tuning run we did not modify the learning rate schedule, but instead doubled the base LR and extended the number of iterations so that the run would proceed for an additional 2B samples seen. LR started at 3.8e-5. For the second run we used LR 5.5e-5 with a full cosine schedule (warmup for roughly 200M samples and a total of 4B samples). The third run had identical hyperparameters to the first but used the LAION-A subset of LAION-2B. LAION-A is a 900M subset of LAION-2B filtered with aesthetic V2 4.5+ and pHash deduplicated. Instead of waiting for the third run to complete, we used the checkpoint after approximately 700M samples which, when “souped” with the final checkpoints from the two preceding runs, already allowed us to surpass our goal of 80% accuracy. This individual checkpoint achieved 79.2%.
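The weight averaging behind a model soup can be sketched as follows. This is an illustration under stated assumptions, not the code used for the release: it assumes a uniform (equal-weight) average, represents each checkpoint as a plain dict mapping parameter names to lists of floats, and uses made-up toy values. The name `uniform_soup` is hypothetical.

```python
def uniform_soup(state_dicts):
    """Average parameters element-wise across checkpoints with equal weights."""
    if not state_dicts:
        raise ValueError("need at least one checkpoint")
    keys = state_dicts[0].keys()
    for sd in state_dicts[1:]:
        if sd.keys() != keys:
            raise ValueError("checkpoints must share the same parameter names")
    n = len(state_dicts)
    soup = {}
    for key in keys:
        tensors = [sd[key] for sd in state_dicts]
        if len({len(t) for t in tensors}) != 1:
            raise ValueError(f"shape mismatch for parameter {key!r}")
        soup[key] = [sum(values) / n for values in zip(*tensors)]
    return soup

# Toy stand-ins for the three fine-tuned checkpoints (values are made up).
run1 = {"w": [1.0, 2.0], "b": [0.0]}
run2 = {"w": [3.0, 4.0], "b": [3.0]}
run3 = {"w": [5.0, 6.0], "b": [6.0]}

soup = uniform_soup([run1, run2, run3])
print(soup)  # {'w': [3.0, 4.0], 'b': [3.0]}
```

Averaging only makes sense here because all three runs start from the same phase 1 checkpoint, so their weights stay in a compatible region of parameter space.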

Unmasked fine-tuning was done on 512 A100 GPUs at a speed of roughly 10,450 samples/s or 20.4 samples/s/GPU.

The following plot shows the loss curve for phase 1.

More results

Zero-shot accuracies at resolution 224x224 computed with CLIP Benchmark.

| Dataset | OpenCLIP H/14 | OpenCLIP G/14 |
| --- | --- | --- |
| ImageNet | 78.0 | 80.1 |
| ImageNet-V2 | 70.8 | 73.6 |
| ImageNet-R | 89.3 | 92.1 |
| ImageNet-Sketch | 66.6 | 68.9 |
| ObjectNet | 69.7 | 73.0 |
| ImageNet-A | 59.2 | 69.3 |
| CIFAR-10 | 97.4 | 98.2 |
| CIFAR-100 | 84.7 | 87.5 |
| MNIST | 72.9 | 71.6 |
| SVHN | 56.1 | 62.5 |
| Caltech-101 | 85.0 | 86.4 |
| SUN397 | 75.2 | 74.5 |
| FGVC Aircraft | 42.8 | 49.7 |
| Country211 | 30.0 | 33.8 |
| Cars | 93.5 | 94.6 |
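For readers unfamiliar with how such zero-shot numbers are obtained, the following sketch shows the general scoring rule: each image is assigned the class whose text embedding has the highest cosine similarity with the image embedding. It is not the CLIP Benchmark code. The function names are hypothetical and the two-dimensional embeddings are made-up toy values; in practice the embeddings come from the CLIP image and text encoders, and each class embedding is usually an average over several prompt templates.

```python
import math

def normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def zero_shot_predict(image_emb, class_text_embs):
    """Return the index of the class whose text embedding is most similar."""
    img = normalize(image_emb)
    scores = [sum(a * b for a, b in zip(img, normalize(txt)))
              for txt in class_text_embs]
    return max(range(len(scores)), key=scores.__getitem__)

def top1_accuracy(image_embs, labels, class_text_embs):
    """Fraction of images whose predicted class matches the label."""
    correct = sum(zero_shot_predict(emb, class_text_embs) == label
                  for emb, label in zip(image_embs, labels))
    return correct / len(labels)

# Toy data: two classes, three images (all values are made up).
class_text_embs = [[1.0, 0.0], [0.0, 1.0]]
image_embs = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]]
labels = [0, 1, 1]

accuracy = top1_accuracy(image_embs, labels, class_text_embs)
print(f"top-1 accuracy: {accuracy:.3f}")  # top-1 accuracy: 0.667
```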

Here is a summary figure comparing G/14 and H/14 made with evals by Romain Beaumont.

What’s Next?

In the future, we may fine-tune the model to enable multilingual capabilities, or fine-tune at higher resolution. FSDP is also coming to OpenCLIP, which will allow even larger models, as is CoCa, which will allow new OpenCLIP models to also be captioners. More contributions to OpenCLIP are always welcome!

Contributions and acknowledgements

Thanks to:

And of course thanks to Emad and Stability AI for providing the compute resources used for these experiments.