LARGE SCALE OPENCLIP: L/14, H/14 AND G/14 TRAINED ON LAION-2B

by: Romain Beaumont, 15 Sep, 2022

We trained three large CLIP models with OpenCLIP: ViT-L/14, ViT-H/14 and ViT-g/14 (ViT-g/14 was trained only for about a third the epochs compared to the rest). The H/14 model achieves 78.0% zero shot top-1 accuracy on ImageNet and 73.4% on zero-shot image retrieval at Recall@5 on MS COCO. As of September 2022, this is the best open source CLIP model.

CLIP makes it possible to compute representations of images and texts to measure how similar they are. It can be used for

Zero shot classification: compare an image with the text of the class to know which class is most similar (e.g., ImageNet classification)
Retrieval: compare an image or a text to billions of text or images to find the most similar (e.g. as in clip-retrieval )
Generation
- CLIP guidance: decide a text you want to generate, then use an image generator model, and use the CLIP distance between what’s generated and the text to generate a better image (e.g., VQGAN + CLIP)
- CLIP conditioning: use a clip text embedding as input of a generator to make it generate this text directly (e.g., stable diffusion)

CLIP models are trained in a self supervised fashion on hundreds of millions or billions of (image, text) pairs.

With LAION, we produced the LAION-5B dataset that contains 5.8 billions of closely related image and text pairs.

The CLIP model ViT B/32, released by OpenAI, was initially used to filter this dataset out of common crawl.

Producing the best open source CLIP model out of this data set completes the open source replication of the excellent CLIP paper that OpenAI released one year ago.

Results

We replicated the results from openai CLIP in models of different sizes, then trained bigger models. The full evaluation suite on 39 datasets (vtab+) are available in this results notebook and show consistent improvements over all datasets.

The larger models we release today are L/14, H/14 and g/14.

L/14 was trained on JUWELS Booster supercomputer by Ross wightman. H/14 and g/14 were trained on stability cluster by Romain Beaumont . While L/14 and H/14 were trained using 34B samples from LAION-2b, g/14 used a substantially smaller sample scale for training, seeing only 12B samples (see tables for more details).

32B samples seen

Model name	Batch size	Samples seen	Text Params	Image params	Imagenet top1	Mscoco image retrieval at 5	Flickr30k image retrieval at 5
B/32	79k	34B (16 epochs of laion2B)	63.43M	87.85M	66.6%	65.4%	88.4%
L/14	79k for 14B samples, 86K for 18B	32B	123.65M	303.97M	75.3%	71.1%	92.9%
H/14	79k	32B (16 epochs of laion2B)	354.03M	632.08M	78.0%	73.4%	94%

12B samples seen

Model name	Batch size	Samples seen	Text Params	Image params	Imagenet top1	Mscoco image retrieval at 5	Flickr30k image retrieval at 5
B/32	32k	12B (32 epochs of laion400m)	63.43M	87.85M	62.9%	60.8%	85.5%
B/16	32k	12B (32 epochs of laion400m)	91.16M	86.19M	69%	63.6%	85.5%
L/14	32k	12B (32 epochs of laion400m)	123.65M	303.97M	72%	68.1%	90.8%
g/14	32k for 8B samples then 64k for 4B samples	12B (similar to 32 epochs on laion400m)	354.03M	1012.65M	76.6%	72.4%	93.5%

In addition to having overall better results, we hope the larger text encoder will help improve text understanding. The good performance on the retrieval metrics seems to be a good indicator of this property.

Note the difference in samples seen between the H/14 and the g/14 model. This explains the difference in performance. We picked this lower number to try and fix the stability issue at a lower cost. Eventually they were fixed (by using bfloat16). The performance of this model falls in the scaling curve of 12B sample seen (similar to 32 epochs of laion400m), and a g/14 trained on 32B samples of laion2B would most likely follow the same trends as the other models and get better performance as H/14.

alt_text

Released checkpoints

We release the checkpoints for the models, they are available through openclip and in HuggingFace hub at B/32 L/14 H/14 and g/14

Related works

Related work results:

Model name	Samples seen	Imagenet top1	Mscoco image retrieval at 5	Flickr30k image retrieval at 5
Openai B/32	12B (32 epochs of WIT)	62%
Openai B/16	12B (32 epochs of WIT)	69%
Openai L/14	12B (32 epochs of WIT)	75.4%	61%	87%
ALIGN	20B	76.4%	69.8%	93.3%
BASIC	32B	85.7%
CoCa	32B	86.3%	74.2%	95.7%

BASIC and ALIGN got excellent imagenet results. They used either different image encoder architecture (EfficientNet, CoAtNet), a larger network scale (BASIC-L with 2.4B params) or pre trained their network with supervised learning on a large dataset (BASIC CoAtNet vision encoder).

COCA additionally used captioning loss during training with a multi-modal text decoder which predicted text tokens autoregressively and got 86.3% top1, employing a larger model scale (2.1B params)

Scaling up notes

During these training runs, we encountered several interesting issues:

Using many GPUs means many of them can have hardware issues and can freeze, crash or even just be slow. This is a particularly annoying problem to handle as if one GPU has an issue, the synchronized nature of distributed training means that all GPUs get stuck. I created https://github.com/rom1504/gpu-tester to figure out what are the bad GPUs and exclude them
Stability issues! When scaling up the model size, the batch size and the dataset size, at around half the training the loss starts increasing until it reaches a plateau. We tried many possible things (find the list there) and eventually concluded on a surprisingly simple solution: using amp bfloat16 instead of amp float16 made the training fully stable

And also made some discoveries:

It seems using a very large batch size (up to 159k) can help reach even higher performance. This is most likely due to the fact that contrastive learning provides information to the loss as a logit matrix, hence having N times more samples in a batch means N square logits. We did not verify this systematically but BASIC paper provides more experiments and a theoretical justification for this result.
It’s possible to get a reasonably performing g/14 CLIP by doing a much shorter cosine decay => getting a 68% g/14 in 10k gpu hours.
Grad checkpointing allows to do 10x on the batch size

Training stability issues

Stability of training was the main problem we solved in this iteration of the scaling up of OpenCLIP. At around half the training (for L/14, H/14 and g/14), the loss started going up until it plateaued very high (11) and didn’t go down anymore.

We tried many possible fixes (decreasing lr, gradient shrinking, gradient clipping, cosine attention, post layer norm, …) with little to no effect when trying to resume from before the crash.

Eventually only 2 things worked:

Finishing the lr decay very fast : in 8 epochs (compared to the planned 256 epochs). That managed to get most of the performance out of clip H.
Switching from float16 to bfloat16 solved the problem while being faster for clip g. We then applied the same fix for clip H and finished its training properly.

See all the training notes with all the details on all the possible ideas that didn’t work.

Training speeds

To better understand the cost and length of training of clip, we provide these training speed numbers. All numbers assume a100 with 40GB of VRAM. We used gradient checkpointing.

Model	Batch size per gpu	Precision	Number of gpus	Sample per second per gpu
B/32	96	float16	824	228
H/14	96	float16	824	30
g/14	40	float16	800	20
H/14	96	bfloat16	824	42
g/14	80	bfloat16	800	31

The speed usually increases with batch size per gpu until a plateau is reached. The speed also increases with the number of gpu. After a certain number of gpus, the curve becomes slower than linear.

Bfloat16 which we used in the second part of training provides both better stability and faster sample/s for clip models.

What’s next

The models will be used for many applications, including clip guiding and conditioning. Even better results could be reached on models like stable diffusion by using a better clip model!

Now that the scaling properties of clip are proven in an open source reproduction, a lot of doors open. Here are some ideas of next steps:

Changing the text encoder to work in the multilingual setting (to get a model like Multilingual-CLIP but trained contrastively, with hopefully even better results!) and scale it up
Can we get clip models while using less gpu hours ? extracting the knowledge from smaller clips into a bigger one may help bootstrap the learning process (see encoder-distill from iejMac getting some preliminary results on this)
The clip idea can be expanded to other modalities, see CLAP for text-audio alignment

If you have ideas or want to help out, feel free to reach out in laion server.

Contributions

Thanks to

Romain Beaumont for running the experiments on H/14 and g/14
Ross Wightman for conducting all the openclip experiments at JUWELS Booster (Juelich Supercomputing Center) up to L/14 and providing valuable feedback during these H and g clip trainings
Phil Wang for providing ideas and code (cosine attention, post layer norm, ..) during the stability issues
Boris Dayma and Mitchell Wortsman for both proposing to try float32 that showed precision was an issue and eventually lead to trying bfloat16
Blinkdl for proposing interesting ideas regarding tuning the learning rate
Christoph Schuhmann for daring proposing to train such large clips, following up on all these experiments, and finding very early that training were frozen, saving some valuable time
Jenia Jitsev for providing ideas and feedback during the training issues, supervision and coordination of the compute grants at JUWELS Booster
Ludwig Schmidt for reviewing this post and giving many ideas about LAION datasets and CLIP
Mehdi Cherti for helping to debug the evaluation scripts and getting comparable results for MS-COCO

And of course Emad (Stability AI) for providing the many GPUs used during these experiments! (g/14 and H/14!)

For the L/14 training, we gratefully acknowledge the Gauss Centre for Supercomputing e.V. (www.gauss-centre.eu) for funding this part of work by providing computing time through the John von Neumann Institute for Computing (NIC) on the GCS Supercomputer JUWELS Booster at Jülich Supercomputing Centre (JSC), Germany.