LAION TRANSLATED: 3B CAPTIONS TRANSLATED TO ENGLISH FROM LAION5B

by: Marianna Nezhurina, Romain Beaumont, Richard Vencu and Christoph Schuhmann, 15 Sep, 2022



The Laion5B dataset was automatically collected from a section of the human-written web (Common Crawl). Can models generate data that is different and interesting compared to what humans write?

That’s a question we are interested in investigating. To let the community study it, we translated 3B samples of Laion5B from many languages into English.

We released 3 billion captions for the multilingual part of Laion5B. This makes it possible to use the whole Laion5B dataset to train English models. It also enables training models such as Multilingual-CLIP on these aligned caption pairs.

We’re curious what you will do using it!

Downloading it

The dataset is available on Hugging Face as parquet files containing the original caption, the translated caption and the image URL; a short loading sketch follows the two dataset links below.

laion1B-nolang-joined-translated-to-en

laion2B-multi-joined-translated-to-en
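To get started, the parquet shards can be pulled directly from the Hugging Face Hub. The sketch below assumes the two datasets are hosted under the laion organization and that huggingface_hub, pandas and pyarrow are installed; adjust the repo id if it differs.

```python
# A minimal loading sketch. The repo id is an assumption; point it at wherever
# the translated parquet files actually live.
import glob

import pandas as pd
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="laion/laion2B-multi-joined-translated-to-en",  # assumed repo id
    repo_type="dataset",
    allow_patterns="*.parquet",  # fetch only the parquet shards
)

# Inspect a single shard; each shard is a plain parquet file with the columns
# described in the "Dataset columns" section below.
shard = sorted(glob.glob(f"{local_dir}/**/*.parquet", recursive=True))[0]
df = pd.read_parquet(shard)
print(df.shape)
print(df.columns.tolist())
```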

Processing

Every caption of the original dataset was translated with Facebook’s M2M100 1.2B model using this script. All other fields remain the same as in the original LAION2B Multi Joined. To make the translation tractable, the original dataset was split into parts of 50k samples each; every part was translated in parallel on GPU nodes and saved as a separate parquet file. The translation speed depends on the number of nodes and GPUs: the processing was done on 20 nodes with 8 GPUs each, at a rate of 34 samples per GPU per second (including preprocessing and data loading). All translated parquet files were then merged with Spark and saved as 128 parquet files, and the resulting dataset was joined with the aesthetics scores.
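For illustration, here is a minimal sketch of the translation step, assuming the Hugging Face transformers implementation of M2M100. It is not the exact production script; batch size and max length are illustrative placeholders.

```python
# Batched caption translation with Facebook's M2M100 1.2B model (sketch only).
import torch
from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = M2M100Tokenizer.from_pretrained("facebook/m2m100_1.2B")
model = M2M100ForConditionalGeneration.from_pretrained("facebook/m2m100_1.2B").to(device)
model.eval()

def translate_to_english(captions, src_lang, batch_size=64, max_length=256):
    """Translate a list of captions from src_lang (e.g. "de") into English."""
    tokenizer.src_lang = src_lang
    translations = []
    for i in range(0, len(captions), batch_size):
        batch = tokenizer(
            captions[i : i + batch_size],
            return_tensors="pt",
            padding=True,
            truncation=True,
            max_length=max_length,
        ).to(device)
        with torch.no_grad():
            generated = model.generate(
                **batch, forced_bos_token_id=tokenizer.get_lang_id("en")
            )
        translations.extend(tokenizer.batch_decode(generated, skip_special_tokens=True))
    return translations

print(translate_to_english(["Ein Hund spielt im Garten"], src_lang="de"))
```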

Dataset columns

  • TEXT (the original text of caption)
  • LANGUAGE (language of the original TEXT)
  • ENG TEXT (translation in English of the original TEXT)
  • URL (URL of the image)
  • WIDTH (width of the image)
  • HEIGHT (height of the image)
  • Hash (hash of the URL and TEXT)
  • Pwatermark (probability of being a watermarked image, computed using our watermark detector)
  • Punsafe (probability of being an unsafe image, computed using our CLIP-based detector)
  • Similarity (cosine similarity between ViT-B/32 text and image embeddings; CLIP for en, mCLIP for multi and nolang)
  • Prediction (aesthetics score)
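These metadata columns make it easy to pre-filter the dataset before downloading any images. A small sketch with illustrative thresholds (the column-name casing in the parquet files may differ from the listing above):

```python
# Metadata-based filtering of one shard; thresholds are examples only.
import pandas as pd

df = pd.read_parquet("part-00000.parquet")  # any shard of the translated dataset
df.columns = [c.lower().replace(" ", "_") for c in df.columns]  # normalize names

filtered = df[
    (df["pwatermark"] < 0.8)   # probably not watermarked
    & (df["punsafe"] < 0.5)    # probably safe
    & (df["width"] >= 256)
    & (df["height"] >= 256)
]
print(len(df), "->", len(filtered), "rows kept")
```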

Samples from the translated dataset:

[Image: laion-2B-translated-samples]

Laion2B-multi-translated

Dataset stats

Note: the dataset stats were computed using this script.

Number of uniques: 2266M (2266193302)

Counts by WIDTH:

  • 0 <= WIDTH <= 128: 160M (160260569)
  • 128 <= WIDTH <= 256: 734M (734166164)
  • 256 <= WIDTH <= 512: 849M (849569769)
  • 512 <= WIDTH <= 1024: 457M (457572747)
  • WIDTH >= 1024: 86M (86750813)

Counts by HEIGHT:

  • 0 <= HEIGHT <= 128: 103M (103514467)
  • 128 <= HEIGHT <= 256: 614M (614490681)
  • 256 <= HEIGHT <= 512: 753M (753540968)
  • 512 <= HEIGHT <= 1024: 686M (686553437)
  • HEIGHT >= 1024: 153M (153139456)

Counts by ENG TEXT length:

  • 0 <= length <= 25: 506M (506238532)
  • 25 <= length <= 50: 849M (849160165)
  • 50 <= length <= 100: 840M (840635023)
  • 100 <= length <= 150: 136M (136709119)
  • length >= 150: 5M (5148507)
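The figures above are simple conditional counts over the metadata columns. A minimal pandas sketch of how similar counts could be reproduced for a single shard (the full statistics were computed with a dedicated script over all shards; the normalized column names here are an assumption):

```python
# Reproduce bucketed counts for one shard of the translated dataset (sketch).
import pandas as pd

df = pd.read_parquet("part-00000.parquet")
df.columns = [c.lower().replace(" ", "_") for c in df.columns]

# Unique samples, counted here by URL as an approximation.
print("uniques:", df["url"].nunique())

# Bucketed counts of the translated-caption length (inclusive bounds, as above).
eng_len = df["eng_text"].str.len()
for lo, hi in [(0, 25), (25, 50), (50, 100), (100, 150)]:
    print(f"{lo} <= len(ENG TEXT) <= {hi}:", ((eng_len >= lo) & (eng_len <= hi)).sum())
print("len(ENG TEXT) >= 150:", (eng_len >= 150).sum())
```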

Similarities between text and images

10,000 images and captions were sampled from the dataset and CLIP embeddings were computed for both (for the original, non-English texts, Multilingual CLIP was used). The dot products between image and text embeddings then give the image-text similarities for both the original and the translated captions; a small sketch of this computation is shown below, followed by the quantiles of the two similarity distributions.
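A rough sketch of that similarity check, using the Hugging Face CLIP ViT-B/32 implementation as a stand-in for the original evaluation code (Multilingual CLIP would replace the text side when scoring the original, non-English captions):

```python
# Embed images and captions with ViT-B/32 CLIP, normalize, and take per-pair
# dot products; then look at the quantiles of the resulting similarities.
import numpy as np
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_similarities(images, captions):
    """images: list of PIL images, captions: list of strings (same length)."""
    inputs = processor(
        text=captions, images=images, return_tensors="pt", padding=True, truncation=True
    )
    with torch.no_grad():
        out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return (img * txt).sum(dim=-1).numpy()  # cosine similarity per (image, caption) pair

sims = clip_similarities([Image.open("sample.jpg")], ["a dog playing in a garden"])
print(np.quantile(sims, [0.1, 0.5, 0.9]))
```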

Similarity quantiles for the original and translated captions:

Quantile   Original               Translated
10%        0.2552971839904785     0.23388671875
20%        0.2633610963821411     0.25390625
30%        0.2694466710090637     0.265869140625
40%        0.2750270366668701     0.2763671875
50%        0.28088638186454773    0.2861328125
60%        0.28750720620155334    0.29638671875
70%        0.2950591444969177     0.306884765625
80%        0.3049575388431549     0.31982421875
90%        0.32077282667160034    0.338134765625

[Image: laion-2B-translated]

Laion1B-nolang-translated

The LAION1B Nolang Joined dataset was also translated with a similar script, in the same way as the Multi part. This dataset does not have a LANGUAGE column, so the source language also had to be determined automatically, as sketched below. All other columns of the Nolang dataset are the same as in Multi, and just like the translated LAION2B Multi Joined, LAION1B Nolang was joined with the corresponding aesthetics scores.
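Since the nolang shards come without a language tag, the source language has to be known before M2M100 can translate (its tokenizer expects a src_lang code). One possible way to do this, shown purely as an illustration and not necessarily what the actual pipeline used, is fastText's lid.176 language-identification model:

```python
# Illustration only: detect the source language of a caption with fastText.
import fasttext

# lid.176.bin can be downloaded from the fastText language-identification page.
lid = fasttext.load_model("lid.176.bin")

def detect_language(caption: str) -> str:
    """Return an ISO language code such as "de", "fr" or "ja"."""
    labels, _ = lid.predict(caption.replace("\n", " "))
    return labels[0].replace("__label__", "")

print(detect_language("Un chat dort sur le canapé"))  # -> "fr"
```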

Nolang dataset stats

Number of uniques: 1260M (1260048307)

Counts by WIDTH:

  • 0 <= WIDTH <= 128: 90M (90701133)
  • 128 <= WIDTH <= 256: 409M (409575445)
  • 256 <= WIDTH <= 512: 475M (475885337)
  • 512 <= WIDTH <= 1024: 239M (239035772)
  • WIDTH >= 1024: 59M (59942110)

Counts by HEIGHT:

  • 0 <= HEIGHT <= 128: 59M (59814914)
  • 128 <= HEIGHT <= 256: 370M (370913206)
  • 256 <= HEIGHT <= 512: 451M (451897702)
  • 512 <= HEIGHT <= 1024: 316M (316723245)
  • HEIGHT >= 1024: 87M (87671543)

Counts by ENG TEXT length:

  • 0 <= length <= 25: 312M (312548202)
  • 25 <= length <= 50: 555M (555971621)
  • 50 <= length <= 100: 413M (413430230)
  • 100 <= length <= 150: 20M (20446701)
  • length >= 150: fewer than 1M (138974)

Similarities between text and images for Nolang

10,000 images and captions were randomly sampled from the dataset and CLIP embeddings were computed for both (Multilingual CLIP for the original texts). The dot products between image and text embeddings then give the image-text similarities for both the original and the translated captions. The quantiles of the two similarity distributions:

Quantile   Original               Translated
10%        0.258196085691452      0.2406005859375
20%        0.266357421875         0.2607421875
30%        0.2728866934776306     0.27490234375
40%        0.27902457118034363    0.2861328125
50%        0.28590404987335205    0.296142578125
60%        0.29329144954681396    0.306396484375
70%        0.3023602366447449     0.317626953125
80%        0.31363412737846375    0.33203125
90%        0.3313804566860199     0.353271484375

[Image: laion-1B-translated]

Credit

We thank https://stability.ai/ for providing the compute for this massive translation. It was a great use of preemptible jobs to soak up otherwise idle compute!