We’re pleased to announce the world's first large competitive debate dataset: LAION-Debate (referred to below as Debate 2B). LAION-Debate provides links to competitive debate championships, discussions, and interviews and conversations with prominent speakers, posted on YouTube by the Cambridge Union and Oxford Union debating societies of the University of Cambridge and University of Oxford on their affiliated channels.

Competitive debate datasets are scarce and hard to find in the public domain, because they are either gated by the individuals and institutions who generate them or not archived well enough to be assembled into a dataset. This hinders their use in artificial intelligence research.

In an era where datasets are becoming scarce and large AI models are exhausting known data sources, Debate 2B encourages the use of alternative credible sources and other forms of knowledge corpora that offer an outlook and understanding different from the mainstream.

Today, a member of the LAION community (tawsif) released this novel competitive debate dataset for the field of natural language processing.

## What’s Competitive Debate?

Competitive debate is a sport in which speakers from widely different backgrounds discuss relevant motions (subject matter). Subjects include, but are not limited to, philosophy, politics, history, logical fallacies, morality and ethics, and science and technology.

Speakers approach these discussions from two sides, one in support of the motion and one against it. They use speculative language, tone, logical traps, well-constructed sentences that reflect their intent, and other strategies to win the judges and audience over to their school of thought.

Both sides include knowledgeable, eloquent speakers who are well-versed in the subject matter and aim to convince the judges and audience that their school of thought is justified. In this sport, the most knowledgeable and convincing speakers tend to win, rather than those who merely state facts.

It’s a sport where logic and the art of speech meet in harmony.

## Characteristics of Debate 2B

Debate 2B is largely a collection of YouTube links pointing to the championship and discussion videos posted by the University of Oxford and University of Cambridge on their official affiliated channels. Most of these recordings are either British Parliamentary-style speeches or interviews with prominent figures conducted by students of those universities.

These interviews, conducted at both the Oxford Union and the Cambridge Union, differ widely from what the public sees on Sky News and CNN, because they are conducted by individuals who are well-versed in the art of speech and who keep a neutral stance throughout. The interviewers make sure relevant questions are addressed and draw out the interviewee's genuine opinions without any intent to sensationalise them.

## Intended fields and research routes

Debate 2B is primarily intended for natural language processing, although we understand it can also be used in the context of computer vision and reinforcement learning.

Debate 2B provides two datasets in one: an audio dataset and a textual dataset. The audio dataset can be used to fine-tune large pretrained audio generation models to generate audio that sounds logical and emotional, because these speakers used emotion and a logical tone to convey their message and convince their audience of their school of thought.

Similarly, the textual dataset offers a new form of text generation data: text that is backed by facts and shows how those facts and sentences should be structured to provide logical reasoning. We believe Debate 2B is the first dataset to have logical reasoning built in.

**Note**: We don’t provide the textual form of this dataset yet.

## Metadata and info of Debate 2B

We provide links to 2,700 hours of audio recordings, which amount to 130 GB at the highest bitrate and 40 GB at the lowest available bitrate.

- Cambridge Union links date from 19 May 2011 to 2 June 2024.
- Oxford Union links date from 6 September to 12 July 2024.
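
As a rough sanity check on these figures, the average bitrate implied by each size can be worked out from the total duration. The sketch below assumes decimal gigabytes (1 GB = 10^9 bytes) and treats the 2,700 hours as exact; both are assumptions, and the bitrates of individual files will vary.

```python
def average_bitrate_kbps(total_gigabytes: float, total_hours: float) -> float:
    """Average bitrate in kilobits per second for a collection of audio files."""
    total_bits = total_gigabytes * 1e9 * 8   # assumption: 1 GB = 10**9 bytes
    total_seconds = total_hours * 3600
    return total_bits / total_seconds / 1000


HOURS = 2700  # total duration stated above

highest = average_bitrate_kbps(130, HOURS)
lowest = average_bitrate_kbps(40, HOURS)
print(f"highest: {highest:.0f} kbps, lowest: {lowest:.0f} kbps")
```

Under these assumptions, the two sizes correspond to roughly 107 kbps and 33 kbps on average.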

## Licence

It is released under the Apache 2.0 licence.

## Downloading the dataset

Debate 2B links can be found on Hugging Face. Access is gated, and only academic and work email addresses are currently accepted, to ensure safety. Audio recordings of Debate 2B can be found on Kaggle.

<https://huggingface.co/datasets/sleeping-ai/LAION-Debate>
<https://www.kaggle.com/datasets/sleepingcat4/cambridge-2b>
<https://www.kaggle.com/datasets/sleepingcat4/oxford-2b>

## Acknowledgement

We acknowledge our LAION community member tawsif, who created the dataset and made both its audio recordings and the links to them public.

<https://github.com/sleepingcat4>
Email: <tawsif.ahmed@science.ru.nl>


LAION-Debate: dataset of competitive debates and discussions


Technologies like the recently introduced GPT-4-OMNI from OpenAI show once again the potential that strong multi-modal models have to positively transform many aspects of our lives. A particularly impressive example of this is in the field of education. Imagine every person in the world having their own personal learning assistant that acts like an attentive, caring, patient, and empathetic tutor. The demo from OpenAI last Monday showed that such a vision of the future is within reach.

## The Path to Open Multi-Modal Models

An important milestone on this path could be training an open-source model with capabilities similar to GPT-4-OMNI. The first step would be to fine-tune an existing large language model so that it can natively understand and process audio in the same way large language models currently handle text. Simultaneously, this model should be able to generate audio natively, just as it can currently output and manipulate text.

This approach has been shown to work in the [AudioPaLM paper](https://arxiv.org/abs/2306.12925):

![Audio Palm Pipeline](/images/blog/gpt-4-omni-1.png)

A promising approach to achieving this is converting audio signals into discrete tokens using codecs like SNAC. SNAC converts speech into about 80 tokens per second while allowing it to be reconstructed in very high quality. For music, sound effects, and other general-purpose audio, other versions of SNAC need around 200 tokens per second, enabling detailed understanding and generation in these domains. As a proof of concept, the initial goal would be to tune a large language model to process both text and audio tokens, with the 24 kHz version of SNAC optimized for speech being a good starting point.

SNAC (Multi-Scale Neural Audio Codec) compresses audio into discrete codes at a low bitrate, setting itself apart from other codecs like SoundStream, EnCodec, and DAC through its hierarchical token structure. This structure samples coarse tokens less frequently, covering a broader time span, which saves on bitrate and is particularly useful for language modeling approaches to audio generation.

![Audio Palm Pipeline](/images/blog/gpt-4-omni-2.png)

For instance, with coarse tokens of ~10 Hz and a context window of 2048, SNAC can effectively model the consistent structure of an audio track for up to three minutes. SNAC offers different types of codecs optimized for specific use cases: the 24 kHz version is tailored for speech, while the 32 kHz and 44 kHz versions are designed for general-purpose audio, including music and sound effects. This versatility and efficiency make SNAC an advantageous choice for integrating audio processing capabilities into large language models.
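
The context-length arithmetic above can be checked with a short calculation. This is a minimal sketch that uses only the approximate rates quoted in this post (about 10 coarse tokens per second, and about 80 tokens per second in total for the 24 kHz speech codec); the function name is ours and is not part of SNAC.

```python
def seconds_covered(context_tokens: int, tokens_per_second: float) -> float:
    """How many seconds of audio fit into a context window at a given token rate."""
    return context_tokens / tokens_per_second


CONTEXT = 2048

# Coarse tokens only (~10 Hz): the span over which long-range structure can be modelled.
coarse_span = seconds_covered(CONTEXT, 10)
# All hierarchy levels of the speech codec flattened together (~80 tokens per second).
full_span = seconds_covered(CONTEXT, 80)

print(f"coarse only: {coarse_span / 60:.1f} min, all tokens: {full_span:.1f} s")
```

At these rates the coarse tokens alone span a little over three minutes, while the same window holds only about 25 seconds once every level is included, which is why the hierarchical structure matters for language modeling.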

Additionally, SNAC can flatten its hierarchical structure segment-wise for each coarse token, allowing segments of approximately 100 ms to be decoded individually and later reassembled. This depth-first flattening method facilitates low-latency streaming, making it possible to stream high-quality audio in near real-time ([Tutorial](https://youtu.be/NwZufAJxmMA?si=WVA2H05m3xypRncc)).
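
The depth-first flattening described above can be sketched in plain Python. This is an illustrative sketch, not SNAC's own code: it assumes a hierarchy in which each level has exactly twice as many tokens as the level above it, and it represents the codes as plain lists of integers. `flatten_depth_first` and `segment_length` are hypothetical helper names.

```python
from typing import List


def flatten_depth_first(levels: List[List[int]]) -> List[int]:
    """Interleave hierarchical codes so each coarse token is followed by its finer tokens.

    Assumption: levels[k + 1] holds exactly twice as many tokens as levels[k].
    """
    for coarse, fine in zip(levels, levels[1:]):
        if len(fine) != 2 * len(coarse):
            raise ValueError("each level must be twice as long as the previous one")

    flat: List[int] = []

    def visit(depth: int, index: int) -> None:
        # Emit this token, then descend into its two children on the next level.
        flat.append(levels[depth][index])
        if depth + 1 < len(levels):
            visit(depth + 1, 2 * index)
            visit(depth + 1, 2 * index + 1)

    for i in range(len(levels[0])):
        visit(0, i)
    return flat


def segment_length(num_levels: int) -> int:
    """Number of flat tokens belonging to one coarse token (1 + 2 + 4 + ...)."""
    return 2 ** num_levels - 1


# Two coarse tokens, with made-up code values for illustration.
levels = [[10, 11], [20, 21, 22, 23], [30, 31, 32, 33, 34, 35, 36, 37]]
flat = flatten_depth_first(levels)
step = segment_length(len(levels))
segments = [flat[i:i + step] for i in range(0, len(flat), step)]
print(segments)
```

Each inner list holds everything needed for one coarse token, so a segment can be decoded as soon as it has been generated rather than after the whole sequence is complete.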

![Audio Palm Pipeline](/images/blog/gpt-4-omni-3.png)

Notebooks about how to use SNAC:

| SNAC Tokenization |
| --- |
| [24kHz Speech Version](https://colab.research.google.com/drive/11qUfQLdH8JBKwkZIJ3KWUsBKtZAiSnhm?usp=sharing) |
| [32kHz General Purpose Version](https://colab.research.google.com/drive/1g1H0bBWRhKzHutCJZNxtavpRamw1uaXr#scrollTo=pBiT7Jx6rxmm) |

To advance research in this area, we have converted the [parler-tts/mls-eng-10k-tags_tagged_10k_generated dataset](https://huggingface.co/datasets/blanchon/snac_llm_parler_tts) into 24kHz SNAC tokens.

## SNAC Tokenized Dataset

We call upon the community to experiment with pretraining large language models using these tokens. The first step would be to get an existing open-weights model like Llama, Mistral, Dbrx, Qwen, StableLM 2 or Phi-3 to generate SNAC tokens from text transcriptions and descriptions, functioning like a text-to-speech model. Once this works well, the next step would be to train the model on a variety of text data at the same time, so that it retains its text generation and understanding capabilities while acquiring the ability to generate audio tokens in response to questions or instructions.

This way, the model could be asked a question in text and provide an answer in SNAC tokens, which could then be directly decoded into spoken language. It would also be interesting to see how well even a small-scale LLM, such as Phi-3 or Qwen-1.8B, could transcribe speech when fed SNAC tokens and asked to generate the transcription. The next step would be to train a chat model that understands SNAC tokens as input and responds with text, or directly responds with SNAC tokens to text inputs.
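
One way to let a text-only model emit audio codes is to append the codec's codes to the model's vocabulary as extra token IDs. The sketch below illustrates that idea only; the vocabulary size, codebook size, number of codebooks, and special-token IDs are placeholder assumptions, not values taken from any particular model or from SNAC.

```python
from typing import List, Tuple

TEXT_VOCAB_SIZE = 32_000   # assumption: size of the base LLM's text vocabulary
CODEBOOK_SIZE = 4_096      # assumption: number of entries per audio codebook
NUM_CODEBOOKS = 3          # assumption: hierarchy levels in the speech codec

# Hypothetical special tokens placed after the text and audio ID ranges.
AUDIO_START = TEXT_VOCAB_SIZE + NUM_CODEBOOKS * CODEBOOK_SIZE
AUDIO_END = AUDIO_START + 1


def audio_code_to_token(level: int, code: int) -> int:
    """Map (codebook level, code) to a unique token ID above the text vocabulary."""
    if not (0 <= level < NUM_CODEBOOKS and 0 <= code < CODEBOOK_SIZE):
        raise ValueError("level or code out of range")
    return TEXT_VOCAB_SIZE + level * CODEBOOK_SIZE + code


def build_tts_example(text_ids: List[int], audio: List[Tuple[int, int]]) -> List[int]:
    """Concatenate text token IDs and audio tokens into one training sequence."""
    audio_ids = [audio_code_to_token(level, code) for level, code in audio]
    return text_ids + [AUDIO_START] + audio_ids + [AUDIO_END]


# Made-up IDs: three text tokens followed by one coarse and two finer audio codes.
example = build_tts_example([5, 17, 256], [(0, 7), (1, 7), (1, 4095)])
print(example)
```

Giving each codebook level its own ID range keeps codes from different levels distinct. A model trained on such sequences with the usual next-token objective would, if the approach works, behave like a text-to-speech system.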

Once we can reliably perform functions like transcribing audio segments and generating speech in response to user queries or text inputs while maintaining the LLMs' ability to generate and understand text, we can consider extended pretraining. This involves training language models on a mixture of high-quality texts and SNAC tokens from complete, longer audio recordings. There are many publicly available sources of high-quality audio data that could impart more nuances and linguistic subtleties to the LLM than currently possible with existing ASR and TTS datasets. After extended pretraining with both text and audio data, we need instruction fine-tuning with audio-to-audio instruction datasets, where both the instruction and fulfillment are provided in audio tokens.

## Audio-to-Audio Instruction Tuning Datasets

As potential sources for extended pretraining of LLMs, we collected video links from sources like Common Crawl.

[High quality podcasts, lectures & shows (330657)](https://huggingface.co/datasets/laion/links_to_pocasts_lecture_and_shows_for_tts)

For initial tests, it would be beneficial to generate both the instruction and the chatbot's response to it using TTS systems. First, we create a conventional instruction tuning dataset with a text-based LLM and then generate audio files for both the user's and the chatbot's roles with different voices. These are then converted into SNAC tokens or other audio tokens.

If this type of instruction tuning proves successful, a theoretically feasible but limited approach could be to generate an instruction tuning dataset with volunteers where one person acts as the user and another as the chatbot.

Another possibility is to perform transcription with speaker separation on podcasts, and then use an LLM like Llama to identify transitions where speaker 1 appears to issue a request and speaker 2 helpfully responds. These parts from speaker 1 and speaker 2 could be components in an audio-to-audio instruction tuning dataset.
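
As a rough illustration of the podcast idea, the sketch below pulls adjacent speaker turns out of a diarized transcript as candidate request/response pairs. The `Turn` structure and the question-mark heuristic are our own assumptions, standing in for real diarization output and for the LLM-based judgement described above.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Turn:
    speaker: str
    start: float  # seconds
    end: float    # seconds
    text: str


def looks_like_request(text: str) -> bool:
    """Placeholder heuristic; in practice an LLM would make this judgement."""
    return text.strip().endswith("?")


def candidate_pairs(turns: List[Turn]) -> List[Tuple[Turn, Turn]]:
    """Return (request, response) pairs of consecutive turns by different speakers."""
    pairs = []
    for first, second in zip(turns, turns[1:]):
        if first.speaker != second.speaker and looks_like_request(first.text):
            pairs.append((first, second))
    return pairs


turns = [
    Turn("A", 0.0, 4.2, "Could you explain what a neural audio codec does?"),
    Turn("B", 4.5, 15.0, "Sure. It compresses audio into a sequence of discrete codes."),
    Turn("A", 15.2, 17.0, "That makes sense."),
    Turn("B", 17.1, 20.0, "Glad to hear it."),
]
pairs = candidate_pairs(turns)
for request, response in pairs:
    print(f"{request.start:.1f}-{response.end:.1f}s: {request.text!r} -> {response.text!r}")
```

The start and end times of each pair can then be used to cut the matching audio, which is tokenized in the same way as the rest of the data.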

Additional ideas for audio-text tuning datasets are:

- Integrated Audio-Text Datasets: Create datasets where text segments are partially replaced with speech segments generated using Text-to-Speech (TTS) systems. This method helps the model learn to handle interleaved audio and text seamlessly.
- Cross-Modal Translation Tasks: Use models like Meta's SeamlessM4T to generate speech translations from one language to another. For instance, translate English audio clips to German, creating paired datasets to enhance the model’s multilingual audio capabilities.
- Music and Sound Effects Generation: Develop datasets containing music and sound effects with corresponding textual descriptions or generation instructions. This trains the model to understand and generate diverse audio outputs based on text or audio inputs.

## Conclusion

As a community of volunteers and hobbyists, we cannot conduct all these experiments simultaneously. Therefore, we officially call on the open-source community to start experimenting with the datasets we have converted and share their results with us. Once we achieve promising small-scale results and eventually derive scaling laws from them that predict behavior at larger scales, we can discuss how to provide computing resources for larger-scale experiments.

We look forward to your feedback and experiments. Together, we can create a future where advanced language models are accessible to all and have a positive impact on many lives.


[Join our Discord server](https://discord.com/invite/WugQF4YeT6)


Call to Build Open Multi-Modal Models for Personal Assistants


There have been reports in the press about the results of a research project at Stanford University, according to which the LAION-5B training set contains potentially illegal content in the form of CSAM. We would like to comment on this as follows:

LAION is a non-profit organization that provides datasets, tools and models for the advancement of machine learning research. We are committed to open public education and the environmentally safe use of resources through the reuse of existing datasets and models.

LAION datasets (more than 5.85 billion entries) are sourced from the freely available Common Crawl web index and offer only links to content on the public web, with no images. We developed and published our own rigorous filters to detect and remove illegal content from LAION datasets before releasing them. [See our original announcement from 20.08.2021](https://laion.ai/blog/laion-400-open-dataset/#filtering-out-unsuitable-image-text-pairs), where points 6-9 describe the specific measures we took for filtering CSAM related material.

LAION collaborates with universities, researchers and NGOs to improve these filters and is currently working with the [Internet Watch Foundation (IWF)](https://www.iwf.org.uk/) to identify and remove content suspected of violating laws. LAION invites the Stanford researchers to join its community to improve our datasets and to develop efficient filters for detecting harmful content.

LAION has a zero-tolerance policy for illegal content and, in an abundance of caution, we are temporarily taking down the LAION datasets to ensure they are safe before republishing them.

Following a discussion with the Hamburg State Data Protection Commissioner, we would also like to point out that the CSAM data is data that must be deleted immediately for data protection reasons in accordance with Art. 17 GDPR.


Safety Review for LAION 5B



## **Introduction**

Large language models (LLMs), such as OpenAI's ChatGPT and similar chatbot products from other organizations, have recently gained widespread adoption. These models can extend text or respond to instructions in a natural and helpful manner. Despite the core technologies behind LLMs, namely the transformer architecture and the GPT decoder-only causal language model, remaining relatively unchanged for over five years, the surge in popularity of ChatGPT can be largely attributed to recent approaches that better align the output of LLMs with users' and service providers' intentions.


Two primary approaches have been employed to better align large language models with human expectations. The first is known as supervised finetuning (SFT) on natural instructions, while the second is called reinforcement learning from human feedback (RLHF). Both methods aim to improve the performance and usability of LLMs, but they differ in their implementation. SFT involves training the model using labeled datasets that contain natural instructions, which helps the model understand and respond more accurately to user queries. RLHF, on the other hand, is a technique that uses human preferences as a reward signal to fine-tune models. It involves collecting a dataset of human-written demonstrations on prompts, training supervised learning baselines, and then gathering a dataset of human-labeled comparisons between two model outputs on a larger set of prompts. A reward model (RM) is trained on this dataset to predict which output labelers would prefer, and this RM is used as a reward function to fine-tune the LLM using the PPO algorithm. However, there is an "alignment tax" associated with this approach, which can result in worse performance in some situations.
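
The reward model's training signal on a labelled comparison is commonly written as a pairwise logistic loss. The sketch below shows that common formulation in plain Python; it illustrates the general idea rather than the exact recipe of any specific system, and the reward values are made up.

```python
import math


def pairwise_preference_loss(reward_preferred: float, reward_rejected: float) -> float:
    """Negative log-likelihood that the preferred output outranks the rejected one.

    Uses the logistic (Bradley-Terry style) model:
        P(preferred beats rejected) = sigmoid(r_preferred - r_rejected)
    """
    margin = reward_preferred - reward_rejected
    # -log(sigmoid(margin)) equals log(1 + exp(-margin)); written in a numerically stable form.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# The loss is small when the reward model already agrees with the labeller ...
agree = pairwise_preference_loss(2.0, -1.0)
# ... and large when it ranks the pair the wrong way round.
disagree = pairwise_preference_loss(-1.0, 2.0)
print(f"agree: {agree:.4f}, disagree: {disagree:.4f}")
```

Minimizing this loss over many comparisons pushes the reward model to score preferred outputs higher, and the resulting scalar score is what the PPO stage then optimizes against.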

![cond_pretrain_im1](https://github.com/LAION-AI/laion.ai/assets/22318853/77ce9e7d-4bdb-4fd4-b0fe-0a8d8498cea8)

**Figure 1.** An example of document tagging on a popular user generated content website. The tags inform potential readers what kind of content will be in the text without spoiling the story.


A third approach to align language models with human expectations in a more transparent and end-user controllable manner is called Conditional Pretraining. In this method, a large number of pretraining examples are tagged with labels that describe the content using human-understandable classifiers. Content tagging is used in nearly all human generated online information-sharing environments as a way to organize content, and help users find information most relevant to their interests. This labeling can be performed in a mostly unsupervised fashion, utilizing encoder-only or encoder-decoder natural language understanding (NLU) machine learning models.

There are many widely used tags online that help categorize and filter content based on user preferences. "Suitable for work" (SFW) and "not suitable for work" (NSFW) tags are commonly found on sites like Reddit, Imgur, and various online forums. Additionally, book and movie reviews often utilize the "Spoilers" tag to indicate if the review contains information that may negatively impact the enjoyment of the content. User-generated story sites, such as Archive of Our Own (AO3) and FanFiction.net, employ diverse tags to provide clear indications of the content readers can expect within the stories (Figure 1). Furthermore, labels like G, PG, PG-13, and R, have been utilized for decades to inform users about television and movie content.

By leveraging conditional pretraining, language models could be better adapted to users' interests and preferences, resulting in a more aligned and enjoyable experience.


## **Converting Existing Pretraining Data into Conditional Pretraining Data**

The prevailing method for training LLMs involves collecting vast quantities of text from the internet and feeding this minimally processed text into the LLM. The pretraining objective is to predict the subsequent word given all prior words in the training example. Often, the text is divided in a manner that allows documents to be fragmented at any point, such as in the middle of a paragraph. These fragments are then randomly incorporated into larger batches of training examples, typically ranging from 2 to 4 million tokens per training step. Although this approach has proven effective, it may not be the optimal way to train these models.

![cond_pretrain_im2](https://github.com/LAION-AI/laion.ai/assets/22318853/4e3adab4-b20c-4c91-9b2f-e2140a8902b0)

**Figure 2.** Comparison of existing LLM training strategies and the conditional pretraining approach. Theoretically every example used to train the model could be tagged.

In contrast, conditional pretraining aims to prepend each training example with a set of descriptive tags and a brief synopsis that accurately represents the text in the training example (Figure 2). These tags and synopses can be efficiently generated using fine-tuned NLU models such as BERT or T5. Although there is considerable computational cost associated with processing all the training examples, once the conditional pretraining examples are generated, they are reusable and easily understood by humans. This approach can enhance the training process and may result in more accurate and user-friendly language models.
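
A minimal sketch of the prepending step follows. The bracketed header layout and the field names are hypothetical, chosen only for illustration; the label format actually produced by a tagging model may differ.

```python
from typing import List


def build_conditional_example(tags: List[str], synopsis: str, text: str) -> str:
    """Prepend descriptive tags and a synopsis to a training document.

    The "[tags: ...]" / "[synopsis: ...]" layout is a made-up format for illustration.
    """
    header = f"[tags: {', '.join(tags)}]\n[synopsis: {synopsis}]\n"
    return header + text


document = "Open datasets let independent researchers audit how models are trained."
example = build_conditional_example(
    tags=["technology", "opinion", "suitable for work"],
    synopsis="An argument for open datasets in machine learning research.",
    text=document,
)
print(example)
```

Because the header is ordinary text, the same kind of header can be written by hand at inference time to steer what the model generates.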


## **Transparency and Accountability**

Another significant advantage of conditional pretraining is the transparency of the tags used on documents, which can be easily understood by auditors or end users of the models. At present, the instructions and reward models employed in most LLMs are proprietary and not available for public review. This lack of transparency makes it challenging to comprehend how and why models respond to culturally or politically sensitive topics. Even when there are disagreements among people about how these models should be aligned and what values they should uphold, it is difficult to engage in meaningful discussions or debates on these sensitive topics as long as the values of the organizations developing the LLMs remain concealed or obscured by carefully crafted press releases and position papers.


## **How to Prepare a Conditional Pretraining Dataset**

We have developed a fine-tuned LoRA model based on the open-source FLAN-UL2 that takes about 2,000 words of text as input and outputs the conditional pretraining labels for the document. An example output from this conditional tagging model for a recent news article about LAION in [Forbes](https://www.forbes.com/sites/hessiejones/2023/04/19/amid-growing-call-to-pause-ai-research-laion-petitions-governments-to-keep-agi-research-open-active-and-responsible/) is below. To generate these document tags, only text from the body of the article was used.

![cond_pretrain_im4a](https://github.com/LAION-AI/laion.ai/assets/22318853/741f09aa-37b8-4aa3-a2f2-365c57299137)

## **Example Outputs from a New Conditional Pretrained Model**

Below you can find a toy example of how to control the behavior of the conditional language model. In this example, the conditional labels are used to create a very unhelpful chatbot or one that is helpful. These outputs are from the base conditional pretrained model, without any explicit instruction tuning or examples of chatbots in the training data.

**<center>Adorable baby chatbot</center>**
![image](https://github.com/LAION-AI/laion.ai/assets/22318853/85aca1d8-2243-467b-a5b1-d2abc7ffad09)

**<center>Unhelpful chatbot</center>**

![cond_pretrain_im3a](https://github.com/LAION-AI/laion.ai/assets/22318853/5b0a226a-04e0-49c3-9018-c4bb678e052c)


**<center>Helpful chatbot</center>**
![cond_pretrain_im3b](https://github.com/LAION-AI/laion.ai/assets/22318853/4e3878ea-3faa-4349-9b74-8d09d960516e)


## **How to Use The Models and Contribute to This Project**

The initial code and models are available on GitHub and Hugging Face. Conditional pretrained models can be used in exactly the same way as any other large language model; just remember to prepend your conditionals to the start of your input and spend some time experimenting with which tags suit your use case.

We are in the process of converting very large pretraining datasets from the internet into conditional pretraining datasets, and if you are someone who gets excited about building large datasets, we would welcome your help with this effort. On the more experimental side of things, we are interested in developing reward models that efficiently calculate how well the outputs from conditional pretrained models conform with their conditionals. Please check out the LAION Discord or GitHub if you are interested in contributing.


If you are interested, please check out the following links:
- [Demo-Colab-Notebook](https://colab.research.google.com/drive/1fbXOqeEkqygnWKSPKddQtaMiZEc0KYFY?usp=sharing) - Colab for playing with our models.
- [7B-redpajama-conditional-alpha](https://huggingface.co/Rallio67/7B-redpajama-conditional-alpha) - Redpajama base 7B model finetuned on ~2 million 2048 context conditional pretraining examples.
- [3B-redpajama-conditional-alpha](https://huggingface.co/Rallio67/3B-redpajama-conditional-alpha) - Redpajama base 3B model finetuned on ~2 million 2048 context conditional pretraining examples.
- [neox-20b-conditional-alpha](https://huggingface.co/Rallio67/neox-20b-conditional-alpha) - gpt-neox-20B base model finetuned on ~600 thousand 2048 context conditional pretraining examples.
- [flan-ul2-20b-condlabeler-alpha](https://huggingface.co/Rallio67/condlabeler-alpha) - LoRA finetuned flan-ul2-20b model that you can use to create conditional labels for your own text. Please verify that the labels you are generating match your expectations with some texts you are already personally familiar with.
- [LAION GitHub Repository](https://github.com/LAION-AI/)
- 💬 [LAION Discord](https://discord.gg/HzJU2kuC)

## **Acknowledgements**
- [StabilityAI](https://stability.ai/) for pre-emptible compute resources.
- [EleutherAI](https://github.com/EleutherAI/gpt-neox) for open-source GPT-NeoX.
- [Hugging Face](https://huggingface.co/) for open-source model hosting and code base.
- [RedPajama-INCITE](https://www.together.xyz/blog/redpajama-models-v1) for training and releasing open-source base models.
- [google-research](https://github.com/google-research/t5x) for training and releasing open-source T5 models which we used to create conditional labels.

## **References**
Conditional pretraining is conceptually straightforward and does not require any complex mathematical arguments for its justification. If you want to read a recent academic text discussing the concept in more detail, please check out the paper by Anthropic. Conditional pretraining was also used by Google to create PaLM 2.
- [Pretraining Language Models with Human Preferences](https://arxiv.org/abs/2302.08582) by Anthropic.
- [PaLM 2 Technical Report](https://ai.google/static/documents/palm2techreport.pdf) by Google AI. Search for "control tokens" to find relevant information.


Conditional Pretraining of Large Language Models


**An Open Letter to the European Parliament: Protecting Open-Source AI for a Safe, Secure, and Sovereign Digital Future**

LAION, alongside prominent research institutions and developers, has penned an [open letter to the European Parliament](/documents/open-letter-to-eu-parliament.pdf) to express concerns about the draft AI Act's potential impact on open-source research and development (R&D) in artificial intelligence (AI). The letter highlights the importance of open-source R&D for ensuring the safety, security, and competitiveness of AI in Europe and warns against the consequences of stifling such innovation.

## The Importance of Open-Source AI

The letter outlines three main reasons why open-source AI is worth protecting:

1. **Safety through transparency:** Open-source AI promotes safety by enabling researchers and authorities to audit model performance, identify risks, and establish mitigations or countermeasures.
2. **Competition:** Open-source AI allows small to medium enterprises to build on existing models and drive productivity, rather than relying on a few large firms for essential technology.
3. **Security:** Public and private organizations can adapt open-source models for specialized applications without sharing sensitive data with proprietary firms.

## Concerns with the Draft AI Act

The draft AI Act may introduce new requirements for foundation models, which could negatively impact open-source R&D in AI. The letter argues that "one size fits all" rules will stifle open-source R&D and could:

- Entrench proprietary gatekeepers, often large firms, to the detriment of open-source researchers and developers
- Limit academic freedom and prevent the European research community from studying models of public significance
- Reduce competition between model providers and drive investment in AI overseas

## Recommendations for the European Parliament

The open letter makes three key recommendations:

1. **Ensure open-source R&D can comply with the AI Act:** The Act should promote open-source R&D and recognize the distinctions between closed-source AI models offered as a service and AI models released as open-source code. Where appropriate, the Act should exempt open-source models from regulations intended for closed-source models.
2. **Impose requirements proportional to risk:** The Act should impose rules for foundation models that are proportional to their actual risk. A "one size fits all" framework could make it impossible to field low-risk and open-source models in Europe.
3. **Establish public research facilities for compute resources:** The EU should establish large-scale supercomputing facilities for AI research, enabling the European research community to study open-source foundation models under controlled conditions with public oversight.

## The Future of AI in Europe

The letter concludes with a call to action for the European Parliament to consider the points raised and foster a legislative environment that supports open-source R&D. This approach will promote safety through transparency, drive innovation and competition, and accelerate the development of a sovereign AI capability in Europe.

With numerous esteemed supporters, including the European Laboratory for Learning and Intelligent Systems (ELLIS), a pan-European AI network of excellence, and the German AI Association (KI-Bundesverband), the letter serves as a powerful reminder of the importance of protecting open-source AI for the future of Europe.

## Supporters


- European Laboratory for Learning and Intelligent Systems (ELLIS) - Pan-European AI Network of Excellence
- German AI Association (KI-Bundesverband) - With more than 400 companies, the largest AI network in Germany
- **Prof. Jürgen Schmidhuber**: Scientific Director of the Swiss AI Lab IDSIA (USI & SUPSI), Co-Founder & Chief Scientist of NNAISENSE, Inventor of LSTM Networks
- **Prof. Sepp Hochreiter**: JKU Linz, Inventor of LSTM Networks
- **Prof. Bernhard Schölkopf**: Director, Max Planck Institute for Intelligent Systems and ELLIS Institute, Tübingen, Germany
- **Prof. Serge Belongie**: University of Copenhagen; Director, Pioneer Centre for AI
- **Prof. Andreas Geiger**: University of Tübingen and Tübingen AI Center
- **Prof. Irina Rish**: Full Professor at Université de Montréal, Canada Excellence Research Chair (CERC) in Autonomous AI and Canada CIFAR AI Chair, core member of Mila - Quebec AI Institute.
- **Prof. Antonio Krüger**: CEO of the German Research Center for AI (DFKI) and Professor at the Saarland University
- **Prof. Kristian Kersting**: Full Professor at Technical University of Darmstadt and Co-Director, Hessian Center for AI (hessian.AI)
- **Jörg Bienert**: CEO of German AI Association, CPO of Alexander Thamm GmbH
- **Patrick Schramowski**: Researcher at German Center for Artificial Intelligence (DFKI) and Hessian Center for AI (hessian.AI)
- **Dr. Jenia Jitsev**: Lab Leader at Juelich Supercomputing Center, Research Center Juelich, Helmholtz Association, ELLIS member
- **Dr. Sampo Pyysalo**: Research Fellow at the University of Turku, Finland
- **Robin Rombach**: Co-Developer of Stable Diffusion, PhD Candidate at LMU Munich
- **Prof. Michael Granitzer**: Chair of Data Science University of Passau, Germany and Coordinator of OpenWebSearch.eu
- **Prof. Dr. Jens Meiler**: Leipzig University, ScaDS.AI Center for Scalable Data Analytics and Artificial Intelligence
- **Prof. Dr. Martin Potthast**: Leipzig University, ScaDS.AI Center for Scalable Data Analytics and Artificial Intelligence, and OpenWebSearch.EU
- **Prof. Dr. Holger Hoos**: Alexander von Humboldt Professor in AI at RWTH Aachen University (Germany) and Professor of Machine Learning at Universiteit Leiden (Netherlands)
- **Prof. Dr. Henning Wachsmuth**: Chair of Natural Language Processing at the Institute of Artificial Intelligence, Leibniz University Hannover
- **Prof. Dr. Wil van der Aalst**: Alexander von Humboldt Professor in Process and Data Science at RWTH Aachen University and Chief Scientist at Celonis
- **Prof. Dr. Bastian Leibe**: Chair of Computer Vision at RWTH Aachen University (Germany)
- **Prof. Dr. Martin Grohe**: Chair for Logic and the Theory of Discrete Systems, RWTH University
- **Prof. Ludwig Schmidt**: Paul G. Allen School of Computer Science & Engineering, University of Washington
- **Dr Morten Irgens**: Vice Rector, Kristiania, Co-founder and board member of CLAIRE (the Confederation of Laboratories of AI Research in Europe), Adra (the AI, Data and Robotics Association) and NORA (the Norwegian AI Research Consortium)
- **Prof. Dr. Hector Geffner**: Alexander von Humboldt Professor in AI at RWTH Aachen University (Germany), and Wallenberg Guest Professor in AI at Linköping University, Sweden
- **Prof. Dr. Hilde Kuehne**: Goethe University Frankfurt (Germany), MIT-IBM Watson AI Lab (USA)
- **Prof. Gerhard Lakemeyer, Ph.D.**: Head of the Knowledge-based Systems Group and Chair of the Computer Science Department, RWTH Aachen University, Germany
- **Sebastian Nagel**: Crawl Engineer, Common Crawl, Konstanz, Germany

A Call to Protect Open-Source AI in Europe


We present the development and assessment of a binary classifier designed to distinguish between authentic images and images generated 
using Stable Diffusion (SD) v1.4. We will discuss the dataset employed, describe the model architecture, outline the training process, 
and present the results obtained. Furthermore, we will explore potential future work aimed at enhancing the classifier's performance. 
The source code, training parameters, and model weights are [available in this repository](https://huggingface.co/realfakerepo/realfake).

### Dataset

The training dataset was assembled in two steps. First, four image datasets were merged:

1. [`imagenet-1k`](https://huggingface.co/datasets/imagenet-1k): A widely used subset of ImageNet spanning 1,000 object classes.
2. [`laion2B-en-aesthetic`](https://huggingface.co/datasets/laion/laion2B-en-aesthetic) (parts 400 to 699): A subset of images from the LAION-5B dataset, estimated to be [aesthetic](https://github.com/LAION-AI/laion-datasets/blob/main/laion-aesthetic.md) by a model trained on top of CLIP embeddings.
3. [`imagenet-1k-SD-1.4`](https://huggingface.co/datasets/ChristophSchuhmann/Imagenet-1k-SD-1.4): A newly-created dataset that serves as a "twin" to the "real" `imagenet-1k`, containing the same 1,000 classes but generated using Stable Diffusion v1.4 with a variety of prompts per class.
4. [`DiffusionDB 2M`](https://huggingface.co/datasets/poloclub/diffusiondb): The first large-scale text-to-image prompt dataset.

Second, two million images were sampled from the merged data, ensuring an equal distribution of real and SD-generated images. Around 10% of that data 
was set aside as a validation subset to track prediction quality during training. The following table shows the number of records 
assigned to each subset. This diverse and balanced dataset provided a solid foundation for training the model.

| Label \ Subset | Training | Validation |
|----------------|----------|------------|
|      fake      |  898785  |   101215   |
|      real      |  899986  |   100014   |

The specific list of samples used in training is stored in the [`metadata/prepared.2000k.jsonl`](https://huggingface.co/realfakerepo/realfake/tree/main/metadata) file available in the repository. Each record includes information about its subset and path to the sample stored on a local disk. 
This allows for flexible selection of images for training and validation. Additionally, the folder contains smaller prepared subsets used for debugging purposes. Note that for the `imagenet-1k` dataset, the training and validation subsets were prepared such that the classes of images do not overlap.
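
As a rough illustration of how such a metadata file might be consumed, the sketch below parses JSONL records and groups samples by subset. The field names `path`, `label`, and `subset`, as well as the example paths, are assumptions made for this example; the actual schema of `prepared.2000k.jsonl` may differ.

```python
import io
import json
from collections import defaultdict


def group_by_subset(lines):
    """Group (path, label) pairs by subset, reading one JSON record per line."""
    groups = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        # "subset", "path" and "label" are hypothetical field names.
        groups[record["subset"]].append((record["path"], record["label"]))
    return dict(groups)


# Tiny in-memory stand-in for the real metadata file.
example = io.StringIO(
    '{"path": "/data/a.jpg", "label": "real", "subset": "train"}\n'
    '{"path": "/data/b.jpg", "label": "fake", "subset": "train"}\n'
    '{"path": "/data/c.jpg", "label": "fake", "subset": "valid"}\n'
)
groups = group_by_subset(example)
print({name: len(items) for name, items in groups.items()})
```

Keeping the subset assignment in the metadata rather than in the directory layout is what makes it easy to swap in the smaller debugging subsets without moving any images.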

### Model Architecture and Training Process

We selected a straightforward model architecture utilizing a fine-tuned [ConvNext Large](https://pytorch.org/vision/main/models/generated/torchvision.models.convnext_large.html) model with approximately 200 million parameters. This choice was made to obtain quick results using 8x A100 GPUs on the Stability AI cluster.

The training process employed a One-Cycle learning rate scheduler, AdamW optimizer, and basic augmentations such as affine transformations, crops, and cutouts. The model was trained for five epochs starting from pre-trained weights (imagenet-1k) with all layers unfrozen from the beginning. Investigating more sophisticated training strategies is beyond the scope of this work but may be interesting for future research.
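
For readers unfamiliar with the One-Cycle policy, the following is a minimal, framework-free sketch of a cosine one-cycle learning rate curve. It is not the training code used here: the default values for `pct_start`, `div_factor`, and `final_div_factor` loosely mirror common library defaults, and the peak learning rate in the example is an arbitrary assumption.

```python
import math


def one_cycle_lr(step, total_steps, max_lr, pct_start=0.3,
                 div_factor=25.0, final_div_factor=1e4):
    """Learning rate at `step` for a cosine one-cycle schedule (illustrative)."""
    initial_lr = max_lr / div_factor
    min_lr = initial_lr / final_div_factor
    warmup_steps = pct_start * total_steps
    if step < warmup_steps:
        # Phase 1: ramp up from initial_lr to max_lr.
        progress = step / warmup_steps
        start, end = initial_lr, max_lr
    else:
        # Phase 2: anneal from max_lr down to min_lr.
        progress = (step - warmup_steps) / (total_steps - warmup_steps)
        start, end = max_lr, min_lr
    # Cosine interpolation: returns `start` at progress 0 and `end` at progress 1.
    return end + (start - end) * (1 + math.cos(math.pi * progress)) / 2


# Hypothetical peak learning rate of 1e-3 over 100 steps.
schedule = [one_cycle_lr(step, 100, 1e-3) for step in range(101)]
print(max(schedule), schedule[0], schedule[-1])
```

The short warm-up followed by a long decay is one reason this schedule is often paired with fine-tuning runs of only a few epochs.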

### Results

The trained classifier achieved close to 99% accuracy on the validation dataset described in the Dataset section. To further test how well the model generalizes in distinguishing real from SD-generated images, we created _an additional, out-of-sample test set_. 
It comprised 2,500 images generated with SD v1.4 using a set of prompts proposed by an LLM, with each prompt generating 100 different images. In addition,
the test set included 2,500 images from the `imagenet-1k` validation set. Therefore, none of the test set images was seen during the training process.

The following plots illustrate the model's confidence levels. Analyzing the results, several interesting conclusions can be drawn:
* Views of nature, construction works, and furniture often cause confusion.
* Real images with visual noise or uncommon objects are mistakenly classified as generated images.
* Images with visually distinguishable generative artifacts (incorrectly rendered humans, wheels, airplanes, unrealistic lines) are classified as fakes with high confidence.

![](/images/blog/realfake-classifier-real-least-confident.png)
![](/images/blog/realfake-classifier-real-most-confident.png)
![](/images/blog/realfake-classifier-fake-least-confident.png)
![](/images/blog/realfake-classifier-fake-most-confident.png)

As expected, images with obvious artifacts produced by the generative model are easily classified as fakes. For instance, images with humans often include clear artifacts such as unnatural postures or impossible positions. Another interesting class of images pertains to natural landscapes. In some instances, they are easily recognized as fakes, while others confuse the model. This also holds true for construction works and some furniture images.
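
To make "most confident" and "least confident" concrete, here is a small sketch of how predictions could be ranked for such plots. It assumes the classifier outputs a probability that an image is fake and that confidence is measured as the probability assigned to the true class; the record layout and the values are made up for illustration.

```python
def rank_by_confidence(records, true_label):
    """Return image ids with the given true label, least confident first."""
    def confidence(record):
        _, _, prob_fake = record
        # Confidence in the true class: p(fake) for fakes, 1 - p(fake) for reals.
        return prob_fake if true_label == "fake" else 1.0 - prob_fake

    selected = [r for r in records if r[1] == true_label]
    return [image_id for image_id, _, _ in sorted(selected, key=confidence)]


# Hypothetical predictions: (image id, true label, predicted probability of "fake").
records = [
    ("a", "real", 0.10),
    ("b", "real", 0.45),
    ("c", "fake", 0.95),
    ("d", "fake", 0.55),
]
print(rank_by_confidence(records, "real"))
print(rank_by_confidence(records, "fake"))
```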

The inference notebook is available on [Google's Colab](https://colab.research.google.com/drive/1zZR55CpHdKaVQXhZ3yxvOu55jCDkADam).

### Limitations

It is important to note that the current model is still a work in progress. The classifier only saw images produced with Stable Diffusion V1.4, 
with all possible image artifacts that it produces. (See the example below.)

![](/images/blog/realfake-classifier-artifacts.png)

Therefore, it might be the case that the classifier pays attention to those SD-specific artifacts, and wouldn't perform that well on the output 
of other generative models.

Another possible limitation is low image resolution. The classifier resizes images to 256px per side and then crops them to 224px. It might therefore be difficult to classify high-resolution examples effectively.
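
A quick back-of-the-envelope calculation shows how much information this preprocessing discards. The sketch assumes a square resize followed by a center crop; whether the actual pipeline crops from the center or at random is not stated here.

```python
def preprocessing_footprint(width, height, resize=256, crop=224):
    """Scale factors, retained area fraction, and crop box for resize-then-crop."""
    scale_x = resize / width
    scale_y = resize / height
    # Fraction of the resized image that survives the crop.
    kept_fraction = (crop / resize) ** 2
    # Center crop box as (left, top, right, bottom); an assumption for this sketch.
    offset = (resize - crop) // 2
    crop_box = (offset, offset, offset + crop, offset + crop)
    return scale_x, scale_y, kept_fraction, crop_box


# A hypothetical 1024x1024 input is downscaled by a factor of four per side.
print(preprocessing_footprint(1024, 1024))
```

For a 1024px input, each side shrinks to a quarter of its size before cropping, so fine generative artifacts visible at full resolution may no longer be present in what the classifier sees.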

Finally, the classifier's quality has not been compared against human performance. As mentioned before, some fakes have easily recognized artifacts, while others aren't distinguishable by the human eye because of low resolution. Building a test dataset assessed by humans should give a baseline for better estimating the model's performance.

### Future Work

Building on this work, there are several avenues for further exploration:

1. Using various kinds of generative models for building a more challenging dataset to ensure that the classifier works well across 
various generative techniques.
1. Increasing input resolution to ensure that the model can capture fine details.
1. Creating a test set classified by volunteers to establish a quality baseline for better assessing the model's performance.
1. Investigating whether the classifier can be used to guide SD models (akin to GANs) to steer them towards generating more realistic images. By providing feedback on the realism of generated images, the classifier might help improve the quality of synthesized images.

### Acknowledgements and Contributions

* Christoph Schuhmann conceived the initial idea of building a binary classifier to distinguish real vs. generated images, prepared the `imagenet-1k-SD` dataset, and guided the development process.
* [Stability AI](https://stability.ai/) provided us with compute resources to store the data and train the classifier.
* The [fast.ai](https://docs.fast.ai/) library was used for quick prototyping of the initial model.
* Scalable training was done via [PyTorch-Lightning](https://lightning.ai/docs/pytorch/stable/).
* Numerous other open-source tools, models, and datasets made this work possible.


Training a Binary Classifier to Distinguish Images Generated with Stable Diffusion (v1.4) from Real Ones

## Introduction

With the rapid explosion of large language models and their applications, most notably [ChatGPT](https://openai.com/blog/chatgpt), there is a clear promise of more capable and useful AI models and systems. Such models are often compared to humans, either through the Turing test or through their performance on tasks relative to humans. Recently, these models have even achieved remarkable success on tests designed for humans, such as the LSAT. However, the limited means by which one can interact with such systems reveals a variety of opportunities for exploration and possibly discovery. We ask whether modalities can be mixed and learnt alongside one another, and whether that environment of learning offers new avenues for understanding.

With this in mind, we are excited to introduce a relatively new project at [LAION](https://laion.ai/) called General-GPT.


## Goals

In an effort to keep this concise, we enumerate our goals as follows:

1. Explore the ability to directly intertwine any modality into large language models (LLMs), such that expression of ideas and responses can be more natural and informative.
2. Allow longer contexts by inputting embedded sequences rather than operating directly on the sequences themselves. Though we may lose fine-grained details of the original sequences, it may prove useful for higher-level tasks.
3. Provide open-source tools, methods, and models that we hope extend our bigger picture goal of "democratizing AI."


## Experiments

### Text-Image Expression
Currently, our efforts have been primarily centered around experimenting with whether or not we can turn our first goal into a trainable and functioning model. In order to do so, we first simplified the problem in three ways. First, we chose to focus only on the text-image domain rather than the full gamut of modalities that we hope to include. Second, we format the problem as a straightforward mapping from $x \rightarrow y$ or $y \rightarrow x$, where $x$ represents an image embedding and $y$ represents the accompanying text. Finally, we tune on just the [MS-COCO](https://cocodataset.org/#home) [1] 2017 training set of 591753 image-caption pairs.

To construct $x$ we utilize [CLIP](https://openai.com/research/clip) [2], specifically CLIP *ViT-L/14*, to encode the images. On the other hand, we utilize [GPT-2](https://huggingface.co/gpt2) [3] as our LLM that receives mixed inputs and grounds for multimodal understanding or expression. The choice of these two models as baselines comes from their relatively reasonable scale, existing work and research, and the common dimensionality of their encodings. 

#### Image Captioning: $x \rightarrow y$ 
For this task, we introduce two specific tokens into the vocab so that the model may recognize when an embedding is being input and what that embedding is. Intuitively, the first token ("[CLIP IN]") should signal that there is an image embedding before the second token ("[\CLIP IN]"). Therefore, the training data for this task is structured as follows:

*<center>[CLIP IN] **embedding** [\CLIP IN] Caption: [MS-COCO caption ...].</center>*

With regard to training itself, we follow [CLIP prefix captioning](https://github.com/rmokady/CLIP_prefix_caption) [4] and simply insert the image embedding as a new token in between our two new tokens. Then, we introduce a dummy token as our target token at the same inserted position. Lastly, the loss for this task is just the cross-entropy between the shifted-by-1 logits and the original target indices, with the dummy token being ignored.
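
The loss described above can be sketched in plain Python as follows. This is an illustrative toy rather than the project's training code: the vocabulary of three tokens, the token ids, and the uniform logits are invented, and the value `-100` for the ignored dummy target simply follows a common convention.

```python
import math

IGNORE_INDEX = -100  # marks the dummy target at the inserted embedding position


def shifted_cross_entropy(logits, targets, ignore_index=IGNORE_INDEX):
    """Mean next-token cross-entropy, where logits[t] predicts targets[t + 1]."""
    total, count = 0.0, 0
    for step_logits, target in zip(logits[:-1], targets[1:]):
        if target == ignore_index:
            continue  # the dummy token contributes nothing to the loss
        # Log-softmax via log-sum-exp for numerical stability.
        peak = max(step_logits)
        log_norm = peak + math.log(sum(math.exp(v - peak) for v in step_logits))
        total += log_norm - step_logits[target]
        count += 1
    return total / count if count else 0.0


# Toy sequence: [CLIP IN], embedding (dummy target), [\CLIP IN], one caption token.
targets = [0, IGNORE_INDEX, 1, 2]
logits = [[0.0, 0.0, 0.0] for _ in targets]  # uniform predictions over 3 tokens
loss = shifted_cross_entropy(logits, targets)
print(loss)
```

With uniform logits over three tokens, every counted position contributes `log(3)`, which makes it easy to check that the dummy position is indeed skipped.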


| Encoded Image | Generated Caption | Original Caption|
|  :----: | :----: | :----: |
| ![Catch Example](/images/blog/general-gpt_captioning_example-1.png) | A man and a child playing baseball. | A man and a boy are playing catch in a yard. |
| ![Sleeping Dog](/images/blog/general-gpt_captioning_example-2.png) | A dog laying on a sidewalk next to a bike. | a white dog is sleeping on a street and a bicycle |

Table 1: Results of image captioning with CLIP embeddings as input into GPT-2.


#### Image Retrieval: $y \rightarrow x$
Similar to the first task, we also introduce two additional tokens: "[CLIP OUT]" and "[\CLIP OUT]." As their text suggests, they mark the position of, and act as the container for, the CLIP image embedding. The training data for this task is formatted as follows:

*<center>Caption: [MS-COCO caption ...]. [CLIP OUT][\CLIP OUT] </center>*

An interesting difference between the two tasks arises in the training procedure. Here, we must train GPT-2 to produce image representations that are as close to the original CLIP image embeddings as possible. In order to do this, we compute the mean squared error between the last hidden state at the position of the "[\CLIP OUT]" token and the original CLIP embedding. Finally, we apply the same cross-entropy loss for language modeling.
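
A minimal sketch of how the two objectives might be combined is shown below. How the terms are actually weighted is not specified in this post, so the plain sum with a hypothetical `mse_weight` parameter is an assumption, and the vectors are toy values.

```python
def mean_squared_error(predicted, target):
    """Mean squared error between two equally sized vectors."""
    if len(predicted) != len(target):
        raise ValueError("vectors must have the same dimensionality")
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(predicted)


def retrieval_loss(hidden_state, clip_embedding, lm_loss, mse_weight=1.0):
    """Language-modeling loss plus a weighted embedding regression term.

    `mse_weight` is a hypothetical knob; the weighting used in practice is not
    stated in the text.
    """
    return lm_loss + mse_weight * mean_squared_error(hidden_state, clip_embedding)


# Toy 2-dimensional vectors standing in for the hidden state and CLIP embedding.
print(retrieval_loss([1.0, 2.0], [1.0, 4.0], lm_loss=0.5))
```

Regressing the hidden state directly onto the CLIP embedding relies on both having the same dimensionality, which is one of the reasons given earlier for pairing these two models.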

| Caption      | MS-COCO | LAION-5B |
| :---: | :---: | :---: |
| Birds flying over the beach. | ![Beach Birds](/images/blog/general-gpt_coco-retrieval_example-1.png)| <img src="/images/blog/general-gpt_laion-retrieval_example-1.jpg" width=600> |
| A nightstand with a collection of books. |  ![Room with Books](/images/blog/general-gpt_coco-retrieval_example-2.png) | <img src="/images/blog/general-gpt_laion-retrieval_example-2.jpg" width=300> |

Table 2: Nearest neighbors of GPT-2 image embedding prediction within MS-COCO and LAION-5B [5].
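
Conceptually, the retrieval step looks up the stored image embeddings closest to the predicted one. The sketch below does this by brute-force cosine similarity over a toy index; it is only meant to illustrate the idea, since searching a collection the size of LAION-5B requires an approximate nearest-neighbor index rather than a linear scan. The image names and 2-dimensional vectors are invented.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def nearest_neighbors(query, index, k=1):
    """Return the names of the k index entries most similar to the query."""
    ranked = sorted(
        index.items(),
        key=lambda item: cosine_similarity(query, item[1]),
        reverse=True,
    )
    return [name for name, _ in ranked[:k]]


# Toy index of image embeddings keyed by hypothetical image names.
index = {"beach": [1.0, 0.0], "bedroom": [0.0, 1.0], "harbor": [0.8, 0.6]}
predicted = [0.9, 0.1]  # stand-in for the embedding predicted by the language model
print(nearest_neighbors(predicted, index, k=2))
```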


### Sentence Reconstruction
One significant limitation of current open-source LLMs is the constraint on context length. This constraint prevents models from effectively comprehending and reasoning over extensive background knowledge spanning thousands of sentences. To address this challenge, we propose an innovative approach that enables GPT models with a context length of 2048 or 4096, for example, to process and understand vast amounts of background information more efficiently.

As a preliminary experiment, we evaluated how reasonable our second goal was by reconstructing the original text with GPT-2 from an input of its embedded representation. In other words, we hoped to see whether we could embed sentences into some shared dimensional space and then regenerate the same tokens from those embeddings. If so, we may be able to shrink longer contexts into a series of sequence embeddings, which would be useful across diverse sets of inputs.

To model this behavior, we followed a method similar to how we performed the aforementioned image captioning. However, we avoid adding any new tokens or structuring our training data. Instead, a simple encoding of each sentence using the sentence transformer [*all-mpnet-base-v2*](https://huggingface.co/sentence-transformers/all-mpnet-base-v2) [6] is followed by the sentence itself. Then, we compute the cross-entropy loss as previously described with the output logits and target token indices.

| Original Caption | Reconstructed Caption |
| :---: | :---: |
| A man riding a motorcycle down the street. | A man riding a motorcycle down the street. |
| Two animals chasing each other in a barn. | Two animals chasing each other in a barn. |
| Two animals chasing each other in a farmhouse. | Two animals chase after a flock of farm animals in a barn. |

Table 3: Results of sentence reconstruction with *all-mpnet-base-v2* and GPT-2.


## Next Steps

Ultimately, our aim is to train GPT models to handle texts and sequences of other modalities entirely in semantic embeddings, such as sequences of CLIP embeddings for videos, where each CLIP embedding represents the image embedding of one image frame, or where one embedding could be the audio clip (CLAP) [7] embedding of 5 or 10 seconds of audio. By predicting sequences in these semantic spaces or streams of ideas, truly multimodal sequence learning could be realized, capable of learning robust and sophisticated world models by pretraining on data from various modalities.

Additionally, embeddings could be decoded by specialized decoders into different outputs, such as text, images, audio, and video, similar to what DALL-E (Ramesh et al., 2021) does with CLIP embeddings that get decoded into images. Coalescing modalities could open the door to more 

### Scale
In terms of scale, there are a few dimensions of the experimental setup that we will modify. Three such dimensions include larger models, larger datasets, and more complex data, which we expect will improve the generalization across inputs. In order to tune these larger models on richer data we also need to expand our computational resources, possibly in a distributed setting. 

We plan on introducing greater complexity to the current data by utilizing truly interleaved datasets and large context inputs. For the latter, we convert the background text into a series of sentence embeddings using a pre-trained sentence embedding model, CLIP, or the recently proposed SGPT [8]. Then, create a sequence of these sentence embeddings, effectively compressing the original lengthy text into a condensed representation that captures high-level semantic information. Next, the sequence of embeddings is provided to the GPT model with the more recent context in the form of text tokens. This additional input serves to inform the model about the specific grammar, syntax, and style of the text. The model is then tasked with generating a continuation of the text based on the thousands of sentence embeddings and the few hundred words of the most recent context.
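
The input construction described above can be sketched as follows. This is a simplified illustration under several assumptions: sentences are split with a naive regular expression, the number of sentences kept as raw text is an arbitrary parameter, and `toy_embed` is a stand-in for a real sentence embedding model.

```python
import re


def split_context(text, recent_sentences=2):
    """Split text into older background sentences and the most recent ones."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    cut = max(len(sentences) - recent_sentences, 0)
    return sentences[:cut], sentences[cut:]


def build_model_input(text, embed, recent_sentences=2):
    """Compress background sentences to embeddings; keep recent text verbatim."""
    background, recent = split_context(text, recent_sentences)
    prefix = [embed(sentence) for sentence in background]
    return prefix, " ".join(recent)


def toy_embed(sentence):
    """Hypothetical embedding function used only to make the sketch runnable."""
    return [float(len(sentence))]


prefix, recent_text = build_model_input("One. Two two. Three! Four?", toy_embed)
print(prefix, recent_text)
```

Each background sentence costs a single input position regardless of its length, which is where the savings in context length would come from.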

By representing longer contexts as a series of sequence embeddings, we enable the GPT model to reason over the entire text at once, leading to more coherent and contextually informed outputs. This method could be especially useful for tasks requiring a deep understanding of vast amounts of background information, such as generating summaries of novels, long articles, or comprehensive research papers.

Current trends suggest that these modifications will improve our results, but greater complexity may lead to instability. If that is the case, additional modifications or redesigns will be necessary; all of which will be shared as they arise.

### New Tasks
Some obvious directions we plan to investigate include the extension of the current design to other modalities such as audio and video. Additionally, we wish to understand whether an LLM can generate both text and images that play off one another. In such a case, the LLM wouldn't necessarily generate the images directly, but rather condition an image generation model. If we are able to show that image generation can be guided in an interleaved manner, then other modalities will again be an extension. 

Although our research in this direction is still preliminary and incomplete, it is highly promising, and we encourage everyone interested in this topic to join our server and contribute to our research. Part of what makes us excited for this project is all the ideas that the open-source community may come up with and even implement. For that reason, we would love any suggestions, feedback, and help!

## Notes

It is quite clear from the results that out-of-distribution inputs lead to poor results in both experiments. Though this isn't unexpected given the scale and goals of our experiments, it does hint at poor generalization in such a configuration. Further experiments will be essential in diagnosing the impacts of richer data and scale.

If you wish to contribute, stay updated, or learn a bit more about the current work, please check out the following links:
- 🧑‍💻 [GitHub Repository](https://github.com/LAION-AI/General-GPT)
- 💬 [LAION Discord](https://discord.gg/HzJU2kuC)
- 🎥 [Introduction Video](https://www.youtube.com/watch?v=LA3AC8gM6hw)


## Acknowledgements
We further thank the authors and contributors of the following works/repositories:
- [HuggingFace](https://github.com/huggingface/transformers)
- [CLIP Retrieval](https://github.com/rom1504/clip-retrieval)

Logo generated with [Craiyon](https://www.craiyon.com/)


## References

[1] Lin, T. Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., ... & Zitnick, C. L. (2014). Microsoft coco: Common objects in context. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13 (pp. 740-755). Springer International Publishing.

[2] Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., ... & Sutskever, I. (2021, July). Learning transferable visual models from natural language supervision. In International conference on machine learning (pp. 8748-8763). PMLR.

[3] Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., & Sutskever, I. (2019). Language models are unsupervised multitask learners. OpenAI blog, 1(8), 9.

[4] Mokady, R., Hertz, A., & Bermano, A. H. (2021). Clipcap: Clip prefix for image captioning. arXiv preprint arXiv:2111.09734.

[5] Schuhmann, C., Beaumont, R., Vencu, R., Gordon, C., Wightman, R., Cherti, M., Coombes, T., Katta, A., Mullis, C., Wortsman, M., Schramowski, P., Kundurthy, S., Crowson, K., Schmidt, L., Kaczmarczyk, R., & Jitsev, J. (2022). LAION-5B: An open large-scale dataset for training next generation image-text models. _ArXiv, abs/2210.08402_.

[6] Reimers, N., & Gurevych, I. (2019). Sentence-bert: Sentence embeddings using siamese bert-networks. arXiv preprint arXiv:1908.10084.

[7] Elizalde, B., Deshmukh, S., Ismail, M. A., & Wang, H. (2022). Clap: Learning audio concepts from natural language supervision. arXiv preprint arXiv:2206.04769.

[8] Muennighoff, N. (2022). Sgpt: Gpt sentence embeddings for semantic search. arXiv preprint arXiv:2202.08904.


NOTES