Do you know generative kung fu?
A primer on generative AI for everyone else
I’ve spent the past three weeks or so up to my ears in research on Generative AI and Large Language Models (LLMs) for a project. And while this isn’t a topic I thought I’d be investing this much brain power into understanding, it’s helped me contextualize all the hype around chatbots, AI-generated content creation, and the entire question of whether we should be using AI more in our day jobs. Maybe I’m just getting old and cranky.
Note: The definitions provided in this article are oversimplified at best, but they should give you the context you’ll need to hold a semi-intelligent conversation about the current state of generative AI.
The generative AI landscape is growing and evolving so quickly that even AI/ML engineers are having trouble keeping up with the latest and greatest tools. I’ve spent a healthy number of hours digging into everything from LinkedIn articles to Reddit forums to absorb as much knowledge as I can in a very short period of time - and chances are, it will all be outdated in a few short weeks. But it’s also an exciting time for those who are energized by the potential, as-yet-unimagined changes that AI could bring to the world. New methodologies, formats, and advances in technology and computing power are emerging nearly every day. The current LLM market is like a massive hot dog-eating contest, but the contestants keep getting bigger and bigger instead of…well…exploding. Yet.
For those of us who do not spend every waking hour keeping up with the latest and greatest in the world of AI, here’s a very brief, salty primer on some of the terminology and concepts, translated into human language, that you’ll need to know in order to keep up with the kids these days.
Let’s start with Vocabulary
Large Language Model (LLM) - What you end up with when you cram a bunch of data aka hot dogs (words, images, videos, audio clips, programming code, etc.) into a very powerful computer, which is then “taught” to produce new data based on all the stuff you put in it. Think: auto-complete, but smarter. Ish.
Tokens - Remember when you had to write a 500-word essay on To Kill a Mockingbird and you were about 20 words short, so you tried to add spaces between sentences and expand contractions (don’t → do not), only to realize that your word processor was counting a standard length of characters instead of actual words with spaces between them? It’s like that.
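If you want to see that in action, here’s a tiny sketch using tiktoken (OpenAI’s open source tokenizer library, which I picked purely for illustration; different models chop text up differently):

```python
# A quick illustration of tokens vs. words. The library and encoding here
# are one example tokenizer, not the only way models split up text.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # a common encoding for GPT-3.5/GPT-4-era OpenAI models

text = "Don't forget: tokens aren't the same thing as words."
tokens = enc.encode(text)

print(len(text.split()))                   # 9 words
print(len(tokens))                         # noticeably more tokens than words
print([enc.decode([t]) for t in tokens])   # see exactly how the text gets chopped up
```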
Generative AI - Remember that part in The Matrix when Keanu Reeves learns Kung Fu by downloading the program into his brain? It’s a bit like that, only if Neo were a generative AI, he could learn everything ever recorded about martial arts and then create an entirely new form of martial arts based on that data.
Imagine you were someone who had never seen or heard the Japanese language before. How difficult would it be for you to not only read a Japanese manuscript, but also to create an entirely new document (in Japanese, of course) that interprets the symbolism in that manuscript? Now, could you also explain when and where the manuscript was likely written, by whom, and all of the historical context of that period? Generative AI is aiming to be capable of exactly that: to interpret vast amounts of data, synthesize and analyze it, and produce an accurate and appropriate response that meets the user’s needs.

💡Did you know there are three different types of written Japanese characters (Hiragana, Katakana, and Kanji), which all have their own purposes and functions?
Inference - The process of figuring out the answer to a puzzle based on contextual clues and foundational knowledge without being explicitly given the answer. Like Sherlock Holmes. In an LLM, the computer statistically analyzes billions of tokens to “learn” how to predict the logical next word. Then, it “solves the puzzle” or produces an output that (hopefully) meets the parameters a user has given it. The computer itself doesn’t “understand” the language. It simply spits out the most likely next token.
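If it helps to see the “most likely next token” idea stripped to its bones, here’s a toy sketch. The candidate tokens and probabilities are completely made up; a real model scores every token in its vocabulary using billions of learned parameters:

```python
# A toy sketch of "predict the most likely next token" (greedy decoding).
# The probabilities below are invented for illustration only.

def predict_next_token(probabilities: dict[str, float]) -> str:
    """Pick whichever candidate token has the highest probability."""
    return max(probabilities, key=probabilities.get)

# Pretend the model has read the prompt "The cat sat on the" and produced
# these scores for what might come next:
next_token_probs = {
    " mat": 0.62,
    " roof": 0.21,
    " keyboard": 0.12,
    " hypothesis": 0.05,
}

print(predict_next_token(next_token_probs))  # -> " mat"
```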
Sounds like a lot of assumptions and probabilities. Can we trust the accuracy?
You clever minx, you. You’re right. There’s a lot of complex math at play here. These models take a ‘best guess’ at the right response based on probabilities and code voodoo, but they are also prone to making things up (like my 5-year-old). That made-up stuff (called hallucinations) is often close enough to reality that it can be hard to tell the difference.
Like a human, the computer learns over time with feedback. There are multiple steps of training that an LLM goes through to improve accuracy. Each step homes in on a smaller, more contextually relevant dataset, which is great for industry-specific tools, like chatbots that help lawyers, doctors, or programmers do their jobs faster and more easily. This training process can also be used to teach the model which kinds of results are not acceptable (like illegal content).
Okay, time for some more vocabulary.
Reinforcement Learning from Human Feedback (RLHF) - Like potty training your new puppy by praising them when they go outside and not on the living room rug. RLHF is the common method of training an LLM to align better with human preferences (i.e., make it sound less awkward and more human, or behave in a more expected way), in which humans give the model feedback on which of its responses are most preferable and natural.
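For the extra-curious, here’s a back-of-the-napkin sketch of the preference math underneath that “which response do you prefer?” feedback. The scores are invented, and real systems train a dedicated reward model over many thousands of these human comparisons:

```python
import math

# A human says "response A is better than response B"; the model's scoring is
# nudged so that it rates A above B. The scores below are made up.

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def preference_loss(score_preferred: float, score_rejected: float) -> float:
    """Pairwise loss: small when the preferred response already outscores
    the rejected one, large when the model has it backwards."""
    return -math.log(sigmoid(score_preferred - score_rejected))

print(preference_loss(2.0, -1.0))  # low loss: the model agrees with the human
print(preference_loss(-1.0, 2.0))  # high loss: the model needs correcting
```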
Fine-tuning - Fancy word for extra training. Like continuing education classes.
What is on the horizon for this technology?
As you can imagine (or if you’ve ever had small children or pets), depending on humans to train a model that is referencing a dataset with billions of tokens is intense. It takes a lot of time and resources, prompting researchers in both academic and commercial settings to develop solutions for this problem. The most interesting alternative that I’ve read about recently is called Retrieval-Augmented Generation.
Retrieval-Augmented Generation (RAG) - An alternative to additional rounds of human-intensive training, used to increase the accuracy and reliability of LLMs by bringing in trusted external sources to supplement the data the LLM was already trained on. Key benefits (there’s a toy sketch after this list) are:
the ability to adapt over time to new information. A typical LLM is only trained on a static set of data, which means that if you asked a chatbot who the current record holder for most albums simultaneously listed on the Billboard 200 is, you’d be disappointed in the results. Ever used ChatGPT? Go on and ask. I’ll wait. (Hint: It’s not Drake.)
more reliable, specific, and accurate responses, generated by putting greater weight on sources with authority and depth. A RAG-enabled model can even provide the sources for its responses so that users can validate them.
greater training efficiency, by allowing developers to quickly repurpose an already trained model for a different use case without having to retrain it from scratch.
faster inference speed, by giving the LLM additional, specific context to develop its response from. The LLM still uses the original model to generate the response, but it can limit its search for supporting information to a trusted source (or sources) rather than the full billion-token dataset.
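To make that recipe concrete, here’s a toy sketch of the retrieve-then-generate flow. The “knowledge base” is three made-up strings and the retriever is naive word-overlap ranking; real systems use embeddings and vector databases, but the shape of the idea is the same:

```python
# A toy sketch of RAG: retrieve a few trusted documents, then hand them to
# the LLM alongside the question, so the answer is grounded in those sources.

def retrieve(question: str, documents: list[str], top_k: int = 2) -> list[str]:
    """Rank documents by how many words they share with the question."""
    question_words = set(question.lower().split())
    scored = sorted(
        documents,
        key=lambda doc: len(question_words & set(doc.lower().split())),
        reverse=True,
    )
    return scored[:top_k]

def build_prompt(question: str, documents: list[str]) -> str:
    """Stuff the retrieved context into the prompt so the model can cite it."""
    context = "\n".join(f"- {doc}" for doc in retrieve(question, documents))
    return (
        "Answer the question using only the sources below, and cite them.\n"
        f"Sources:\n{context}\n"
        f"Question: {question}"
    )

knowledge_base = [
    "Billboard 200 chart records, updated weekly.",
    "A 2019 blog post about hot dog eating contests.",
    "An archive of album release announcements from the past month.",
]

# The assembled prompt would then go to the LLM instead of a bare question.
print(build_prompt("Who holds the record for most albums on the Billboard 200?", knowledge_base))
```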
If you’re interested in learning more about RAG, Databricks published an article on the subject that is quite thorough and accessible to most audiences.
Wait, you said BILLIONS of tokens? Where does all this data come from?
And THAT is the most prevalent question that LLM providers are tackling right now. Who owns the Intellectual Property (IP) that these models are trained on? Lawsuits have been filed by artists, authors, musicians, and even newspapers against generative AI companies. In 2024, we will likely see decisions in U.S. courtrooms on copyright suits against Microsoft/OpenAI and its primary competitor Anthropic, as well as NVIDIA, Bloomberg, and Stability AI/Midjourney. The complaints range from violations of the Digital Millennium Copyright Act (DMCA), by removing metadata that would allow copyright holders to identify violations, to distributing works (for profit) without permission, attribution, or compensation to the creators. There is even a pending case against Microsoft’s GitHub over its coding buddy, Copilot, and its use of open source code without the attribution or other conditions required by the open source licenses that code was published under.
While I’m happy to pontificate on the ethical ramifications of generative AI, I’m frankly not qualified to prescribe or predict an outcome here. At this point, we’re all waiting to see what the courts decide. The implications of those decisions could dictate the future development of the technology, as well as corresponding legal policy. In the meantime, companies are anxious to start taking advantage of the promise behind this technology, so hopefully you’re feeling a bit more prepared to participate in those conversations going forward.
If you enjoyed this post, or at least learned something from it, I’d appreciate you sharing it by clicking the button below.
If you haven’t subscribed, click here to fix that: