‘Tokenisation’ is a term bandied about by my AI-literate colleagues, commonly at least 40 years my junior. Usually, the term is used in the context of ‘it takes X number of tokens to complete this job’ or similar.
In AI, tokenisation is the breaking of a sentence or a body of text into smaller pieces (tokens) so that a computer can understand it and, based on statistics, predict what the next word (token) will be. The process takes a paragraph and chops it into words, parts of words, or even individual characters, now called tokens. This enables the system to find patterns and predict the next part of the text using probabilities derived from the context represented by the previous tokens.
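To make that concrete, here is a toy sketch in Python. The vocabulary and the splitting rule are invented purely for illustration; real tokenisers learn their word-pieces statistically from enormous amounts of text.

```python
# Toy illustration only: a hand-made vocabulary and a greedy longest-match
# splitter, showing how a word like 'Tokenisation' might be chopped into
# smaller, reusable pieces. Real tokenisers learn their pieces from data.

TOY_VOCAB = ["token", "isation", "is", "a", "first", "step", " "]

def toy_tokenise(text: str) -> list[str]:
    """Greedily match the longest known piece at each position."""
    pieces, i = [], 0
    text = text.lower()
    while i < len(text):
        match = next(
            (p for p in sorted(TOY_VOCAB, key=len, reverse=True)
             if text.startswith(p, i)),
            text[i],  # unknown character: fall back to a single letter
        )
        pieces.append(match)
        i += len(match)
    return pieces

print(toy_tokenise("Tokenisation is a first step"))
# ['token', 'isation', ' ', 'is', ' ', 'a', ' ', 'first', ' ', 'step']
```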
AI platforms like ChatGPT, Perplexity, and Claude, along with the myriad tools that have emerged from the woodwork over the past two years, are all trained to understand and generate tokens in this manner. They ‘learn’ by examining tons of text and figuring out how words and sentences relate to each other statistically. Tokenisation is the first step in this process.
When you give text to an AI tool, it does the following (a small sketch of the first two steps follows the list):
- Breaks the text into ‘tokens’.
- Assigns a number to each token (a kind of code).
- Processes these numbers to find, statistically, the patterns and relationships.
- Uses this understanding to answer questions, summarise, or generate text, or in some cases a graphic representation of the tokens.
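Here is a minimal sketch of the first two steps, continuing the toy example above. The token pieces and the numbering are made up for illustration; a real model uses a fixed vocabulary of tens of thousands of pieces built during training.

```python
# Illustration only: map each toy token to an integer ID, the numeric form
# the model actually works with. The ID numbers here are invented.

pieces = ['token', 'isation', ' ', 'is', ' ', 'a', ' ', 'first', ' ', 'step']

# Assign a number (a kind of code) to each distinct piece.
token_to_id = {piece: idx for idx, piece in enumerate(sorted(set(pieces)))}

# This sequence of numbers is what gets processed for patterns and relationships.
ids = [token_to_id[p] for p in pieces]

print(token_to_id)  # which piece got which code
print(ids)          # the text as the model 'sees' it
```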
There are different ways to break down text, and different models do it differently. This is also why the response to instructions can sometimes go crazy, as the machine does not always ‘hear’ what you thought you said. The placement of something as simple as a comma in the instructions can, and often will, alter the output.
This breaking down of text into ‘tokens’ is an essential step in the AI process. It is all about statistics and patterns, with no meaning, as we understand it, attached to the words themselves.
AI is a prediction machine: it gives you the most likely next outcome based on the instructions you have given the model. The way those instructions are interpreted, through the patterns and relationships the tokens drive, determines the outcome.
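Again purely as an illustration, here is what ‘most likely next outcome’ looks like at toy scale. The probability table is hand-made; in a real model the equivalent statistics run into the billions and are conditioned on the whole preceding context, not just the last token.

```python
# Hand-made probability table standing in for billions of learned statistics.
# Given the previous token, pick the continuation with the highest probability.

NEXT_TOKEN_PROBS = {
    "the": {"cat": 0.40, "dog": 0.35, "idea": 0.25},
    "cat": {"sat": 0.60, "ran": 0.40},
    "sat": {"on": 0.70, "down": 0.30},
}

def predict_next(previous_token: str) -> str:
    """Return the statistically most likely next token in this toy table."""
    candidates = NEXT_TOKEN_PROBS.get(previous_token, {})
    return max(candidates, key=candidates.get) if candidates else "<unknown>"

print(predict_next("the"))  # 'cat' -- the highest-probability continuation
print(predict_next("cat"))  # 'sat'
```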
It is also how AI can work across languages, and why it consumes huge amounts of energy to run the billions of statistical calculations underlying the response it gives.
The above is a vast simplification of the process, but it ‘sort of’ satisfies an old marketer like me, trying to understand this new world that has suddenly arrived on my doorstep. It also explains the limitations of the models, as the ‘training’ is done on existing data. The system has no capacity to leverage ‘knowledge,’ that human ability to put completely disconnected facts and ideas together in entirely new ways. I can only assume that this is where current research is directed: artificially building the ‘neural networks’ that we have as an outcome of millions of years of evolution.