LLMs have a lot of parameters. But what is a parameter?

When a model is trained, each word in its vocabulary is given a numerical value that captures the meaning of that word in relation to all other words, based on how the word appears in countless examples in the model’s training data.

So each word is replaced by a kind of code?

Yes. But there is more to it. The numerical value – the embedding – that represents each word is actually a list of numbers. Each number in the list represents a different aspect of meaning that the model has extracted from its training data. The length of this list is another thing the LLM designer can specify before the LLM is trained. A common size is 4,096.

So each word inside an LLM is represented by a list of 4,096 numbers?

Yes. That list is the embedding, and each of those numbers is adjusted during training. An LLM whose embeddings are 4,096 numbers long is said to have 4,096 dimensions.
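If it helps to see it concretely, here is a minimal sketch in Python (using NumPy, a tiny made-up vocabulary, and random numbers standing in for trained values) of what an embedding table is and how a word is looked up in it:

```python
import numpy as np

# A toy vocabulary; a real LLM's tokenizer has tens of thousands of entries.
vocab = {"table": 0, "chair": 1, "astronaut": 2}

d_model = 4096  # the embedding length ("dimensions") chosen by the designer

# The embedding table: one row of d_model numbers per word in the vocabulary.
# Here the values are random; in a real model they are learned during training.
rng = np.random.default_rng(seed=0)
embedding_table = rng.normal(size=(len(vocab), d_model))

# Looking up a word gives back its list of 4,096 numbers.
table_vector = embedding_table[vocab["table"]]
print(table_vector.shape)  # (4096,)
```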

Why 4,096?

This may seem like a strange number, but LLMs (like anything that runs on a computer chip) work best with powers of two: 2, 4, 8, 16, 32, 64, and so on. LLM engineers have found that 4,096 is a power of two that hits a sweet spot between capacity and efficiency. Models with fewer dimensions are less capable; models with more dimensions are too expensive or too slow to train and run.

Using more dimensions lets an LLM capture richer information about how a word is used in different contexts, what subtle shades of meaning it can carry, how it relates to other words, and so on.
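To make the cost side concrete, here is a rough back-of-the-envelope sketch; the 50,000-word vocabulary is an illustrative assumption, and it counts only the embedding table, not the rest of the model:

```python
# Every number in the embedding table is one trainable parameter.
vocab_size = 50_000  # illustrative figure; real tokenizer vocabularies vary

for d_model in (1024, 4096, 16384):
    embedding_params = vocab_size * d_model
    print(f"{d_model} dimensions -> {embedding_params:,} embedding parameters")

# Prints roughly:
#   1024 dimensions -> 51,200,000 embedding parameters
#   4096 dimensions -> 204,800,000 embedding parameters
#   16384 dimensions -> 819,200,000 embedding parameters
# And the embedding table is only one slice of the model; the layers that
# process those vectors grow roughly with the square of the dimension.
```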

In February, OpenAI released GPT-4.5, the company’s largest LLM to date (some estimates put its parameter count at more than 10 trillion). Nick Ryder, an OpenAI research scientist who worked on the model, told me at the time that larger models could work with additional information, such as emotional cues that signal when a speaker’s words carry hostility: “All these subtle patterns that come through in human conversations — those are the pieces that these larger and larger models will pick up on.”

The result is that all the words inside the LLM get encoded into a high-dimensional space. Imagine thousands of words floating in the air around you. Words that are close to each other have similar meanings. For example, “table” and “chair” would float near each other, while “astronaut” would sit farther away, closer to “moon” and “musk”. Far off in the distance you might spot an unrelated word like “forecast”. It’s something like that, except the words inside the LLM are related to each other across 4,096 dimensions instead of just three.
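As an illustration, here is one common way that “closeness” between embeddings is measured, cosine similarity, sketched with hand-picked three-dimensional vectors standing in for real 4,096-dimensional ones:

```python
import numpy as np

def cosine_similarity(a, b):
    """Higher values mean two vectors point in more similar directions."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Hand-picked toy vectors; real embeddings have 4,096 learned values each.
embeddings = {
    "table":     np.array([0.9, 0.8, 0.1]),
    "chair":     np.array([0.8, 0.9, 0.2]),
    "astronaut": np.array([0.1, 0.2, 0.9]),
}

print(cosine_similarity(embeddings["table"], embeddings["chair"]))      # close to 1.0
print(cosine_similarity(embeddings["table"], embeddings["astronaut"]))  # much lower
```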

Oh.

This is a baffling thing. In effect, an LLM compresses the entire Internet into a single giant mathematical structure that encodes an unfathomable amount of interconnected information. This is why LLMs can do amazing things and why they are impossible to fully understand.
