a WhatsApp user and whose number appears on his company website.
That’s where the LLM found the number. It being a valid WhatsApp number is a coincidence.
I hate Facebook with a furious, seething passion and have no vested interest in defending any of their shit, but this really does sound like the model made up a random number, with it being a total coincidence that the number belonged to an actual WhatsApp user.
An LLM cannot generate random numbers. It has to pull from a list of numbers its model was built to include.
The latter statement isn’t true at all, and I’m not sure why you think it is. A basic GPT model reads a sequence of tokens and predicts the next one. Any sequence of tokens is possible, and each digit 0-9 is likely its own token, as is the case in the GPT-2 tokenizer.
An LLM can’t generate random numbers in the sense of a proper PRNG simulating draws from a uniform distribution; the output will probably have some kind of statistical bias. But it doesn’t have to produce sequences contained in the training data.
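To make the point concrete, here’s a toy sketch in plain Python (not an actual language model — the digit choices here are uniform, which a real model’s wouldn’t be) showing that emitting one digit token at a time can produce a number that was never stored anywhere as a whole:

```python
import random

# Toy illustration: if each digit 0-9 is its own token, a model can emit
# any 10-digit sequence by sampling one digit at a time, even a sequence
# that never appeared in its training data.
random.seed(0)
digits = "0123456789"
phone = "".join(random.choice(digits) for _ in range(10))

print(phone)  # some 10-digit string assembled digit by digit
```

A real LLM’s per-digit probabilities would be biased by context (area codes, common prefixes), but the mechanism is the same: the full number is assembled token by token, not looked up from a list.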
It could have the whole phone number as a string
A full phone number could be in the tokenizer vocabulary, but any given one probably isn’t in there
No way, tokens are almost always sub-word length.
Using longer tokens means you need a far larger vocabulary to represent the same data, so encoders always learn to use short tokens unless they’re specifically forced not to.
Just to put it in perspective: imagine you were trying to come up with a system for writing down every phone number. The easiest system would be a vocabulary of 10 items (the digits); with that vocabulary you can write down any phone number. Storing each entire phone number as a single ‘word’ would instead require a vocabulary of 10 billion items to cover every 10-digit number.
That’s why encoders learn to use the smallest token sizes possible.
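The arithmetic in the analogy above is easy to check:

```python
# Vocabulary needed to write any 10-digit phone number:
digit_vocab = 10                 # one token per digit 0-9
whole_number_vocab = 10 ** 10    # one token per complete 10-digit number

print(digit_vocab, whole_number_vocab)  # 10 vs 10,000,000,000
```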
LLMs can’t generate random numbers, but the process of selecting the next token involves drawing a pseudorandom sample from the distribution over possible next tokens. The ‘Temperature’ setting controls how closely that random selection follows the distribution in the model’s output vector.
At one extreme of the Temperature scale it always chooses the highest-probability next token (essentially what the person you’re responding to thinks happens), and at the other extreme it ignores the distribution entirely and chooses a uniformly random token. The middle of the range is basically ‘how much do I want the distribution to affect my choice?’
In the end, the choice of the next token is really random. What’s happening is that the LLM is predicting the distribution among all possible tokens so that the sentence fits into its model of how language works.
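A minimal sketch of how temperature-scaled sampling works (plain Python; the function name and the example logits are made up for illustration, and real implementations work on large logit vectors, not three values):

```python
import math
import random

def sample_with_temperature(logits, temperature, rng):
    # Temperature divides the logits before softmax: low T sharpens the
    # distribution toward the argmax (near-greedy), high T flattens it
    # toward uniform.
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Draw one token index according to the resulting distribution.
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

rng = random.Random(0)
logits = [2.0, 1.0, 0.1]

# Very low temperature behaves almost greedily: index 0 wins every time.
low_t = [sample_with_temperature(logits, 0.01, rng) for _ in range(100)]
print(set(low_t))  # {0}
```

With a high temperature (say 100) the same call would spread its picks across all three indices, which is the ‘ignores the distribution’ end of the scale described above.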
I know, I’m just saying it’s not theoretically impossible to have a phone number as a token. It’s just probably not what happened here.
the choice of the next token is really random
It’s not random in the sense of a uniform distribution which is what is implied by “generate a random [phone] number”.
Unless the person is using math terms elsewhere, I always assume people mean ‘unexpected’ when they say random.
It’s not random in the sense of a uniform distribution which is what is implied by “generate a random [phone] number”.
Yeah, true.
There, I was speaking more to the top level comment’s statement that an LLM cannot generate random numbers. Random numbers are pretty core to how chatbots work… which is what I assumed they meant instead of the literal language model.
You could say that they’re technically correct in that the actual model only produces a deterministic output vector for any given input. Randomness is added in the implementation of the chatbot software through the design choice of having the software treat the language model’s softmax’d output as a distribution from which it randomly chooses the next token.
But, I’m assuming that the person isn’t actually making that kind of distinction because of the second sentence that they wrote.
a WhatsApp user and whose number appears on his company website.
Either the number was in the training set, or it came up on the Web search somehow.
Shhhhh, if someone doesn’t know at this point, they can’t be told.