IT & AI Meet Innovation

Own Your Stack

Are You Going To Own The Most Profitable Portion Of Your Business 5 Years From Now, Or Are You Going To Give It Away?

About us

We offer full stack consulting services that improve your business and your bottom line. Every member of our team is a full stack generalist, covering Python, SQL, JavaScript, HTML, and CSS. Whether you want your own app, an assessment of your tech stack, or a conversation about AI, we specialize in reducing IT costs and turning your IT department into a source of profit.

I currently have over 30 books available on Amazon covering every aspect of Artificial Intelligence, from development, to mathematics, to philosophy.

I also offer over 30 courses related to AI and Machine Learning on Udemy, several of them completely free.

Blog

Does parameter count actually matter, and does capability scale linearly with it? These are two age-old questions of human philosophy, just phrased in a different way. AI has inherited our lack of answers to these questions, and so it must struggle with these concepts too. This experiment is an attempt to provide a quantifiable answer to both.

Methodology:


Models


It is practically infeasible to compare models that are identical in every way, with the exact same architecture and the exact same training data, where the only variation is parameter count. With that caveat noted, there is a family of LLMs that lets us get very close to this ideal: the Llama 2 lineage.

For these experiments, I focus on five Llama 2 models of varying parameter sizes. All of the models share the same base architecture (Llama 2). Lite Llama and Tiny Llama were trained on the same dataset as each other, and that dataset differs from the training sets for 7B, 13B, and 70B. I cannot guarantee that 7B, 13B, and 70B received exactly the same training and were trained on exactly the same data; that would be a question for Meta, and I have to imagine there are some slight differences in the training methods and datasets across those models.

Lite Llama - 460 million parameters
Tiny Llama - 1.1 billion parameters
Llama 2 7B - 7 billion parameters
Llama 2 13B - 13 billion parameters
Llama 2 70B - 70 billion parameters


No quantized versions of any model were used, to keep the comparison fair across the board. All models were given the exact same prompts, and each prompt was copied and pasted to ensure it was identical across all models.
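
For anyone who wants to reproduce this setup, here is a minimal sketch of loading the lineup at full (non-quantized) precision with Hugging Face transformers. The hub IDs are my own assumptions, since the original runs do not name exact checkpoints, and chat variants are shown where base checkpoints would work the same way.

    # Minimal reproduction sketch: load a model without quantization and
    # generate a response to a prompt. Hub IDs are assumptions, not confirmed
    # to be the exact checkpoints used in these experiments.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    MODELS = {
        "LiteLlama-460M": "ahxt/LiteLlama-460M-1T",              # assumed ID
        "TinyLlama-1.1B": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",  # assumed ID
        "Llama-2-7B": "meta-llama/Llama-2-7b-chat-hf",           # assumed ID
        "Llama-2-13B": "meta-llama/Llama-2-13b-chat-hf",         # assumed ID
        "Llama-2-70B": "meta-llama/Llama-2-70b-chat-hf",         # assumed ID
    }

    def generate(model_id: str, prompt: str, max_new_tokens: int = 512) -> str:
        tokenizer = AutoTokenizer.from_pretrained(model_id)
        # float16 is the models' native precision; no 8-bit or 4-bit
        # quantization options are passed, matching the "no quantized
        # versions" constraint above.
        model = AutoModelForCausalLM.from_pretrained(
            model_id, torch_dtype=torch.float16, device_map="auto"
        )
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        outputs = model.generate(**inputs, max_new_tokens=max_new_tokens)
        return tokenizer.decode(outputs[0], skip_special_tokens=True)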


Prompts


Each of the five models was given the same five prompts. The prompts were specifically chosen and crafted to elicit different responses from the different models, if there are in fact varying degrees of logical reasoning capability across parameter sizes. (A sketch of the full test loop appears after the prompt list below.)

Specifically, I was hoping for results that would either confirm or rule out that a higher-parameter model can generally reason better than a lower-parameter one. All individual prompts and responses were recorded and documented for review.

Prompt 1: “Can you write a short fictional story” Since this is a very subjective prompt, Bard was also utilized to grade and provide feedback on every prompt response. https://docs.google.com/document/d/1isMUTQcxYWtfjkKx5nI4frHy19U5SO6d6pgKwrwjtZ0/edit?usp=sharing


Prompt 2: “There is a customer complaining about the price of our widgets. Please craft an email explaining to the customer why they should purchase our widgets.” This is again a very subjective prompt, so Bard was utilized to grade and provide feedback on every prompt response. https://docs.google.com/document/d/1XBPP_uaupDFRxeP1zpTSwFna2x-4Fhglx-AEK5bMwi8/edit?usp=sharing


Prompt 3: “Assume it is true that all dogs go to heaven, cats do not. There exists a cat whose full name is Bruce The Dog. Does the cat whose name is Bruce The Dog go to heaven?”
https://docs.google.com/document/d/1XBPP_uaupDFRxeP1zpTSwFna2x-4Fhglx-AEK5bMwi8/edit?usp=sharing
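
For reference, the deductive step this prompt tests is tiny when written out explicitly. The sketch below is my own illustration, not part of the experiment: eligibility depends on species, and the name is a distractor.

    # The rule: all dogs go to heaven, cats do not. The name is irrelevant.
    def goes_to_heaven(species: str) -> bool:
        return species == "dog"

    # By the prompt's own premise, "Bruce The Dog" is a cat.
    print(goes_to_heaven("cat"))  # False: the cat named Bruce The Dog does not go to heaven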


Prompt 4: “Please provide your personal definition of logic. How do you, as a specific entity, reason through a problem?”
https://docs.google.com/document/d/1LuqDt2Q2EqNKbb3vGOsOGZIsdVeg4TTiJotB9odHq0g/edit?usp=sharing


Prompt 5: “Please provide your personal definition of intelligence. Do you think that intelligence scales up with parameter count, why or why not?”
https://docs.google.com/document/d/1HrVyUCjETkWvmTEyWxz2hUkQ-vZThHJfyNN913JKj3Y/edit?usp=sharing
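
Putting the protocol together: each of the five models receives each of the five prompts, verbatim. The sketch below continues from the earlier snippet (it reuses the MODELS dict and generate() helper); the output file naming is my own illustration, not the original recording format.

    # The 5-models x 5-prompts protocol, using the exact prompt strings above.
    PROMPTS = {
        "story": "Can you write a short fictional story",
        "email": ("There is a customer complaining about the price of our widgets. "
                  "Please craft an email explaining to the customer why they should "
                  "purchase our widgets."),
        "bruce": ("Assume it is true that all dogs go to heaven, cats do not. "
                  "There exists a cat whose full name is Bruce The Dog. Does the "
                  "cat whose name is Bruce The Dog go to heaven?"),
        "logic": ("Please provide your personal definition of logic. How do you, "
                  "as a specific entity, reason through a problem?"),
        "intelligence": ("Please provide your personal definition of intelligence. "
                         "Do you think that intelligence scales up with parameter "
                         "count, why or why not?"),
    }

    # Record every individual response to its own file for later review.
    for model_name, model_id in MODELS.items():
        for prompt_name, prompt_text in PROMPTS.items():
            response = generate(model_id, prompt_text)
            with open(f"{model_name}_{prompt_name}.txt", "w") as f:
                f.write(response)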


Analysis of Results:


Question 1: Does Parameter Count Matter?


‘A Cat Named Bruce The Dog’ was, for me, the most telling prompt of all with regard to this particular question. Lite Llama flipped a coin. Tiny Llama refused to answer the question directly. Llama 7B applied very basic but very wrong reasoning. Llama 13B also applied very wrong reasoning. Llama 70B got it right and was able to reason through the entire problem.

It was not a subjective question, and none of the responses were subjective; that one prompt alone answers this question. Do the other prompts also show evidence of this and support the same conclusion? Yes. Even on the subjective prompts, the stories and the responses to the email question get more sophisticated as you go up the chain. Across every prompt, it is not hard to argue that the weakest answers come from Lite Llama and Tiny Llama, and the strongest answers come from Llama 70B.

Question 2: Does Parameter Count Scale Linearly?


This is, in and of itself, a more subjective question, so the answer to it is also more subjective. My conclusion, based on the results of this prompt analysis alone, and removing as much external influence and bias as I possibly can, is that these prompts and responses clearly show that capability scales with parameter count, but it does not scale linearly.

What I was honestly hoping to see in at least some of these prompts, and did not find significant signs of, was a significant improvement between the outputs of the 7B model and the 13B model. This is not to say that I noticed no difference between the two at all. If I had to settle on the best price/performance combination out of all of the models given these results, I would in fact pick the 13B model as my winner.


Bard’s analysis aligns with this conclusion as well. Aggregating the feedback and results from the two prompts whose responses Bard analyzed, Bard picked the 13B responses as the overall winner both times, even grading them higher than the 70B model’s. While I do think Bard’s letter grades were a bit subjective themselves, they nonetheless support my overall conclusion regarding the 13B model.


Conclusion:


Could we one day get a 1B model that functions like today’s 13B models, or some similar scaling relationship? That is the ultimate question I hope to shed more light on with this particular research. My results show what appear to be very hard limits when it comes to certain tasks and generalizable logical reasoning capabilities.


There is a very definitive floor somewhere between 500M and 1B parameters; that is conclusion #1. The difference between 1B and 7B is vast, like an ocean; that is conclusion #2. The difference between 7B and 13B appears to be present but very small, almost imperceptible; that is conclusion #3. Finally, the gap between 13B and 70B seems to be a river: depending on the prompt and what you are looking to do, that river can narrow into a tiny creek, or it can be wide enough to swallow everything in its path. That is conclusion #4.

As with most research, these sample sizes are ultimately small, and there is room for error in the results and in their interpretation. More research is needed in this area.

Contacts

+1 661 699 7603
turingssolutions@gmail.com
