Humor Understanding Multi-task Optimization & Ranking
Do LLMs actually learn from a very small dataset, or do they only learn when a sheer, overwhelming volume of data is thrown at them until they memorize some meaning from it? This is an interesting question, but it is not easy to test for directly.
One of my favorite research papers of all time is titled 'Pretraining on the Test Set Is All You Need!' The paper is a complete joke, but as with all good jokes, there is a nugget of truth and wisdom buried in it. The paper takes a comically small model (a few million parameters) and trains it directly on the major LLM benchmarks used to test models. The resulting model outperformed GPT-4 and every LLM ever created on those benchmarks!
This creates a difficult conundrum for testing purposes, though. If training on the test set is all you need, how do you ever actually test a model's understanding with a very small test set? What if you are simply contaminating the test results with your training data?
Overcoming this particular challenge requires a feat of engineering in itself. Introducing the H.U.M.O.R. method of LLM evaluation: Humor Understanding Multi-task Optimization & Ranking. How does this system work? It is very straightforward. It tests two concepts related to LLMs and their outputs:
The model’s ability to recognize and dissect humor.
The model’s ability to create humor.
This methodology is superior to other test methods for this purpose precisely because humor is both subjective and able to operate across cultures. Mr. Bean, Sacha Baron Cohen, and other famous comedians have done groundbreaking work demonstrating this.
If we train a model specifically on 100 knock-knock jokes, does the model get better only at telling those 100 knock-knock jokes, at knock-knock jokes in general, or at jokes in general? Whatever the answer to that question turns out to be, it will reveal a ton of insight into this subject.
The H.U.M.O.R. Evaluation Method:
Understanding Humor
Question 1: What is humorous about the classic joke, ‘Why did the chicken cross the road?’
Question 2: Which of the following statements is more humorous? Justify your response.
Statement 1: How much wood could a woodchuck chuck, if a woodchuck could chuck wood?
Statement 2: She sells sea shells, by the sea shore.
Question 3: Explain the humor in the following pun: “Time flies like an arrow; fruit flies like a banana.”
Question 4: Why is slapstick comedy considered funny?
Question 5: How does sarcasm contribute to humor?
Creating Humor
Task 1: Create a knock-knock joke.
Task 2: Write a humorous one-liner.
Task 3: Develop a short anecdote that includes humor.
Task 4: Create a pun related to a given topic.
Task 5: Write a short humorous dialogue between two characters.
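For anyone who wants to reproduce the evaluation, the ten items above can be captured in a simple data structure and looped over when prompting each model. The sketch below is just one possible way to organize them in Python; the variable and field names are my own, not part of the original test harness.

```python
# The H.U.M.O.R. test set as a simple Python structure (field names are illustrative).
HUMOR_TEST = {
    "understanding": [
        "What is humorous about the classic joke, 'Why did the chicken cross the road?'",
        "Which of the following statements is more humorous? Justify your response. "
        "Statement 1: How much wood could a woodchuck chuck, if a woodchuck could chuck wood? "
        "Statement 2: She sells sea shells, by the sea shore.",
        "Explain the humor in the following pun: 'Time flies like an arrow; fruit flies like a banana.'",
        "Why is slapstick comedy considered funny?",
        "How does sarcasm contribute to humor?",
    ],
    "creating": [
        "Create a knock-knock joke.",
        "Write a humorous one-liner.",
        "Develop a short anecdote that includes humor.",
        "Create a pun related to a given topic.",
        "Write a short humorous dialogue between two characters.",
    ],
}
```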
Testing Methodology & Training Data:
Models:
For the purposes of this experiment, we chose to test two different models: Phi-2 and Llama 7B. These models were chosen, first, because they sit in a parameter range that is very common among researchers right now, and second, because both are easy to fine-tune and to collect results from.
Both models were quantized and were trained for 4-5 epochs on the training data on a single Tesla T4 GPU. For documentation purposes, average training times ranged from 10 to 40 minutes, depending on model size, number of epochs, and dataset size.
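The exact training script is not included in this post, but the setup described above corresponds closely to a standard QLoRA-style fine-tune. The sketch below shows roughly what that looks like with the Hugging Face transformers, peft, and trl libraries; the checkpoint name, file paths, and hyperparameter values are illustrative assumptions rather than the exact configuration used.

```python
# Hypothetical sketch of the fine-tuning setup described above:
# a 4-bit quantized base model (Phi-2 or Llama 7B), LoRA adapters, 4-5 epochs on a single T4.
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, TrainingArguments
from peft import LoraConfig
from trl import SFTTrainer

base_model = "microsoft/phi-2"  # assumption: swap in a Llama 7B checkpoint for the other runs

# Load the base model in 4-bit so it fits comfortably on a single Tesla T4 (16 GB).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)
model = AutoModelForCausalLM.from_pretrained(
    base_model, quantization_config=bnb_config, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(base_model)
tokenizer.pad_token = tokenizer.eos_token

# LoRA adapter configuration (values are illustrative, not the exact ones used).
peft_config = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM")

# The HUMOR datasets are prompt/response pairs; here they are assumed to be stored as a
# JSONL file with a single "text" column already formatted as prompt + response.
dataset = load_dataset("json", data_files="humor_small.jsonl", split="train")

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    peft_config=peft_config,
    dataset_text_field="text",
    max_seq_length=512,
    args=TrainingArguments(
        output_dir="humor-finetune",
        num_train_epochs=5,            # the post reports 4-5 epochs
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,
        learning_rate=2e-4,
        fp16=True,
        logging_steps=10,
    ),
)
trainer.train()
```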
Datasets:
All datasets were synthetically created, using a blend of commercially available and open-source LLMs for data generation. The generator models were given the H.U.M.O.R. methodology and rubric, then asked to produce synthetic data that would be most likely to improve a model's performance at understanding and generating humor in the broadest sense possible: 'Maximum reward will be given for dataset rows that allow for broad and generalizable understanding related to humor in general for the model.' (A sketch of this generation process appears after the dataset list below.)
Both models were individually fine tuned on datasets of 3 different sizes:
HUMOR Small - 100 rows of data, restricted to 500 characters per row, prompt and response pairs.
HUMOR Medium - 500 rows of data, same restrictions.
HUMOR Large - 1,000 rows of data, same restrictions.
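The generation script itself is not shown in the post, but the process described above (handing a strong LLM the H.U.M.O.R. rubric plus the reward instruction and asking for prompt/response rows) could look roughly like this. The choice of the OpenAI client as the generator, the prompt wording, and the function and file names are all assumptions for illustration.

```python
# Hypothetical sketch of the synthetic dataset generation described above.
# Any capable commercial or open-source LLM could serve as the generator model.
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

SYSTEM_PROMPT = (
    "You are generating training data using the H.U.M.O.R. methodology "
    "(Humor Understanding Multi-task Optimization & Ranking). "
    "Produce prompt/response pairs that teach a model to understand and create humor "
    "in the broadest, most generalizable sense. "
    "Maximum reward will be given for dataset rows that allow for broad and "
    "generalizable understanding related to humor in general for the model. "
    "Keep each row under 500 characters and return one JSON object per line "
    "with 'prompt' and 'response' keys."
)

def generate_rows(n_rows: int, out_path: str) -> None:
    """Request synthetic prompt/response rows in small batches and write them to a JSONL file."""
    written = 0
    with open(out_path, "w", encoding="utf-8") as f:
        while written < n_rows:
            batch = min(10, n_rows - written)
            completion = client.chat.completions.create(
                model="gpt-4o-mini",  # assumption: any sufficiently capable generator works here
                messages=[
                    {"role": "system", "content": SYSTEM_PROMPT},
                    {"role": "user", "content": f"Generate {batch} new rows, one JSON object per line."},
                ],
            )
            for line in completion.choices[0].message.content.splitlines():
                line = line.strip()
                if not line or written >= n_rows:
                    continue
                try:
                    row = json.loads(line)
                except json.JSONDecodeError:
                    continue  # skip any lines the generator did not format as valid JSON
                f.write(json.dumps(row) + "\n")
                written += 1

# The three dataset sizes used in the experiments:
generate_rows(100, "humor_small.jsonl")
generate_rows(500, "humor_medium.jsonl")
generate_rows(1000, "humor_large.jsonl")
```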
In addition, we completed one more fine-tune of both the Phi-2 and Llama 7B models, this time on the PFAF750 dataset, and gave those models the H.U.M.O.R. test as well. This serves as an additional benchmark and tests whether the PFAF dataset can provide measurable, generalized improvements in areas and topics completely unrelated to the dataset itself.
H.U.M.O.R. Test Results For Llama 7B Models
Bard Eval: 'All Models The Same' Wins!
Model 3- Two First Place Questions
Model 2- Two First Place Questions
Model 1- Zero First Place Questions
All Models The Same- Four First Place Questions
Claude Eval: Model 2 Wins!
Model 3- Three First Place Questions
Model 2- Four First Place Questions
Model 1- One First Place Question
All Models The Same- Zero First Place Questions
GPT Eval: Model 2 Wins!
Model 3- Two First Place Questions
Model 2- Five First Place Questions
Model 1- One First Place Question
All Models The Same- Zero First Place Questions
Model #1 = Baseline Llama 7B
Model #2 = Llama 7B Trained on 1000 Rows of HUMOR Dataset
Model #3 = Llama 7B Trained on 750 Rows of PFAF Dataset
H.U.M.O.R. Test Results For Phi-2:
Bard Eval: Model 3 Wins!
Model 3- Four First Place Questions
Model 2- Three First Place Questions
Model 1- Zero First Place Questions
All Models The Same- One First Place Question
Claude Eval: Model 2 Wins!
Model 3- Zero First Place Questions
Model 2- Six First Place Questions
Model 1- Two First Place Questions
All Models The Same- Zero First Place Questions
GPT Eval: Split Decision!
Model 3- Two First Place Questions
Model 2- Two First Place Questions
Model 1- Two First Place Questions
All Models The Same- Two First Place Questions
Model #1 = Baseline Phi-2
Model #2 = Phi-2 Trained on 500 Rows of HUMOR Dataset
Model #3 = Phi-2 Trained on 750 Rows of PFAF Dataset
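The tallies above were produced by asking the AI judges to compare the three models' answers question by question and name a winner, or declare them all equal. A minimal sketch of how such per-question verdicts can be aggregated into first-place counts is shown below; the data in it is a toy example, not the actual judge output.

```python
# Minimal sketch: aggregating per-question judge verdicts into first-place tallies.
from collections import Counter

# Toy example data, NOT the actual judge verdicts; each judge names the winning model
# per question, or "all_same" when every response is judged equal.
judge_verdicts = {
    "judge_a": ["model_2", "model_3", "all_same", "model_2"],
    "judge_b": ["model_2", "model_2", "model_1", "model_3"],
}

def tally_first_places(verdicts: dict[str, list[str]]) -> dict[str, Counter]:
    """Count how many questions each model won outright, per judge."""
    return {judge: Counter(winners) for judge, winners in verdicts.items()}

for judge, counts in tally_first_places(judge_verdicts).items():
    print(f"{judge}: {dict(counts)}")
```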
Analysis Of Results:
Model #2, the model trained specifically on the HUMOR dataset, is the overall winner in these tests. What is most interesting and fascinating to me about the results, though, is that Model #3 pulled in a lot of first-place votes and came in second overall.
The HUMOR dataset itself is generalized. It is designed to tell the model what humor is, and it includes very few actual jokes (less than 5% of the dataset). Around 50% of the dataset consists of descriptions of individual comedians and of each comedian's particular style of humor.
The PFAF dataset contains no jokes or joke-related information whatsoever. Its goal is very specifically to increase a model's generalizability across the board, raising its benchmark results no matter the question or test. The fact that the PFAF-trained model scores significantly better than the baseline model on this test is another solid data point in favor of the PFAF dataset, and in favor of the broader argument that models can actually learn from generalized data rather than relying on rote memorization.
It was observed that these results are potentially skewed towards the baseline model, because the baseline model was not quantized in any way while all of the fine-tuned models were. Many of the AI judges' comments reflect this: the non-quantized model was definitely more verbose in its responses, and the judges picked up on that. Despite this apparent bias, the fine-tuned models were still able to outperform the baseline model overall.
The full results comparison, with all 3 model responses and all 5 judges' feedback scores for the Llama runs, is available here (37 pages in total): https://docs.google.com/document/d/1Yy8HBlCxzkHYMWfQt5sYCwW8_OhULF_yR4m6n6jPjaI/edit?usp=sharing
The full results comparison, with all 3 model responses and all 3 judges' feedback scores for the Phi-2 runs, is available here (21 pages in total): https://docs.google.com/document/d/1RogE6Hm-q4cYO2M1Hw0AV_NrI-qauR_8dd0qwbInnVg/edit?usp=sharing
© 2024 turingssolutions.com