26 July - AI with Whiskey

#GenAI world on đŸ”„đŸ”„ again with arguably the craziest week to date. Apologies for skipping the previous edition. When on vacation with kids, waterparks >> LinkedIn

but first.. đŸ„ƒWhiskey pairing suggestion.. Johnnie Walker King George V. Recently, I had the pleasure of enjoying this luxurious whiskey with my dad to celebrate his 70th. Hints of roasted nuts, dark chocolate and an imposing sweet smokiness. Exceptional and mature, just like my dad!

Meta

  • Meta released a massive open model, Llama 3.1-405B, that finally closes the gap with frontier models, outcompeting GPT-4o and Claude 3.5 Sonnet on many benchmarks. Meta also upgraded their popular 70B and 8B models and released a full LLM stack and safety tools.

  • Huge inflection point establishing that the “future of AI is open”. Very well aligned with IBM’s philosophy.

  • Includes tool use, multilingual agents, complex reasoning, a long 128k context window, and coding assistants. Performing very well in our initial client tests.

  • Both Nvidia’s Nemotron-4 340B and Llama 3.1-405B allow users to generate synthetic data to train smaller models, permit distillation to reduce the number of parameters, and can serve as teacher models. GPT/Claude licenses don’t allow this.
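
To make the distillation point concrete, here’s a minimal sketch of using a large open model as a “teacher” to produce synthetic chat-format training data for a smaller “student” model. It assumes an OpenAI-compatible endpoint; the URL, key, and model name are placeholders, not a real provider’s values.

```python
# Sketch: generate synthetic training data from a large "teacher" model
# to later fine-tune a smaller "student" model.
# base_url, api_key and model name are illustrative placeholders.
import json
from openai import OpenAI

client = OpenAI(base_url="https://your-provider.example/v1", api_key="YOUR_KEY")

seed_questions = [
    "Explain vector databases to a product manager.",
    "Summarize the tradeoffs of RAG vs. fine-tuning.",
]

with open("synthetic_train.jsonl", "w") as f:
    for q in seed_questions:
        resp = client.chat.completions.create(
            model="llama-3.1-405b-instruct",   # teacher (placeholder name)
            messages=[{"role": "user", "content": q}],
            temperature=0.7,
        )
        # One chat-format training example per line, ready for fine-tuning.
        f.write(json.dumps({"messages": [
            {"role": "user", "content": q},
            {"role": "assistant", "content": resp.choices[0].message.content},
        ]}) + "\n")
```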

  • The trajectories of closed-source and open-weight models have finally converged

  • Uh Oh! - Meta released these models under a custom license which puts constraints on usage, and does not provide any indemnification to protect enterprises. They did provide the weights and training methodology in a nicely detailed 92-page paper, but didn’t share the training data.

  • Meta will not release its models in the EU for commercial use, due to the unpredictable nature of the European regulatory environment. Similar to Apple delaying Apple Intelligence in the EU. Fascinating thread by Giorgos on why Europe is not a tech leader.

OpenAI

  • Released GPT-4o mini, a smaller, cheaper sibling of GPT-4o that handles text & images (more modalities coming), has good guardrails, and outcompetes other LLMs in its weight class, e.g. 82% on MMLU. Fine-tuning is now available.
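
The fine-tuning support is worth calling out. With the OpenAI Python SDK the flow looks roughly like this; the JSONL path is a placeholder and the exact snapshot name may differ:

```python
# Sketch: kick off a GPT-4o mini fine-tuning job via the OpenAI SDK.
# train.jsonl is a placeholder file of chat-format examples.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

f = client.files.create(file=open("train.jsonl", "rb"), purpose="fine-tune")
job = client.fine_tuning.jobs.create(
    training_file=f.id,
    model="gpt-4o-mini-2024-07-18",  # fine-tunable snapshot at launch
)
print(job.id)  # poll this job until it completes
```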

  • GPT-4o mini is very economical and will start an aggressive pricing war

  • Let’s pick a real-life use case to show the economics. Say we are summarizing a 30-min call transcript into a one-page summary, and we do it 1,000 times. Making realistic assumptions on token usage, it would cost (a back-of-the-envelope calculator follows this list):

    • $40-$80 with Llama 3.1 405B on Azure/IBM/Databricks etc.

    • $30-$40 with best in class GPT-4o or Claude 3.5 Sonnet

    • $12-$30 with Llama 3.1 70B on Azure/AWS/IBM etc.

    • $3-$4 with Llama 3.1 8B on Azure/AWS/IBM etc.

    • only $1 on GPT-4o-mini on Azure
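
For anyone who wants to sanity-check these numbers, here is a minimal calculator. The token counts and per-million-token prices are illustrative assumptions (a 30-min transcript ≈ 6,000 input tokens, a one-page summary ≈ 600 output tokens), not quoted rates; plug in your provider’s actual pricing.

```python
# Back-of-the-envelope LLM cost estimator.
# All prices are illustrative placeholders ($ per 1M tokens) -- check your provider.
INPUT_TOKENS = 6_000    # ~30-minute call transcript (assumption)
OUTPUT_TOKENS = 600     # ~one-page summary (assumption)
RUNS = 1_000

price_per_1m = {        # (input $, output $) per 1M tokens -- illustrative
    "Llama 3.1 405B": (5.33, 16.00),
    "GPT-4o":         (5.00, 15.00),
    "Llama 3.1 70B":  (2.68, 3.54),
    "Llama 3.1 8B":   (0.30, 0.61),
    "GPT-4o mini":    (0.15, 0.60),
}

for model, (p_in, p_out) in price_per_1m.items():
    cost = RUNS * (INPUT_TOKENS * p_in + OUTPUT_TOKENS * p_out) / 1_000_000
    print(f"{model:>15}: ${cost:,.2f}")
```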

  • Performance is very impressive; tied at #1 in human preference on the LMSYS leaderboard

  • The Uh Oh! Small LLMs are typically open-sourced, transparent about data/training, and able to run locally on a mobile device. GPT-4o mini is none of these.

  • Released SearchGPT, a prototype for AI search with the latest web content, proper citations, and content partnerships. Google lost $75B in market cap yesterday, but has immediately released upgrades to their Gemini AI search experience in 40 languages across 230+ countries and territories.

  • Rumored to be building a model codenamed "Strawberry" to achieve human-like reasoning capabilities by planning ahead and autonomously navigating the web to do ‘deep research’ and execute complex tasks. It sounds like an evolution of Stanford’s STaR (Self-Taught Reasoner). SearchGPT could be a step towards something bigger.
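
For context, the STaR loop is simple to state: sample a rationale and answer, keep only the examples whose answers check out, fine-tune on those, and repeat. A minimal sketch below; generate and fine_tune are stubs standing in for real model calls, not any actual API.

```python
# Minimal sketch of one STaR (Self-Taught Reasoner) iteration.
# `generate` and `fine_tune` are placeholders for real model calls.
from typing import Callable

def star_iteration(model, problems: list[dict],
                   generate: Callable, fine_tune: Callable):
    keep = []
    for p in problems:  # p = {"question": ..., "answer": ...}
        rationale, answer = generate(model, p["question"])
        if answer == p["answer"]:  # keep only verifiably correct rationales
            keep.append({"question": p["question"],
                         "rationale": rationale,
                         "answer": answer})
    # Fine-tune on the self-generated, filtered rationales; repeat to improve.
    return fine_tune(model, keep)
```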

  • Released CriticGPT, a GPT-4-based model trained as a critic, to augment humans in identifying LLM errors and providing feedback

  • OpenAI unveiled a five-tier system to gauge progress towards Artificial General Intelligence: 1- Chatbots, 2- Reasoners, 3- Agents, 4- Innovators, 5- Organizations. Seemed more like marketing to me. It’s usually not a linear progression, and it’s difficult to agree on an empirical evaluation.

Mistral

  • A day after Meta’s benchmark-smashing release, Mistral released their largest model, Large 2 - a 123B-parameter model that comes close to, and sometimes beats, Llama 3.1-405B, Claude 3.5 Sonnet and GPT-4o, which are reportedly >3x its size. At this stage we need tougher evals. Some great enhancements during training to reduce hallucination and improve reasoning.

  • This is the first time Mistral has opened up their largest model. You can get a commercial license through them or partners like IBM watsonx. Love the price/performance ratio of open models now.

  • Mistral also released other models in the past few weeks:

    • NeMo - by far their best small open model. 12B parameters, very large 128k context length, strong performance in 9 languages, and a more efficient tokenizer; built in collaboration with NVIDIA

    • Research purposes only, commercial use prohibited:

      • Codestral Mamba - a cutting-edge code generation model supporting 80+ programming languages. Uses the Mamba2 architecture, offering fast code generation with context windows up to 256k tokens.

      • MathÎŁtral - a specialized 7B model designed for math reasoning and scientific discovery

CrowdStrike + Windows outage

  • Last Friday the world turned blue, with 8.5M Windows screens across airports, hospitals, banks etc. showing the infamous Windows Blue Screen of Death, disrupting 5k+ flights and causing an estimated $5B+ in losses

  • CRWD stock lost 25%+ ($20B+ in market cap) over the past few days

  • Root cause was a buggy update to the Falcon sensor software. Ideally, an app issue would only crash the app: normally only the operating system provider has access to the OS kernel, which is protected from 3rd-party apps. Unfortunately, in 2009, the EU forced Microsoft to allow critical apps (e.g. security software) access to the protected Windows kernel. The CrowdStrike update bug tried to access a restricted place in the Windows kernel, causing the entire OS to crash. In Apple’s case, 3rd parties don’t have any access to the macOS kernel.

  • There are typically release protocols in place (multiple levels of checklists, system integration testing, staged rollout etc.) to prevent these types of catastrophic updates. The testing tools didn’t catch it. This was a systemic failure by CrowdStrike & Microsoft.
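
For the curious, the staged-rollout idea is easy to sketch: push the update to a small slice of the fleet, watch crash telemetry, and only expand if it stays healthy. The stage sizes, threshold, and deploy/crash_rate hooks below are illustrative, not CrowdStrike’s actual process.

```python
# Sketch of canary/staged-rollout gating: expand only while crash
# telemetry from the just-updated slice stays below a threshold.
STAGES = [0.001, 0.01, 0.10, 1.00]   # fraction of fleet per stage (illustrative)
MAX_CRASH_RATE = 0.0001              # abort threshold (illustrative)

def roll_out(update, fleet, deploy, crash_rate):
    """deploy(update, hosts) pushes the update; crash_rate(hosts) reads telemetry."""
    done = 0
    for frac in STAGES:
        target = fleet[done:int(len(fleet) * frac)]
        deploy(update, target)
        if crash_rate(target) > MAX_CRASH_RATE:
            raise RuntimeError("canary failed -- halting rollout")
        done = int(len(fleet) * frac)
    return done  # number of hosts successfully updated
```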

  • Facepalm moment - they are offering $10 Uber Eats vouchers as an apology.

Other GenAI models/news..

  • Apple released DCLM 7B - a truly open-source LLM, based on OpenELM, trained on 2.5T tokens with an MMLU score of 63.72 (better than Mistral 7B). It’s one of the most open models in the industry, sharing the training data, full rights, and full transparency into training methods. Paper

  • Hugging Face released SmolLM models (135M, 360M, & 1.7B) capable of running directly in the browser; they beat Qwen 1.5B, Phi 1.5B and more. Trained on just a 650B-token open dataset.

  • Salesforce released xLAM 1.35B & 7B Large Action Models, along with a 60K-example instruction fine-tuning dataset for research. The 7B model scores 88.24% on BFCL, and the 1.35B scores 78.94%

  • Google DeepMind AI wins silver medal at Intl. Math Olympiad.

  • Elon Musk announced xAI is building the world’s most powerful AI cluster, with 100,000 liquid-cooled H100s on a single RDMA fabric. For reference, Llama 3.1-405B was trained on 16k H100s for ~30M GPU-hours. Grok-2 is expected to release in August, and Grok-3 by December.

  • Microsoft introduced SpreadsheetLLM, which transforms Excel files into more LLM-friendly formats and drastically improves accuracy.
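
The core idea, serializing a grid into compact text an LLM can reason over, is easy to approximate with pandas. This naive markdown dump is just an illustration of the concept, not Microsoft’s actual encoding (which adds structural compression, anchor cells, etc.); the file path and question are placeholders.

```python
# Naive illustration: flatten a spreadsheet into a text table for an LLM.
# SpreadsheetLLM's real encoding is far more sophisticated.
import pandas as pd

df = pd.read_excel("report.xlsx", sheet_name=0)  # placeholder path
prompt = (
    "Here is a spreadsheet as a markdown table:\n\n"
    + df.to_markdown(index=False)
    + "\n\nQuestion: what is the largest quarter-over-quarter change?"
)
print(prompt)  # feed this to any chat model
```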

  • NVIDIA released RankRAG, a framework that has a single LLM optimize both context ranking and answer generation, improving RAG accuracy.
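
The pattern behind it: the same LLM first scores each retrieved passage for relevance, then answers using only the top-ranked ones. A rough sketch, where llm is a placeholder for any chat-completion call (this is not NVIDIA’s actual framework code):

```python
# Rough sketch of rank-then-generate: one LLM both ranks retrieved
# passages and writes the final answer. `llm` is a placeholder callable
# that takes a prompt string and returns the model's text response.
def rank_rag_answer(llm, question: str, passages: list[str], k: int = 3) -> str:
    scored = []
    for p in passages:
        score = float(llm(
            f"Rate 0-10 how relevant this passage is to the question.\n"
            f"Question: {question}\nPassage: {p}\nReply with a number only."
        ))
        scored.append((score, p))
    top = [p for _, p in sorted(scored, reverse=True)[:k]]
    context = "\n\n".join(top)
    return llm(f"Answer using only this context:\n{context}\n\nQuestion: {question}")
```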

  • CharXiv evaluates how well multimodal LLMs can understand and interpret charts. It has 2,323 diverse charts from scientific papers and tests LLMs on (1) descriptive questions about examining basic chart elements, and (2) reasoning questions that require synthesizing information across complex visual elements in the chart.

--------
🔔 If you like such content, I encourage you to connect on my LinkedIn ♻ Recommend your friends to subscribe to this free ‘AI with Whiskey’ newsletter
--------