26 July - AI with Whiskey

#GenAI world on đŸ”„đŸ”„ again with arguably the craziest week to date. Apologies for skipping the previous edition. When on vacation with kids, waterparks >> LinkedIn

but first.. đŸ„ƒWhiskey pairing suggestion.. Johnnie Walker King George V. Recently, I had the pleasure of enjoying this luxurious whiskey with my dad to celebrate his 70th. Hints of roasted nuts, dark chocolate and an imposing sweet smokiness. Exceptional and mature, just like my dad!

Meta

  • Meta released a massive open model, Llama 3.1-405B, that finally closes the gap with frontier models, outcompeting GPT-4o and Claude 3.5 Sonnet on many benchmarks. Meta also upgraded their popular 70B and 8B models and released a full LLM stack and safety tools.

  • Huge inflection point establishing that the “future of AI is open”. Very well aligned with IBM’s philosophy.

  • Includes tool use, multilingual agents, complex reasoning, a long 128k context window, and coding assistants. Performing very well in our initial client tests.

  • Both Nvidia’s Nemotron-4 340B and Llama 3.1-405B allow users to generate synthetic data to train smaller models, permit distillation to reduce the number of parameters, and can serve as teacher models. GPT/Claude licenses don’t allow this.
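
To make the distillation point concrete, here’s a minimal sketch of using a large open model as a “teacher” to produce synthetic chat-format training data for a smaller “student” model. It assumes an OpenAI-compatible endpoint; the URL, key, and model name are placeholders, not a real provider’s values.

```python
# Sketch: generate synthetic training data from a large "teacher" model
# to later fine-tune a smaller "student" model.
# base_url, api_key and model name are illustrative placeholders.
import json
from openai import OpenAI

client = OpenAI(base_url="https://your-provider.example/v1", api_key="YOUR_KEY")

seed_questions = [
    "Explain vector databases to a product manager.",
    "Summarize the tradeoffs of RAG vs. fine-tuning.",
]

with open("synthetic_train.jsonl", "w") as f:
    for q in seed_questions:
        resp = client.chat.completions.create(
            model="llama-3.1-405b-instruct",   # teacher (placeholder name)
            messages=[{"role": "user", "content": q}],
            temperature=0.7,
        )
        # One chat-format training example per line, ready for fine-tuning.
        f.write(json.dumps({"messages": [
            {"role": "user", "content": q},
            {"role": "assistant", "content": resp.choices[0].message.content},
        ]}) + "\n")
```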

  • The trajectories of closed-source and open-weight models have finally converged

  • Uh Oh! - Meta released these models under a custom license which puts constraints on usage, and does not provide any indemnification to protect enterprises. They did provide the weights and training methodology in a nicely detailed 92-page paper, but didn’t share the training data.

  • Meta will not release its models in the EU for commercial use, due to the unpredictable nature of the European regulatory environment. Similar to Apple delaying Apple Intelligence in the EU. Fascinating thread by Giorgos on why Europe is not a tech leader.

OpenAI

  • Released GPT-4o mini, a smaller, cheaper sibling of GPT-4o that handles text & images (more modalities coming), has good guardrails, and outcompetes other LLMs in its weight class, e.g. 82% on MMLU. Fine-tuning is now available.
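
The fine-tuning support is worth calling out. With the OpenAI Python SDK the flow looks roughly like this; the JSONL path is a placeholder and the exact snapshot name may differ:

```python
# Sketch: kick off a GPT-4o mini fine-tuning job via the OpenAI SDK.
# train.jsonl is a placeholder file of chat-format examples.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

f = client.files.create(file=open("train.jsonl", "rb"), purpose="fine-tune")
job = client.fine_tuning.jobs.create(
    training_file=f.id,
    model="gpt-4o-mini-2024-07-18",  # fine-tunable snapshot at launch
)
print(job.id)  # poll this job until it completes
```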

  • GPT-4o mini is very economical and will start an aggressive pricing war

  • Let’s pick a real-life use case to show the economics. Say we are summarizing a 30-min call transcript into a one-page summary, and we do it 1,000 times. Making realistic assumptions on token usage, it would cost (a back-of-the-envelope calculator follows this list):

    • $40-$80 with Llama 3.1 405B on Azure/IBM/Databricks etc.

    • $30-$40 with best in class GPT-4o or Claude 3.5 Sonnet

    • $12-$30 with Llama 3.1 70B on Azure/AWS/IBM etc.

    • $3-$4 with Llama 3.1 8B on Azure/AWS/IBM etc.

    • only $1 on GPT-4o-mini on Azure
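
For anyone who wants to sanity-check these numbers, here is a minimal calculator. The token counts and per-million-token prices are illustrative assumptions (a 30-min transcript ≈ 6,000 input tokens, a one-page summary ≈ 600 output tokens), not quoted rates; plug in your provider’s actual pricing.

```python
# Back-of-the-envelope LLM cost estimator.
# All prices are illustrative placeholders ($ per 1M tokens) -- check your provider.
INPUT_TOKENS = 6_000    # ~30-minute call transcript (assumption)
OUTPUT_TOKENS = 600     # ~one-page summary (assumption)
RUNS = 1_000

price_per_1m = {        # (input $, output $) per 1M tokens -- illustrative
    "Llama 3.1 405B": (5.33, 16.00),
    "GPT-4o":         (5.00, 15.00),
    "Llama 3.1 70B":  (2.68, 3.54),
    "Llama 3.1 8B":   (0.30, 0.61),
    "GPT-4o mini":    (0.15, 0.60),
}

for model, (p_in, p_out) in price_per_1m.items():
    cost = RUNS * (INPUT_TOKENS * p_in + OUTPUT_TOKENS * p_out) / 1_000_000
    print(f"{model:>15}: ${cost:,.2f}")
```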

  • Performance is very impressive; tied at #1 in human preference on the LMSYS leaderboard

  • The Uh Oh! Small LLMs are typically open-sourced, transparent about data/training, and able to run locally on a mobile device. GPT-4o mini is none of these.

  • Released SearchGPT, a prototype for AI search with the latest web content, proper citations, and content partnerships. Google lost $75B in market cap yesterday, but has immediately released upgrades to their Gemini AI search experience in 40 languages across 230+ countries and territories.

  • Rumored to be building a model codenamed "Strawberry" to achieve human-like reasoning capabilities by planning ahead and autonomously navigating the web to do ‘deep research’ and execute complex tasks. It sounds like an evolution of Stanford’s STaR (Self-Taught Reasoner). SearchGPT could be a step towards something bigger.
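
For context, the STaR loop is simple to state: sample a rationale and answer, keep only the examples whose answers check out, fine-tune on those, and repeat. A minimal sketch below; generate and fine_tune are stubs standing in for real model calls, not any actual API.

```python
# Minimal sketch of one STaR (Self-Taught Reasoner) iteration.
# `generate` and `fine_tune` are placeholders for real model calls.
from typing import Callable

def star_iteration(model, problems: list[dict],
                   generate: Callable, fine_tune: Callable):
    keep = []
    for p in problems:  # p = {"question": ..., "answer": ...}
        rationale, answer = generate(model, p["question"])
        if answer == p["answer"]:  # keep only verifiably correct rationales
            keep.append({"question": p["question"],
                         "rationale": rationale,
                         "answer": answer})
    # Fine-tune on the self-generated, filtered rationales; repeat to improve.
    return fine_tune(model, keep)
```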

  • Released CriticGPT, a GPT-4-based model trained as a critic, to augment humans in identifying LLM errors and providing feedback

  • OpenAI unveiled a five-tier system to gauge progress towards Artificial General Intelligence: 1- Chatbots, 2- Reasoners, 3- Agents, 4- Innovators, 5- Organizations. Seemed more like marketing to me. It’s usually not a linear progression, and it’s difficult to agree on an empirical evaluation.

Mistral

  • A day after Meta’s benchmark-smashing release, Mistral released their largest model, Large 2 - a 123B-parameter model that comes close to, and sometimes beats, Llama 3.1-405B, Claude 3.5 Sonnet and GPT-4o, which are reportedly >3x its size. At this stage we need tougher evals. Some great enhancements during training to reduce hallucination and improve reasoning.

  • This is the first time Mistral has opened up their largest model. You can get a commercial license through them or partners like IBM watsonx. Love the price/performance ratio of open models now.

  • Mistral also released other models in the past few weeks:

    • NeMo - by far their best small open model. 12B parameters, very large 128k context length, strong performance in 9 languages, and a more efficient tokenizer; built in collaboration with NVIDIA

    • Research purposes only, commercial use prohibited:

      • Codestral Mamba - a cutting-edge code generation model supporting 80+ programming languages. Uses the Mamba2 architecture, offering fast code generation with context windows up to 256k tokens.

      • MathÎŁtral - a specialized 7B model designed for math reasoning and scientific discovery

CrowdStrike + Windows outage

  • Last Friday the world turned blue, with 8.5M Windows screens across airports, hospitals, banks etc. showing the infamous Windows Blue Screen of Death, disrupting 5k+ flights and causing an estimated $5B+ in losses

  • CRWD stock lost 25%+ ($20B+ in market cap) over the past few days

  • Root cause was a buggy update to the Falcon sensor software. Ideally, an app issue would only crash the app: normally only the operating system provider has access to the OS kernel, which is protected from 3rd-party apps. Unfortunately, in 2009, the EU forced Microsoft to allow critical apps (e.g. security software) access to the protected Windows kernel. The CrowdStrike update bug tried to access a restricted place in the Windows kernel, causing the entire OS to crash. In Apple’s case, 3rd parties don’t have any access to the macOS kernel.

  • There are typically release protocols in place (multiple levels of checklists, system integration testing, staged rollout etc.) to prevent these types of catastrophic updates. The testing tools didn’t catch it. This was a systemic failure by CrowdStrike & Microsoft.
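
For the curious, the staged-rollout idea is easy to sketch: push the update to a small slice of the fleet, watch crash telemetry, and only expand if it stays healthy. The stage sizes, threshold, and deploy/crash_rate hooks below are illustrative, not CrowdStrike’s actual process.

```python
# Sketch of canary/staged-rollout gating: expand only while crash
# telemetry from the just-updated slice stays below a threshold.
STAGES = [0.001, 0.01, 0.10, 1.00]   # fraction of fleet per stage (illustrative)
MAX_CRASH_RATE = 0.0001              # abort threshold (illustrative)

def roll_out(update, fleet, deploy, crash_rate):
    """deploy(update, hosts) pushes the update; crash_rate(hosts) reads telemetry."""
    done = 0
    for frac in STAGES:
        target = fleet[done:int(len(fleet) * frac)]
        deploy(update, target)
        if crash_rate(target) > MAX_CRASH_RATE:
            raise RuntimeError("canary failed -- halting rollout")
        done = int(len(fleet) * frac)
    return done  # number of hosts successfully updated
```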

  • Facepalm moment - they are offering $10 Uber Eats vouchers as an apology.

Other GenAI models/news..

  • Apple released DCLM 7B - a truly open-source LLM, based on OpenELM, trained on 2.5T tokens with an MMLU score of 63.72 (better than Mistral 7B). It’s one of the most open models in the industry, sharing the training data, full rights, and full transparency into training methods. Paper

  • Hugging Face released SmolLM models (135M, 360M, & 1.7B) capable of running directly in the browser; they beat Qwen 1.5B, Phi 1.5B and more. Trained on just a 650B-token open dataset.

  • Salesforce released xLAM 1.35B & 7B Large Action Models, along with a 60K-example instruction fine-tuning dataset for research. The 7B model scores 88.24% on BFCL, and the 1.35B scores 78.94%

  • Google DeepMind AI wins silver medal at Intl. Math Olympiad.

  • Elon Musk announced xAI is building the world’s most powerful AI cluster, with 100,000 liquid-cooled H100s on a single RDMA fabric. For reference, Llama 3.1-405B was trained on 16k H100s for ~30M GPU-hours. Grok-2 is expected to release in August, and Grok-3 by December.

  • Microsoft introduced SpreadsheetLLM, which transforms Excel files into more LLM-friendly formats and drastically improves accuracy.
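
The core idea, serializing a grid into compact text an LLM can reason over, is easy to approximate with pandas. This naive markdown dump is just an illustration of the concept, not Microsoft’s actual encoding (which adds structural compression, anchor cells, etc.); the file path and question are placeholders.

```python
# Naive illustration: flatten a spreadsheet into a text table for an LLM.
# SpreadsheetLLM's real encoding is far more sophisticated.
import pandas as pd

df = pd.read_excel("report.xlsx", sheet_name=0)  # placeholder path
prompt = (
    "Here is a spreadsheet as a markdown table:\n\n"
    + df.to_markdown(index=False)
    + "\n\nQuestion: what is the largest quarter-over-quarter change?"
)
print(prompt)  # feed this to any chat model
```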

  • NVIDIA released RankRAG, a framework that has a single LLM optimize both context ranking and answer generation, improving RAG accuracy.
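
The pattern behind it: the same LLM first scores each retrieved passage for relevance, then answers using only the top-ranked ones. A rough sketch, where llm is a placeholder for any chat-completion call (this is not NVIDIA’s actual framework code):

```python
# Rough sketch of rank-then-generate: one LLM both ranks retrieved
# passages and writes the final answer. `llm` is a placeholder callable
# that takes a prompt string and returns the model's text response.
def rank_rag_answer(llm, question: str, passages: list[str], k: int = 3) -> str:
    scored = []
    for p in passages:
        score = float(llm(
            f"Rate 0-10 how relevant this passage is to the question.\n"
            f"Question: {question}\nPassage: {p}\nReply with a number only."
        ))
        scored.append((score, p))
    top = [p for _, p in sorted(scored, reverse=True)[:k]]
    context = "\n\n".join(top)
    return llm(f"Answer using only this context:\n{context}\n\nQuestion: {question}")
```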

  • CharXiv evaluates how well multimodal LLMs can understand and interpret charts. It has 2,323 diverse charts from scientific papers and tests LLMs on (1) descriptive questions about examining basic chart elements, and (2) reasoning questions that require synthesizing information across complex visual elements in the chart.

--------
🔔 If you like such content, I encourage you to connect on my LinkedIn ♻ Recommend your friends to subscribe to this free ‘AI with Whiskey’ newsletter
--------