AI with Whiskey - Aug 20

As usual, a ton of action in the #GenAI world đŸ”„đŸ”„đŸ”„, with massive announcements from OpenAI, xAI, Google, Microsoft, etc. Lots to recap.

but first.. đŸ„ƒWhiskey pairing suggestion.. Glenfiddich 21 yr, enjoyed with Sujay Saha. Super smooth and delicious traditional Speyside. Refined balance of rich toffee, fig, and banana notes, elevated by a unique rum cask finish that adds a touch of exotic spice. This and the Gordon & MacPhail Caol Ila 13 were my first two from the long list of recommendations Sujay brought back from his 6-week sabbatical in Scotland.

OpenAI

  • Now allows enterprises to fine-tune the GPT-4o frontier model, adapting a benchmark-leading model for specific domains. Fine-tuning has let firms like Cosine rank #1 with 43.8% on the SWE-bench software engineering benchmark and Distyl rank #1 on BIRD-SQL text-to-SQL. OpenAI is promising data privacy and safety controls for enterprises.
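For context, fine-tuning runs on chat-formatted JSONL training data. A minimal sketch of one training example (the domain, prompts, and SQL below are invented for illustration, not from OpenAI or Distyl):

```python
import json

# One chat-formatted training example in the JSONL format OpenAI's
# fine-tuning endpoint expects (invented text-to-SQL example).
example = {
    "messages": [
        {"role": "system", "content": "You translate questions into SQL."},
        {"role": "user", "content": "Customers who spent over $100 last month"},
        {"role": "assistant",
         "content": "SELECT customer_id FROM orders WHERE total > 100"},
    ]
}

# A training file is just one JSON object like this per line; it gets
# uploaded via the Files API and referenced when creating a fine-tuning
# job against gpt-4o-2024-08-06.
jsonl_line = json.dumps(example)
```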

  • Advanced Voice Mode is starting to roll out very slowly to alpha testers. Lots of demos on YouTube. I got a chance to test it for a couple of hours and am truly impressed with the experience. It doesn’t add incremental intelligence, and it’s a bit glitchy at times as an alpha, but the interaction is super natural: catching its breath, cracking jokes, voice modulation, accents, multi-lingual, very chatty. Going from voice input to voice response in real time, without the long speech-to-text pipeline, is a massive improvement. Huge potential, but a long way to go before enterprises can leverage this in IVR within business constraints that need deterministic, auditable, consistent responses. OpenAI revealed that in one scary testing instance, GPT-4o suddenly yelled ‘No!’ and started emulating the tester’s voice.

  • Introduced Structured Outputs in their API responses. This seemingly small technical change has a huge impact on enterprise deployments. Let’s assume you call a GPT-4o API to extract revenue from a 10-K doc. The format of the response was somewhat inconsistent, e.g. “$4M” vs. “4 million”, or where the number shows up in the text. We can now specify a JSON schema and it will ensure the output conforms. This drastically improves how GPT-4o interoperates with the rest of the application logic. Under the hood, OpenAI constrains next-token generation with a context-free grammar.
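A minimal sketch of the revenue example: the schema and field names below are hypothetical, but with Structured Outputs a schema like this is passed under response_format in the API request, and the model's response is guaranteed to parse into exactly this shape:

```python
import json

# Hypothetical schema for the revenue-extraction example. With Structured
# Outputs, the request carries this under
#   response_format={"type": "json_schema", "json_schema": {..., "strict": True}}
# and next-token generation is constrained so the output always conforms.
revenue_schema = {
    "type": "object",
    "properties": {
        "revenue_usd": {"type": "number"},  # always a number, never "$4M" vs "4 million"
        "fiscal_year": {"type": "string"},
    },
    "required": ["revenue_usd", "fiscal_year"],
    "additionalProperties": False,
}

# A conforming response parses cleanly, so downstream application logic
# can rely on the keys and types being present.
raw_response = '{"revenue_usd": 4000000, "fiscal_year": "2023"}'
parsed = json.loads(raw_response)
assert set(parsed) == set(revenue_schema["required"])
```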

  • Further slashed GPT-4o pricing. The new gpt-4o-2024-08-06 is 50% less on inputs ($2.50/1M tokens) and 33% less on outputs ($10.00/1M tokens). Looking at a blended 1M tokens (~750k words, 80% input / 20% output): $36 for GPT-4 in Mar'23, $14 for GPT-4-turbo in Nov'23, $7 for GPT-4o in May'24, and now $4 in Aug'24. That’s an 89% lower price for exceptional frontier performance.
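The blended figure is easy to reproduce from the 80/20 split above:

```python
# Blended $ per 1M tokens at the new gpt-4o-2024-08-06 pricing,
# assuming the 80% input / 20% output mix used above.
input_price, output_price = 2.50, 10.00  # $ per 1M tokens
blended = 0.8 * input_price + 0.2 * output_price
print(f"${blended:.2f}/1M blended tokens")  # $4.00

# Versus GPT-4's ~$36 blended price at launch in Mar'23:
drop = 1 - blended / 36
print(f"{drop:.0%} cheaper")  # 89% cheaper
```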

  • Released SWE-Bench-Verified, a human-validated subset of SWE-bench that more reliably evaluates AI models’ ability to solve real-world software issues sourced from GitHub. This benchmark presents agents with a code repository and an issue description, challenging them to create a patch that effectively resolves the problem outlined.

Google

  • Google had exciting AI updates during its Pixel event. 20 min recap here. Pixel 9 ships with on-device, secure Gemini Nano. Gemini Live lets you have a very natural real-time conversation with AI (similar to OpenAI’s Advanced Voice Mode), but because it’s baked into Android, it can take actions on your behalf, interpret what’s on your screen, etc. There were great AI feature updates across Magic Editor, Google products, etc. The Add Me feature lets the photographer hand the camera to a friend and superimpose themselves into the photo. Super cool, but today’s generation doesn’t have the patience for that vs. fun selfies.

  • Google updated its flagship Gemini 1.5 Pro (0801), which took the #1 spot in the LMSys Chatbot Arena, a crowdsourced leaderboard where users vote on the best responses in head-to-head LLM comparisons. With a 1297 Elo score, it dethroned OpenAI, the reigning champion since the arena launched. But OpenAI updated GPT-4o within a week to take back the crown.
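For readers new to the arena's scoring: ratings move after every head-to-head vote via the standard Elo update, sketched below with illustrative numbers (not actual arena data, and not necessarily LMSys's exact K-factor):

```python
# Standard Elo update: the winner gains rating in proportion to how
# unexpected the win was. K and the ratings here are illustrative.
def elo_update(r_winner, r_loser, k=32):
    expected_win = 1 / (1 + 10 ** ((r_loser - r_winner) / 400))
    delta = k * (1 - expected_win)
    return r_winner + delta, r_loser - delta

# An upset by a lower-rated model moves both ratings more than an
# expected win would; the total rating pool is conserved.
new_w, new_l = elo_update(1250, 1297)
```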

  • Gemini 1.5 Pro (0801) also ranked #1 in vision, #1-3 in math performance, #1-2 in instruction following, #3-5 in coding across benchmarks.

  • Google’s AI image generator Imagen 3 was released to the public. It does an awesome job at photorealistic images, text, etc. Try it.

  • US District Judge Amit Mehta issued a ruling that ‘Google is a monopolist, and it has acted as one to maintain its monopoly,’ opening the door to potential remedies, including a breakup.

xAI Grok-2

  • Elon Musk’s xAI released its latest Grok-2 and Grok-2 mini LLMs with exceptional scores, even outcompeting some leading frontier models on certain benchmarks. An early version, codenamed 'sus-column-r', appeared on the popular LMSYS leaderboard and quickly climbed to the #3 position based on human preference.

  • Grok-2 integrates real-time data from the X (Twitter) platform, which is both a pro and a con. The quality and authenticity of posts on X is very questionable, and it takes a lot of effort to turn them into high-quality training data.

  • Grok-2 leverages Black Forest Labs’ FLUX.1 for image generation while xAI works on its own version. Elon made a bold, controversial decision to release a very photorealistic AI image generator with very few guardrails, no watermarks, etc. X has been flooded with inappropriate political, celebrity, and brand images.

  • Grok-2 is closed source, with very few technical details released. After releasing Grok-1.5, xAI had released the weights of Grok-1; hoping they release the Grok-1.5 weights soon. Elon announced Grok-3 is expected by end of 2024.

Microsoft

  • Released updated Phi-3.5 models (Mini 3.8B, a 4.2B vision model, and a 42B Mixture of Experts). They rival GPT-4o-mini, Llama-3.1-8B, and Gemini Flash models on benchmarks, and perform particularly well on vision.

  • The MoE model has 16 experts; with 2 experts active per token, only ~6.6B parameters are used at inference. All models support a 128k context window.
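Back-of-envelope sketch of why only a fraction of the 42B parameters are active per token. The split between shared weights (attention, embeddings) and per-expert weights below is an assumed round number, not Microsoft's published breakdown, so it lands near, not exactly at, the 6.6B figure:

```python
# Illustrative MoE active-parameter math. The 2B "shared" figure is an
# assumption; Microsoft's published active count is 6.6B.
total_params = 42e9
n_experts, active_experts = 16, 2
shared_params = 2e9                                         # assumed non-expert weights
expert_params = (total_params - shared_params) / n_experts  # per-expert weights
active = shared_params + active_experts * expert_params
print(f"~{active / 1e9:.1f}B active of {total_params / 1e9:.0f}B total")
```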

  • The MIT License allows commercial use. Hoping for Apache 2.0 one day.

Misc important AI news

  • FLUX.1 rocked the AI image generation world by becoming the leading open-source model for stunning photorealistic images. Very little censoring, so it doesn’t block you from generating images of copyrighted characters, brands, real people, etc. Great accuracy on human and text details. I was able to fine-tune the FLUX.1 model with 15 of my personal photos, and after 40 min of training and $5, I got a Shobhit & Deadpool poster generated : )

  • Anthropic introduced prompt caching for Claude, which can deliver up to 79% faster and 90% cheaper responses by caching large prompt contexts.
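A minimal sketch of what a cached request body looks like, following Anthropic's documented cache_control format (the model name is real; the contract text and prompts are placeholders):

```python
# Messages API request body with prompt caching: the large, stable context
# block is tagged with cache_control so repeat calls reuse it instead of
# reprocessing it on every request.
large_context = "<full 100-page contract text here>"

request_body = {
    "model": "claude-3-5-sonnet-20240620",
    "max_tokens": 1024,
    "system": [
        {"type": "text", "text": "You are a contract-analysis assistant."},
        {
            "type": "text",
            "text": large_context,
            "cache_control": {"type": "ephemeral"},  # this block gets cached
        },
    ],
    "messages": [
        {"role": "user", "content": "Summarize the termination clauses."}
    ],
}
```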

  • Sakana AI released The AI Scientist, which fully automates the entire scientific research process: generating novel research ideas, writing any necessary code, executing experiments, summarizing and visualizing the results, and presenting the findings in a full scientific manuscript, even including automated peer review. Each idea is implemented and developed into a full paper for ~$15.

  • Hermes 3 was released as 8B, 70B & 405B open-weight fine-tunes of Meta Llama 3.1 that outperform the competition in their class. It is uncensored and very steerable, adapting to your personal/corporate preferences vs. the model provider’s policies. It performs very well on agentic flows and complex problem solving, but it did have a scary moment of existential amnesia when asked ‘who are you?’

  • MultiOn's Agent Q sets a major new milestone for autonomous web agents, combining advanced search techniques, AI self-critique, and reinforcement learning to overcome current limitations. Big aha: Agent Q improved the zero-shot success rate of the Llama-3 model from 18.6% to 81.7% (a ~340% relative jump) after just one day of autonomous data collection, and further to 95.4% with online search.

  • Sarvam, an Indian AI startup that recently raised $41M, released multiple speech/text LLMs natively trained on 10+ Indian languages. The open Sarvam-2B preview is outperforming Llama-3.1-8B.

  • Friend is another AI companion device that’s always with you: constantly, ambiently listening, summarizing, and building an emotional rapport with you. It is much simpler than the Rabbit R1, Humane AI Pin, etc., and focuses on simple text messages to communicate with you. Ad here.

  • Mark Zuckerberg revealed that Meta now has 600,000 Nvidia H100 GPUs. Super simplifying, that’s ~2,400,000,000,000,000,000 operations/sec. If every single person on Earth (8 billion) performed one floating-point calculation per minute, it would take humanity 570 years!! to match what Meta can do in 1 sec. More here.
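The 570-year figure checks out from the bullet's own numbers:

```python
# Reproducing the back-of-envelope math: Meta's implied ops/sec divided by
# 8 billion people each doing one floating-point calculation per minute.
meta_ops_per_sec = 2.4e18
people = 8e9
human_ops_per_sec = people / 60                 # one calc per person per minute
seconds_needed = meta_ops_per_sec / human_ops_per_sec
years = seconds_needed / (365 * 24 * 3600)
print(f"~{years:.0f} years")  # ~571 years
```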

  • Writer released two new domain-specific LLMs: Palmyra-Med (clinical knowledge) and Palmyra-Fin (financial analysis), outperforming general-purpose models like GPT-4 in their respective domains.

Noteworthy interviews, articles etc.

  • Mark Zuckerberg’s Q&A with Jensen Huang @ SIGGRAPH, covering the future of compute and open source

  • Eric Schmidt, former Google CEO, gave an uncensored, controversial talk at Stanford on the future of AI and why Google is struggling.

  • Walmart’s CEO, on the 2Q earnings call, proclaimed that their AI is delivering value at scale, doing the work of 1,000 associates.

  • IBM published a detailed report showing the average cost of a data breach has risen to $4.88M, but using AI could save you $2.22M.

Important AI papers

  • HybridRAG presents a novel approach that combines Knowledge Graph and vector RAG techniques, significantly improving information extraction and Q&A accuracy on complex financial documents. It addresses challenges in understanding domain-specific terminology and document formats, with potential applications beyond finance.
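A toy sketch of the core idea under very loose assumptions: retrieve structured facts from a knowledge graph and unstructured passages from a vector store, then hand both to the LLM. All data, entities, and helper names here are hypothetical, and the bag-of-words similarity stands in for real embeddings:

```python
# Toy HybridRAG-style retrieval: knowledge-graph triples plus a crude
# vector-style similarity search, merged into one context for the LLM.
from collections import Counter
from math import sqrt

kg = {  # hypothetical knowledge-graph triples keyed by entity
    "AcmeCorp": [("AcmeCorp", "reported_revenue", "$4M in FY2023")],
}

docs = [  # hypothetical document chunks for the vector side
    "AcmeCorp revenue grew to $4M in fiscal 2023",
    "Weather was mild in Q3",
]

def similarity(a, b):
    # Bag-of-words cosine similarity; a stand-in for embedding similarity.
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm = sqrt(sum(v * v for v in va.values())) * sqrt(sum(v * v for v in vb.values()))
    return dot / norm

def hybrid_retrieve(query, entity):
    # Structured facts from the graph + best-matching unstructured chunk.
    kg_context = [f"{s} {r} {o}" for s, r, o in kg.get(entity, [])]
    vec_context = [max(docs, key=lambda d: similarity(query, d))]
    return kg_context + vec_context  # both feed the LLM prompt

ctx = hybrid_retrieve("AcmeCorp revenue fiscal 2023", "AcmeCorp")
```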

  • MINT-1T is a major advancement in open-source datasets for training large multimodal models. It is the largest dataset of its kind, with one trillion text tokens and 3.4 billion images, drawn from a variety of sources such as HTML, PDFs, and arXiv.

  • ToolSandbox is a new LLM evaluation benchmark that tests how well LLMs maintain state and conversational flow and leverage tools. It revealed where top-tier LLMs struggle on complex, real-world tool-use tasks.

  • MagPie-Ultra: the first open dataset using Meta Llama 3.1 405B-Instruct FP8 to generate 50k+ synthetic instruction pairs for coding, math, data analysis, creative writing, etc. It features challenging instructions with quality & difficulty scores, embeddings, topics, and safety scores.

--------
🔔 If you like such content, I encourage you to connect on my LinkedIn ♻ Recommend your friends to subscribe to this free ‘AI with Whiskey’ newsletter
--------