It really is that good

So OpenAI’s new o3 model just dropped, and I had to know: is it actually the best model out there right now? I didn’t want to rely on hype or vague vibes, so I built a custom benchmark: three tough, multi-step tasks that reflect real high-leverage knowledge work. I ran both o3 and Gemini 2.5 Pro through them, blind tested, and scored them with clear rubrics for reasoning, instruction-following, creativity, and language quality.

The tasks? A Civilization Simulator where the model has to build a fictional society from scratch. A Peer Review Gauntlet simulating academic publishing with a paper, adversarial review, and rebuttal. A Multimodal Mystery Box: a riddle combining text and AI-generated images to solve.

o3 crushed it. Stronger in logic, more agile in tone, better at maintaining coherence across modalities. It wasn’t perfect, but it was the best I’ve seen yet for complex knowledge work. I break it all down in my Substack post: full docs, side-by-side comparisons, detailed feedback. If you want to understand how these models actually perform on the stuff that matters (writing, analysis, synthesis, creativity), this is the post to read. And yeah, I even used o3 to help come up with the benchmark prompts. That’s how good it is. Link’s in bio. Let’s talk about what these results really mean for the future of work.

#product #productmanager #productmanagement #startup #business #openai #llm #ai #microsoft #google #gemini #anthropic #claude #llama #meta #nvidia #career #careeradvice #mentor #mentorship #mentortiktok #mentortok #careertok #job #jobadvice #future #2024 #story #news #dev #coding #code #engineering #engineer #coder #sales #cs #marketing #agent #work #workflow #smart #thinking #strategy #cool #real #jobtips #hack #hacks #tip #tips #tech #techtok #techtiktok #openaidevday #aiupdates #techtrends #voiceAI #developerlife #cursor #replit #pythagora #bolt

Duration: 159 s

2:39 Jun 08, 2025 117,200 6,693

@nate.b.jones

467 words

So ChatGPT o3 is pretty good. I tried it all day yesterday. I put up a huge Substack post on it. It is continuing to surprise me. I gave it hundreds of pages of data and I said, I am having trouble finding a pattern. There's something in my meeting style that I need to think through more carefully, because I can't find where my meetings go right and where they go wrong. And ChatGPT o3 looked through hundreds of pages of data about my meetings and said, ah, I know. And it gave me an answer that was actually really helpful. It unlocked an insight for me that I hadn't been able to find, even though I was staring at the problem and actually living the problem.

It's a profoundly insightful model. It's also a very thoughtful model. It uses tools, so it can browse the web. It can use coding tools to do mathematics. And it thinks about when and where to use those tools. It doesn't just assume it needs to. It's the first model I've engaged with that feels like you're sparring with an intellectual partner, going back and forth, and it feels like you're on equal terms. And I could go on and on. But this is a really good model.

I know not everyone has access to it, but I thought it was really important to talk about why this feels like such a big leap. For me this feels like as big a leap as we had when we first got to try ChatGPT. It's that big. It's absolutely massively better than previous models. And so I'm really excited to try it. I'm really excited because, it's a weird thing to say, but this is the dumbest the models are ever going to be. We're going to keep getting improvements from here. And I'm excited to see what that means. And I'm excited today for o3.

So if you would like to learn more about o3, I put up like a 23-page write-up on my Substack. I do these long write-ups. I test the models out carefully. I think the benchmarks are not super useful. The benchmarks are so overfitted right now, which means that everyone knows more or less what's in them, the kinds of questions asked. I don't know that the numbers are super helpful. And so I wanted to try o3 on real-world tasks, on things that challenge it to think in ways that mimic what we do at work, and see if it's actually helpful. So I actually did tests and then mapped them to real-world skills, and it's very fun. So check it out. o3 is incredible and I look forward to you having it in your app soon.
