One of my new favorite questions to ask people I meet is “how are you using AI in your daily life?”
As AI becomes both more capable and more prevalent, every person’s individual experience becomes a unique eval, particularly if they’re doing something interesting and not AI/software related. These kinds of personal, experience-based evaluations differ from traditional benchmarking in at least two ways:
- signal-rich: benchmarks are either (1) increasingly saturated and attempting to quantify vibes or (2) contaminated. Personal use is far more qualitative and contextual, but as a result it provides much richer information on the capabilities of AI systems in the messy real world
- actual contextual use: benchmarks and tests focus only on measuring capability. AlphaFold was created because of a benchmark competition, but the proliferating number of ways it is used as a tool is a sociological phenomenon that is downstream of, and separate from, the benchmark. For instance, the CodeForces dataset measures capability in competitive programming, but it does not capture the full complexity of the numerous ways coders and AI can interact within a large codebase.
Here are a few of my favorite examples of people who are doing valuable write-ups on their use of AI in their respective fields or interests, with my commentary included:
- On building a fusor from scratch with Claude
Claude as a new knowledge management system: dump in every file and piece of documentation and ask it questions! Useful for first-time hacking projects and as a way to accelerate learning. But at the end of the day, it still requires a cracked Waterloo kid to actually build the dang thing.
- On using OpenAI Deep Research for policy research and memo writing
Covers the differences between Gemini Deep Research and OpenAI Deep Research; enormous research potential, but shortcomings in LLMs over-indexing on the token corpus (which in the policy world is abundant in tokens but often lacking in substance).
- Three case studies on how AI can successfully be used in historical research right now
Transcribing and translating early modern Italian; analyzing icons in an early medical textbook; novel interpretations of historical correspondence. All fascinating case studies and positive use cases.
As the author puts it: “This makes sense: 2025 is, after all, being hailed as the year that PhD-level AI agents will flourish. I can personally report that, in the field of history, we’re already there.”
- On how they use (or don’t use) AI for their small independent hotel [h/t a Substack note]
A deep dive into what running a hotel and restaurant actually requires (lots of receipts…), down to seemingly mundane tasks. Perspective on how useful (or not) AI is for a unique set of content tasks, but also on how AI-powered search and enshittification shift how small business owners think about their online presence.
- r/math user providing context on DeepMind’s AlphaGeometry results. Also Terence Tao’s continually thoughtful commentary on the use of AI in math [recorded lecture overview, a much richer Scientific American interview, and interesting posts on Mathstodon]
Most of what is exciting/new to me in Terence’s remarks is not about AI but about Lean, the programming language for theorem proving. By formalizing mathematical propositions and building proof verification into the compiler, it lets mathematical collaboration scale much further than before (a minimal sketch of what that looks like is below). His comments on AlphaGeometry and the other uses of AI he identifies mostly clarify what “AI” in mathematics actually looks like.
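To make the Lean point concrete, here is a toy example (my own illustration, not from Tao’s lecture): the theorem below only compiles if the proof term actually proves the stated proposition, so a collaborator can build on it without re-checking it by hand.

```lean
-- Commutativity of natural-number addition, stated and proved in Lean 4.
-- If the proof term were wrong, the compiler would reject the file,
-- which is what lets large collaborations trust each other's lemmas.
theorem add_comm' (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b
```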
- Aidan Toner-Rogers’ preprint on the staged introduction of a graph neural network surrogate model[1] to the R&D lab of a large chemicals company in the US[2]
It helps the top quartile of scientists the most while not helping the bottom half of scientists at all. It shows strong results in papers, patents, and products, with increasing lags from AI adoption. Also worth noting that scientists are uniformly less satisfied with their jobs because they spend more time on experimental verification rather than ideation.[3]
- Nicholas Carlini on how he uses AI for programming
There are many accounts of how software engineers use LLMs, but Nicholas is one of the more thoughtful writers in the space (and this is also one of the more comprehensive and recent treatments).
- On using AI to categorize housing regulations at scale
With AI, we can now see (measure) the state much better. If AI is an intern, then you should expect fields that require lots of lowly paid mental labor (e.g., research assistants doing menial archival categorization) to see huge benefits; a sketch of what that looks like in practice follows below. I’d love to see a social sciences review of how AI is transforming social research.
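For a sense of why the “AI as archival intern” framing works, here is a minimal sketch of batch categorization with an LLM API. The model name, category list, and prompt are illustrative placeholders of my own, not the actual pipeline from the write-up; it assumes the OpenAI Python SDK and an `OPENAI_API_KEY` in the environment.

```python
# Minimal sketch: classifying zoning clauses with an LLM instead of a
# research assistant. Categories, model, and prompt are illustrative only.
from openai import OpenAI

client = OpenAI()

CATEGORIES = ["minimum lot size", "parking requirement", "height limit", "other"]

def categorize(regulation_text: str) -> str:
    """Ask the model to pick exactly one category for a zoning clause."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system",
             "content": "Classify the zoning clause into exactly one of: "
                        + ", ".join(CATEGORIES)
                        + ". Reply with the category name only."},
            {"role": "user", "content": regulation_text},
        ],
        temperature=0,  # deterministic output for consistent coding
    )
    return response.choices[0].message.content.strip()

# A week of menial archival categorization becomes a loop:
for clause in ["No dwelling shall exceed 35 feet in height.",
               "Each unit shall provide two off-street parking spaces."]:
    print(categorize(clause))
```

The scale advantage is the point: the same loop runs over ten clauses or ten million, which is exactly the kind of lowly paid mental labor the write-up describes.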
Share your favorite write-ups as well![4]
Of course, I am not the only person advocating for more direct, hands-on experience with AI systems over cold, narrow benchmark scoring. AI developers are increasingly partnering with institutions to create structured settings where new users can actively try AI.
A few recent examples:
- 1,000 scientists across nine Department of Energy national labs used OpenAI and Anthropic models for a day as part of an “AI jam session”[5][6]
- A Pennsylvania state government pilot with OpenAI for state employees to use AI models. 175 employees participated, almost half of whom had never used ChatGPT before.
- The AI Across America initiative, which brings together local communities and their elected members of Congress to engage directly with positive use cases for AI in their constituencies.
To view these kinds of events, often subsidized by AI model developers through free credits, merely as customer acquisition is overly simplistic. While AI companies continue to claim that AGI obviates the fundamental startup task of finding product-market fit, they cannot resist the gravitational demands of capital: they will eventually need to identify and create products that users will pay for. These events distribute the labor of product discovery, offloading it across many sectors, with institutions acting as testbeds that generate user feedback and data at scale.
What do the legacy institutions get out of this exchange? Any direct benefit from one day of their employees playing around with AI is likely to be diffuse. The real benefit is that they earn reputational equity in the hype cycle. The AI hype cycle creates a narrative premium: public and private institutions can capture some of it by adopting early, staking their claim without necessarily having to demonstrate a positive impact on their bottom line. In other words, these institutions are temporarily rejuvenated in the eyes of the market and receive the same benefit of the doubt that is afforded to still-unprofitable startups. And this is not a criticism of the process! It is arguably fundamental to the boom cycles and bubbles that drive investment in new technologies, a la Byrne Hobart and Tobias Huber.[7]

The interplay between technology disruptors and legacy institutions in bubbles is important not just for market development but for thinking about technology competition. Jeffrey Ding makes the argument that for General Purpose Technologies, we need to consider both innovation (developing new or more powerful technology) and diffusion (driving adoption of existing technologies) for national competitiveness. If benchmarks assess and drive innovation, then these kinds of convenings are one way to drive diffusion.

In this way, structured pilots for institutions to leverage AI sit at the intersection of earning a stake in speculative narratives and diffusing emerging technologies into national capability. Ultimately, it will be the countries that scale AI, in whatever form it takes, across sectors, and convert frontier capabilities into broad economic and strategic leverage, that come out ahead.

So ask your friends, your Uber drivers, your extended family members how they use AI. There has never been a more exciting (or, for a few occupations, terrifying) time to be using AI. Write up your experience. Maybe encourage a few friends to spend a weekend playing around with it together. And in so doing, you might just help us beat China.
👊🇺🇸🔥
1. A graph neural network is a more niche architecture, used mostly by materials scientists and biologists. Note that the positive results of this study have nothing to do with the LLMs being developed by OpenAI, Anthropic, etc.
2. Technically this example really belongs in the second half of this essay, but I think it is cool enough to be pulled up front.
3. This is why I continue to believe autonomous experimentation (i.e., self-driving labs) is a critical complement to realizing the full potential of AI in scientific discovery.
4. Honorable mention for the 2023 HBS, BCG, and OpenAI study, but that example is too corporate to be worth including in the list.
5. Fun fact: it was called a “jam session” to avoid rules around national labs hosting “conferences”.
6. You probably didn’t hear about this because it happened the same day as the Zelensky-Trump-Vance White House meeting.
7. https://press.stripe.com/boom