Devezer's Urn

In a week of incredibly annoying and inaccurate AI discourse, I’m instead drawn to write about metascientific squabbles. Engaging with AI discourse means ignoring ignorant bait articles that falsely cast academia as a unipolar axis defined by a professor in the Pacific Northwest and looking at how AI is shaping or being shaped by our social systems.

So yes, I’m intrigued by this new AI-driven political science paper by Ryan Briggs, Jonathan Mellon, and Vincent Arel-Bundock. They use large language models to scour over one hundred thousand political science papers published after 2010 and find that only 2% report null findings in the abstract. They use a statistical model to argue that this number should be 80%. As Briggs said on social media, the authors think this finding is “disastrous” for political science.

This is the latest paper that invariably appears on my timeline claiming to have scientific proof that all science is wrong. Longtime readers know I’m no fan of quantitative social science. However, I’m even less of a fan of quantitative metascience. I’ve written multiple posts about this bizarre practice. There’s no shorter path to a paper than data mining existing papers and tut-tutting a scientific community for its questionable research practices. The genre pads CVs while failing science. People keep writing these metascience papers, and LLMs make writing them infinitely easier. I guess that means I have to keep writing these blog posts.

So let me ask you, argmin reader: what should the statistical summarization of political science abstracts look like? What is the stochastic process that generates pseudorandom numbers and produces arXiv PDFs? What are the sufficient statistics for this distribution? The authors don’t say. Instead, their paper posits a simplistic statistical model that Berna Devezer pejoratively calls the “urn model of science.” Political scientists reach into a magical Polya urn to find a hypothesis. The urn contains 75% true hypotheses and 25% false hypotheses. Then they test the hypothesis using a perfectly normally distributed experiment. The null hypotheses are rejected based on the nebulous holy parameters statisticians call “power” and “size.” Under this model, where the authors set the power to 25% (a number plucked from a previous meta-scientific screed by one of the coauthors), they compute that eighty percent number I mentioned above.
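
To make the arithmetic concrete, here is a minimal sketch of the urn model’s bookkeeping, assuming the 75% prior and 25% power described above plus the conventional 5% test size, a standard choice that isn’t actually stated in this post:

    # Urn model of science: expected fraction of studies reporting a null result.
    # Assumed parameters: 75% of hypotheses drawn from the urn are true, every test
    # has 25% power, and the test size (false positive rate) is the usual 0.05.
    p_true = 0.75  # share of true hypotheses in the urn
    power = 0.25   # chance of rejecting the null when the hypothesis is true
    alpha = 0.05   # chance of rejecting the null when the hypothesis is false

    p_reject = p_true * power + (1 - p_true) * alpha  # 0.1875 + 0.0125 = 0.20
    p_null = 1 - p_reject                             # 0.80

    print(f"Expected share of null findings: {p_null:.0%}")  # 80%

The 80% figure is nothing more than the output of those three assumed parameters.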

The authors think the gap between 80% and 2% is large and evidence of widespread research malpractice. They conclude, “The publication filter that we document here is not a benign pathology. It is a structural feature of our reporting practices that threatens the credibility and accumulation of knowledge in political science.”

Now, here’s the thing. I understand that metascientists think they can just appeal to a caricatured version of Karl Popper’s philosophy of science and conclude that a finding like this refutes the idea that people are practicing good research. But the one thing every scientist should learn is the immediate counterargument to Popperian falsification, the Quine-Duhem problem. When formulating a prediction about an experimental outcome, a hypothesis never stands alone. To make any prediction, you need to append a long chain of auxiliary hypotheses about your mathematical models, your theoretical constructs, your ceteris paribus conditions, your measurement devices, and so on. A prediction rests on a logical conjunction of hypotheses, and when an observation contradicts the prediction, either the main hypothesis or any of the auxiliary clauses could be to blame.

In the case of statistical meta-science, if your toy statistical model predicts a certain curve under “ideal” research practices, and you find a different curve, it’s possible that the curve derived from undergraduate probability has nothing to do with scientific practice.

I mean, come on, the urn model of scientific practice is more insulting than the stochastic parrots model of LLMs. We don’t do research by picking random experiments independently. Test choice is informed by past practice, advisors’ tastes, measurement constraints, and the whims of journal reviewers. We certainly don’t write papers about random tests.

And we should ask ourselves why these significance tests litter social science papers. It’s an unfortunate convention that everyone knows is harmful. To first order, everyone hates null hypothesis significance tests. Most people realize that there’s no faster way to strangle a discipline than with the logically incoherent garrote of the significance test.

Unfortunately, some die-hard proponents still believe that null-hypothesis significance testing will prevent people from being fooled by things that are too good to be true. Bizarrely, 100 years of significance testing has not yet convinced them that the promise of significance testing was always too good to be true.

Indeed, we’ve known it’s too good to be true since Neyman proposed the potential outcomes model. Even in the social sciences, we’ve known the statistical testing framework is useless for over fifty years. Significance testing “is never a sufficient condition for claiming that a theory has been usefully corroborated, that a meaningful empirical fact has been established, or that an experimental report ought to be published.” The null ritual of power analyses and 0.05 rejections is incoherent, and it’s just a game evolutionarily designed to grease the publication market. As Mark Copelovitch perfectly put it, “the entire edifice causal identification champions have built over the last decade is mainly barriers to entry and primarily about methodological tastes.”

The hardcore statistical wing of metascience has strong, peculiar normative beliefs about what science should be. Somehow, we fix all of science if we preregister every possible significance test and publish all observed outcomes. This view is not scientific, of course. There is no evidence whatsoever that ornate causal identification strategies, complex regressions, and dozens of pages of robustness checks “fix” quantitative social science. And yet, the proponents disguise their irrational normative claims about mathematical statistics in a language of rationality. They claim all knowledge apparently hinges on significance tests and complex causal inference machinery.

However, as Copelovitch put it, the credibility revolution’s central tenet that the only meaningful test of causation is a randomized clinical trial, whether real or imagined, is a matter of taste. There are many confusing mathematical tools you can learn to play this charade of credibility, but it’s just a framework of expertise that crowds out alternative explanation, interpretation, and intervention.

There is an allure to the quantitative end of metascience. Scientists feel they are best equipped to clean their own house. They think the historians, qualitative sociologists, and theorists who have described the nuances, idiosyncrasies, and power structures of contemporary science are all postmodernist weirdos ready to eat up the next Sokal Hoax. And yet, maybe we need those historians, philosophers, and sociologists and their disciplinary norms to understand more clearly the normative world that mainstream metascience wants.

Listen to the Mold || Crapshots 818

From: loadingreadyrun
Duration: 1:52
Views: 3,317

The AI Vampire

Steve Yegge's take on agent fatigue, and its relationship to burnout.

Let's pretend you're the only person at your company using AI.

In Scenario A, you decide you're going to impress your employer, and work for 8 hours a day at 10x productivity. You knock it out of the park and make everyone else look terrible by comparison.

In that scenario, your employer captures 100% of the value from you adopting AI. You get nothing, or at any rate, it ain't gonna be 9x your salary. And everyone hates you now.

And you're exhausted. You're tired, Boss. You got nothing for it.

Congrats, you were just drained by a company. I've been drained to the point of burnout several times in my career, even at Google once or twice. But now with AI, it's oh, so much easier.

Steve reports needing more sleep due to the cognitive burden involved in agentic engineering, and notes that four hours of agent work a day is a more realistic pace:

I’ve argued that AI has turned us all into Jeff Bezos, by automating the easy work, and leaving us with all the difficult decisions, summaries, and problem-solving. I find that I am only really comfortable working at that pace for short bursts of a few hours once or occasionally twice a day, even with lots of practice.

Via Tim Bray

Tags: steve-yegge, ai, generative-ai, llms, ai-assisted-programming, ai-ethics, coding-agents

How Generative and Agentic AI Shift Concern from Technical Debt to Cognitive Debt

This piece by Margaret-Anne Storey is the best explanation of the term cognitive debt I've seen so far.

Cognitive debt, a term gaining traction recently, instead communicates the notion that the debt compounded from going fast lives in the brains of the developers and affects their lived experiences and abilities to “go fast” or to make changes. Even if AI agents produce code that could be easy to understand, the humans involved may have simply lost the plot and may not understand what the program is supposed to do, how their intentions were implemented, or how to possibly change it.

Margaret-Anne expands on this further with an anecdote about a student team she coached:

But by weeks 7 or 8, one team hit a wall. They could no longer make even simple changes without breaking something unexpected. When I met with them, the team initially blamed technical debt: messy code, poor architecture, hurried implementations. But as we dug deeper, the real problem emerged: no one on the team could explain why certain design decisions had been made or how different parts of the system were supposed to work together. The code might have been messy, but the bigger issue was that the theory of the system, their shared understanding, had fragmented or disappeared entirely. They had accumulated cognitive debt faster than technical debt, and it paralyzed them.

I've experienced this myself on some of my more ambitious vibe-code-adjacent projects. I've been experimenting with prompting entire new features into existence without reviewing their implementations and, while it works surprisingly well, I've found myself getting lost in my own projects.

I no longer have a firm mental model of what they can do and how they work, which means each additional feature becomes harder to reason about, eventually leading me to lose the ability to make confident decisions about where to go next.

Via Martin Fowler

Tags: definitions, ai, generative-ai, llms, ai-assisted-programming, vibe-coding

AI Doesn’t Reduce Work—It Intensifies It

Aruna Ranganathan and Xingqi Maggie Ye from Berkeley Haas School of Business report initial findings in the HBR from their April to December 2025 study of 200 employees at a "U.S.-based technology company".

This captures an effect I've been observing in my own work with LLMs: the productivity boost these things can provide is exhausting.

AI introduced a new rhythm in which workers managed several active threads at once: manually writing code while AI generated an alternative version, running multiple agents in parallel, or reviving long-deferred tasks because AI could “handle them” in the background. They did this, in part, because they felt they had a “partner” that could help them move through their workload.

While this sense of having a “partner” enabled a feeling of momentum, the reality was a continual switching of attention, frequent checking of AI outputs, and a growing number of open tasks. This created cognitive load and a sense of always juggling, even as the work felt productive.

I'm frequently finding myself with work on two or three projects running in parallel. I can get so much done, but after just an hour or two my mental energy for the day feels almost entirely depleted.

I've had conversations with people recently who are losing sleep because they're finding building yet another feature with "just one more prompt" irresistible.

The HBR piece calls for organizations to build an "AI practice" that structures how AI is used to help avoid burnout and counter effects that "make it harder for organizations to distinguish genuine productivity gains from unsustainable intensity".

I think we've just disrupted decades of existing intuition about sustainable working practices. It's going to take a while and some discipline to find a good new balance.

Via Hacker News

Tags: careers, ai, generative-ai, llms, ai-assisted-programming, ai-ethics

Comment from denubis: This is so much me.

The Coming AI Compute Crunch

Why DRAM shortages, not capital, will define AI infrastructure growth through 2027
