The chance that a bridge fails, that an asteroid strikes the earth, that your child gets into Harvard, and that AI kills everyone can all be expressed as probabilities, but they are not all the same type of probability. There is a structural difference in the "probability" between a bridge collapsing with a $10^{-6}$ chance, versus saying "my P(doom) is 15%".
I know what you are probably thinking: "We've heard this before, you are going to go down some rabbit hole about how P(doom) is tribally divisive, or that it distracts from gears-level models, or maybe a Popperian argument that it is philosophically meaningless to assign probabilities to unprecedented events."
Don't worry! This post is none of those things. (If you want to double-check, feel free to skip down to the Why This is Different section.)
I am not anti-Bayesian at all. However, I am going to argue that probability estimates for unprecedented catastrophes are "non-canonical". This connects to the long tradition of formalizing Knightian uncertainty through sets of priors (Gilboa & Schmeidler, 1989), imprecise probabilities (Walley, 1991), and robust control under model misspecification (Hansen & Sargent, 2008). In comparison to these serious works my contribution is very modest and somewhat adjacent. I use algorithmic information theory and computability to characterize the structural conditions under which Bayesian updating fails to resolve framework-level disagreement. This provides a criterion for when Knightian uncertainty is irreducible by data.
First, let me be clear about my own definitions: what do I mean by a canonical probability? A canonical probability is one that is stable across reasonable scientific frameworks given the available evidence. For example, a $10^{-8}$ probability of an extinction-level asteroid impact is canonical because mechanistic models of orbital dynamics are continuously updated by routine data. This forces all reasonable scientific frameworks to converge to a similar narrow estimate.
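To make "canonical" concrete, here is a minimal simulation (with entirely made-up numbers) of the asteroid case: two frameworks start from very different priors over an event rate, but a dense, discriminative data stream forces their posteriors together.

```python
import random

random.seed(0)

# Two "frameworks" start with wildly different Beta priors over an event rate.
# Dense, discriminative data (the asteroid case) forces their posteriors together.
priors = {"framework_U": (1.0, 1.0),     # indifferent prior
          "framework_V": (20.0, 1.0)}    # heavily skewed toward high risk

true_rate = 0.001   # illustrative "canonical" event rate
n_obs = 1_000_000   # dense observation stream

hits = sum(1 for _ in range(n_obs) if random.random() < true_rate)

posterior_means = {}
for name, (a, b) in priors.items():
    # Conjugate Beta-Bernoulli update: a + hits, b + misses.
    posterior_means[name] = (a + hits) / (a + b + n_obs)

gap = abs(posterior_means["framework_U"] - posterior_means["framework_V"])
print(posterior_means, gap)
```

The point is not the specific numbers but the mechanism: when every observation carries likelihood information, prior disagreement washes out.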
However, genuinely unprecedented catastrophic risks are non-canonical because non-catastrophic data structurally refuses to wash out your priors. This creates an illusion of precise catastrophic probabilities that stems from conflating three distinct mathematical concepts related to evidence and discriminability:
- The likelihood discrimination from data: Log Bayes factors / log-likelihood ratios from empirical observations.
- The prior / encoding discrimination (Solomonoff / Kolmogorov): Algorithmic complexity penalties and description lengths.
- Provability / computability discrimination (Gödel / arithmetical hierarchy): What can be mathematically certified by a theory $F$ or decided by an algorithm.
All three can bound our convergence of posterior probability estimates, but they are not interchangeable. In fact, I will show how unprecedented risks (non-canonical events) evade all three layers of resolution. This essay proceeds through the following steps:
- Data Doesn't Resolve Ontology - defining canonical probability, the Evidential Screening Property, and why unprecedented catastrophes fail to meet the standard.
- The Canonical Boundary - assembling these into a heuristic diagnostic index $\mathcal{S}(n)$ that continuously measures how screened a given risk is.
- Computability Limits on Screening Verification - no algorithm can universally certify escape from the screened regime; the certification problem is $\Sigma_2^0$-complete.
- Transitioning to Program Space - analyzing finite models recovers the same limits via the Solomonoff Universal Prior.
- Why Rational Agents Must Disagree - resource-bounded agents must extensionally disagree on screened risks.
- The Complexity Inversion and Mixture Dispersion - mixture averaging cannot rescue consensus.
- Differential Screening: The Random Walk of Doom - evidence streams cannot force convergence under differential updating.
- Why This is Different - distinguishing this argument from standard critiques of P(doom).
- How to Restore Canonical Risk - agreed precursors with aligned causal joints can break screening and restore canonical estimates.
I. Data Doesn't Resolve Ontology
I am going to use the battle between Nick Bostrom and Eliezer Yudkowsky over AI risk as my running example because it showcases genuinely different ontological starting points. However, I should be clear that this work is really aimed at the broader class of unprecedented catastrophes. AI is just on everyone's minds these days.
Bostrom recently wrote a paper framing building AGI as a "risky surgery for a condition that will otherwise prove fatal" (everyone alive today will die if we do nothing).2 In response, Yudkowsky wrote a mocking parable called the "Asteroid of Immortality," arguing that AGI is an incoming asteroid, and Bostrom's reasoning amounts to the logic of a cult rather than a surgeon.
Bostrom and Yudkowsky can state their core arguments simply and natively in their own frameworks, even though they might be very different. When I say framework, think of it as the programming language of a worldview. A framework determines which explanations are naturally simple to express, and which are unnatural and super complex. For our example imagine two ideal rational agents using two different Universal Turing Machines as their formal description languages. Machine $U$ natively encodes Yudkowsky's alignment concepts. Machine $V$ natively encodes Bostrom's surgery-framing concepts. By standard algorithmic information theory, translating between these frameworks incurs a massive, fixed coding penalty (the Kolmogorov-Chaitin Invariance Theorem).
An important caveat is that the invariance constant $c_{UV}$ can in principle be made arbitrarily large by choosing perverse UTMs. The claim here is not about adversarially constructed languages but about realistic descriptive frameworks whose translation costs are large because they reflect genuinely different ontological commitments not because they were engineered to maximize divergence.
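As a loose illustration of framework-dependent description length (using zlib preset dictionaries as a crude stand-in for UTMs; the vocabularies and claims below are invented), the same claim costs more bits to state in a foreign vocabulary:

```python
import zlib

def desc_len(text: bytes, zdict: bytes) -> int:
    # Compressed length under a preset dictionary: a crude stand-in for
    # description length in a given "framework language".
    c = zlib.compressobj(level=9, zdict=zdict)
    return len(c.compress(text) + c.flush())

# Invented vocabularies for the two framework "languages".
dict_U = b"mesa-optimizer deceptive alignment inner objective reward hacking corrigibility"
dict_V = b"expected QALYs surgery patient counterfactual mortality intervention triage"

claim_U = b"a deceptive alignment failure via an inner objective and reward hacking"
claim_V = b"the surgery raises expected QALYs versus counterfactual mortality"

# Each claim is cheaper to state in its native language.
print(desc_len(claim_U, dict_U), desc_len(claim_U, dict_V))
print(desc_len(claim_V, dict_V), desc_len(claim_V, dict_U))
```

The gap between the two encodings is the toy analogue of the translation constant: it reflects which concepts the language treats as primitives.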
It would be good to know who is right about the AI future, so you might think the best step is to gather more data so that our priors wash out and the winning framework is clear. However, unprecedented catastrophes have a unique problem I call The Evidential Screening Property. Another year without world-ending events gives us more data, but this data fits equally well into a model that predicts it's all over and a model that predicts it's all ok.
The reason priors don't necessarily wash out is because we are not working in rigid parametric statistics. In our computationally universal spaces, we can draw on standard results in universal sequence prediction (Hutter, 2001; Neth, 2023). Neth demonstrated that predictions under different UTMs are not guaranteed to converge. His result established the qualitative possibility of permanent language dependence. The evidential screening framework adds structural conditions under which this language dependence becomes the dominant term for a specific, well-defined class of problems.
These spaces natively contain a Splice Model (a special case of the switching distributions studied by Herbster and Warmuth, and originating with grue/Goodman (1955)). Essentially, a robust framework can always have a perfectly safe past until the catastrophic future event. Prolonged safe data (not dying each year) does continuously falsify models that predict early doom (short splices) and cause the short-term doom probability to shrink, but it provides no discriminative evidence against splices predicting doom at later horizons.
The aggregate doom probability decays, but it is at a very slow rate that is bottlenecked entirely by the $K(n)$ algorithmic complexity penalty of the surviving delayed switch-times. This keeps the odds fiercely prior-dominated on any practical planning horizon. It's the turkey living the good life until the day before Thanksgiving.
To ensure a proper conditional continuation after a non-event, for a fixed, framework-independent global constant $\varepsilon \in (0,1)$, I define the doom continuation as a small mixture $\tilde\mu_{\text{doom}} := (1-\varepsilon)\mu_{\text{doom}}+\varepsilon\mu_{\text{safe}}$, guaranteeing $\tilde\mu_{\text{doom}}(x_{1:n})>0$ for the observed prefix. For the splice constructed at the current prefix length $n$, the specific choice of $\varepsilon$ does not affect the prefix likelihood beyond an $O(1)$ constant and does not alter the qualitative screening bounds.
For every finite prefix length $n$, universal model classes contain splices whose likelihood on $x_{1:n}$ is identical to the safe model. Before the splice point, the Bayes factor from data is exactly 1. The only discriminator is the $K(n)$-bit description length needed to specify the transition point in prefix-free Kolmogorov complexity (approximately $\log_2 n + 2\log_2\log_2 n + O(1)$ bits; the self-delimiting overhead is required for the prior over switch-times to be normalizable). This means that finite non-catastrophic data cannot eliminate hypotheses about doom in later periods. These surviving "doom-later" splices require larger transition times so the odds are entirely prior/encoding-dominated until the hypothesized switch-time passes. The only out is if additional agreed precursors leak into the data.
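A small simulation of the splice dynamics (using the standard $1/(t \log^2 t)$ proxy for the $2^{-K(t)}$ switch-time penalty, a truncation horizon, and an even prior split between the safe model and the splice family; all numbers illustrative):

```python
import math
from itertools import accumulate

T_MAX = 1_000_000  # truncation horizon for switch times

def switch_prior(t: int) -> float:
    # ~2^{-K(t)} proxy: the normalizable 1/(t * log^2 t) self-delimiting
    # coding penalty for specifying a switch time t.
    return 1.0 / (t * math.log2(t + 1) ** 2)

w = [switch_prior(t) for t in range(2, T_MAX)]
suffix = list(accumulate(reversed(w)))
suffix.reverse()                 # suffix[i] = prior mass on switch times >= i + 2
total_splice_mass = suffix[0]

def p_doom_ever(n: int, w_safe: float = 0.5) -> float:
    # Posterior P(some splice eventually fires | n safe observations).
    # Splices with t <= n are falsified; the survivors fit the data exactly
    # as well as the safe model, so only prior mass ratios move.
    surviving = suffix[n - 1] / total_splice_mass   # mass on t >= n + 1
    return (1 - w_safe) * surviving / (w_safe + (1 - w_safe) * surviving)

probs = [p_doom_ever(n) for n in (10, 100, 1000, 10_000)]
print(probs)
```

In this toy run, a thousandfold increase in safe observations cuts the posterior doom mass by less than an order of magnitude: the decay is bottlenecked by the coding penalty, not by likelihoods.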
To escape these potentially problematic splices, you might try to restrict your models to causally coherent ones, meaning a more realistic model based on mechanisms you observe in the world. However, this just transitions the problem to Structural Dormancy. Let's be explicit about what we observe: let $Y_t$ be ordinary observables and $D_t$ be a catastrophe indicator. If something has "perfect dormancy" this means that conditional on survival ($D_{1:n}=0$), the observational likelihoods are identical:

$$P_{\text{doom}}(Y_{1:n} \mid D_{1:n}=0) \;=\; P_{\text{safe}}(Y_{1:n} \mid D_{1:n}=0)$$
The observable data $Y$ gives no discriminative power beyond the trivial fact of survival. So the Bayes factor from observables $Y$ is exactly 1 (screened); any aggregate Bayes factor comes entirely from the survival process $D$, that is, strictly from prior assumptions about hazard timing.
A critic might attempt to further escape by claiming that their specific causal graph is "leaky" or active. The idea here is there is some subtle signal that could be found that tells you about the catastrophe. Unfortunately, the warning signs might be real, but our understanding of them is an illusion; this is the notion of Causal Underdetermination. This says that given only observational data on present-day variables, any leaky catastrophe mechanism can be compiled into a dormant mechanism by introducing latent variables that reproduce the exact same observational marginals. This compilation preserves $P(\text{observables})$ but generally does not preserve interventional semantics. It blocks identifiability from passive data, not from all conceivable interventions.
The strength of this result depends critically on the modeling language assumed. In a Turing-universal modeling language that permits arbitrary latent-variable topologies without structural penalties, the algorithmic overhead to wrap this construction is merely an $O(1)$ constant. With only a few lines of code you can invent a fake mechanism that looks exactly like the warning signs of a catastrophe without actually causing one.
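Here is that "few lines of code" claim made literal (a toy sketch with invented numbers): a leaky mechanism, where a hidden hazard drives both a warning sign and doom, compiled into a dormant mechanism with a decoy latent that reproduces the warning marginal exactly.

```python
# Leaky model: a latent hazard H drives both a visible warning W and doom.
p_h = 0.3                              # P(hazard present)
p_w_given_h, p_w_given_not_h = 0.9, 0.1
leaky_p_warning = p_h * p_w_given_h + (1 - p_h) * p_w_given_not_h
leaky_p_doom = p_h                     # the hazard eventually fires

# Dormant compilation: introduce a decoy latent Z that reproduces the SAME
# warning marginal but is causally disconnected from the doom node.
p_z = leaky_p_warning                  # decoy fires at exactly the observed rate
dormant_p_warning = p_z                # W := Z, nothing else
dormant_p_doom = 0.0                   # the doom node never activates

print(leaky_p_warning, dormant_p_warning, leaky_p_doom, dormant_p_doom)
```

Passive observation of $W$ cannot tell these two worlds apart, even though their catastrophe semantics are opposite.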
However, standard causal modeling frameworks impose structural constraints like faithfulness, bounded indegree, prohibitions on certain latent topologies, that can block this compilation or make it substantially more expensive than $O(1)$. If you adopt such constraints, the underdetermination argument weakens: some leaky models may resist dormant compilation, and passive data could in principle discriminate between them.
The evidential screening argument then rests more heavily on the splice model and on whether the practitioner's structural constraints are themselves framework-dependent choices. I believe the permissive model class is the most epistemically honest setting for analyzing unprecedented risks (where we lack strong grounds for imposing structural priors), but readers who accept faithfulness or bounded indegree as substantive constraints should note that causal underdetermination provides weaker support for screening in their setting. The splice model and dormancy arguments (which do not depend on this assumption) still apply independently.
Because the data-only likelihood ratio is structurally bounded near 1, we find ourselves in the exact domain of Charles Manski's foundational work in econometrics called Partial Identification (Manski, 2003). The data here does not force a point-estimate, it only constrains the parameter to a hugely ambiguous, prior-free identification region $[\underline{p}, \overline{p}]$. The choice of your descriptive language is just an arbitrary coordinate selected in this region.
II. The Canonical Boundary
Using the ingredients above it is possible to assemble a heuristic diagnostic index for when a risk estimate is canonical or non-canonical. We cleanly separate the two relevant quantities:
Let $L(n)$ be the data-only log-likelihood ratio between the best-fitting members of two competing model families (e.g., safe vs. doom):

$$L(n) := \left|\, \log_2 \frac{\sup_{\mu \in \mathcal{M}_{\text{safe}}} \mu(x_{1:n})}{\sup_{\nu \in \mathcal{M}_{\text{doom}}} \nu(x_{1:n})} \,\right|$$
where both families are restricted to semimeasures assigning strictly positive probability to the observed prefix. This measures how much the data favors one family's best explanation over the other's.
Let $C(n)$ be the prior coding/complexity penalty paid for splices or framework translations (e.g., $C(n) \approx K(n)$ for splice-time coding). In the splice model setting, this penalty grows slowly with $n$. However, a separate and typically larger constant also bounds discriminability: the cross-framework translation cost $c_{UV}$, which measures how many bits it costs to simulate one framework inside another. For the screening diagnostic, $c_{UV}$ is the relevant denominator because it captures the irreducible gap between competing ontologies. This is the very gap that data must overcome to produce consensus. The splice penalty $C(n)$ contributes to prior stickiness within a single framework, while $c_{UV}$ governs the between-framework disagreement that makes an estimate non-canonical.
A key parameter governing $L(n)$ is the dormancy leakage $\lambda$, defined as a uniform upper bound on the per-step conditional log-likelihood ratio across the competing model families:

$$\lambda := \sup_{t}\; \sup_{\mu \in \mathcal{M}_{\text{safe}},\, \nu \in \mathcal{M}_{\text{doom}}} \left|\, \log_2 \frac{\mu(x_t \mid x_{<t})}{\nu(x_t \mid x_{<t})} \,\right|$$
Assuming $\lambda$ is finite (which requires bounded per-step divergence between the families), each observation contributes at most $\lambda$ bits to $L(n)$, so $L(n) \le n \cdot \lambda$. Meanwhile, complexity effects like the $K(n)$ splice penalty act independently as prior stickiness $C(n)$, keeping aggregate discriminability strictly encoding-dependent.
Using Lemma 1.0, I define a heuristic diagnostic index for screening strength at observation horizon $n$ by comparing data discriminability to translation cost:

$$\mathcal{S}(n) := \frac{c_{UV}}{c_{UV} + L(n)}, \qquad L(n) \le n\lambda$$
where $c_{UV}$ is the symmetric translation cost between the competing frameworks.
This criterion is continuous. Here are a few examples:
Asteroid impact ($\mathcal{S} \approx 0$): Orbital mechanics provides dense, ongoing observations with large $\lambda$. Every tracked near-Earth object directly updates the impact parameter. $L(n)$ grows rapidly with each observation campaign, overwhelming $c_{UV}$.
Nuclear war ($\mathcal{S}$ intermediate): Close calls provide some signal about deterrence fragility, so $\lambda$ is small but nonzero. $L(n)$ grows slowly over decades of non-use. Framework disagreement yields moderate $c_{UV}$. The risk sits in the partially identified middle zone.
AI catastrophe ($\mathcal{S} \approx 1$ under dormancy): If the catastrophe mechanism does not measurably leak into present-day benchmark scores or capability evaluations, then $\lambda \approx 0$ and the Bayes factor $L(n)$ from data remains stalled. Meanwhile, $c_{UV}$ between alignment-centric and surgery-framing ontologies is massive. The estimate remains deep in the non-canonical regime.3
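For concreteness, a sketch of the diagnostic under one assumed instantiation, $\mathcal{S}(n) = c_{UV}/(c_{UV} + n\lambda)$, with entirely illustrative parameter values for the three examples above:

```python
def screening_index(n_obs: float, lam: float, c_uv: float) -> float:
    # Assumed form: data discriminability is capped at n * lambda bits, and
    # screening strength is how far that falls short of translation cost c_UV.
    return c_uv / (c_uv + n_obs * lam)

# Entirely illustrative parameters for the three examples in the text.
risks = {
    "asteroid": dict(n_obs=1e6, lam=1e-2, c_uv=100),   # dense, high-leakage data
    "nuclear":  dict(n_obs=80,  lam=2.5,  c_uv=200),   # decades of faint signal
    "ai_doom":  dict(n_obs=1e4, lam=1e-9, c_uv=5000),  # dormancy: lambda ~ 0
}
scores = {name: screening_index(**params) for name, params in risks.items()}
print(scores)
```

The qualitative ordering (asteroid near 0, nuclear intermediate, dormant AI near 1) is what the index is meant to capture; the numbers themselves are stipulated, not estimated.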
There is something important to note which is that whether a specific risk satisfies the dormancy condition is itself an empirical domain judgment, not a mathematical theorem. (See above for some good notes on empirical support for the AI question.) What I am proving is that when the dormancy condition holds no amount of additional non-catastrophic data, causal restriction, or formal verification can rescue canonical point-estimates.
III. Computability Limits on Screening Verification
So far we have reviewed the limits of the data we already have. A natural question is whether some unambiguous piece of future data could definitively break the ambiguity. Perhaps for AI it might be a "slow takeoff" scenario, where proponents suggest that as AI capabilities scale we will gather clear warning signs before any catastrophe. The data may appear, but mathematically we can't be sure to trust it: there is no algorithmic procedure that can universally certify that the evidence has permanently escaped the ambiguous zone.
The reason why is that to guarantee escape from the screening regime, your formal mathematical system $F$ (like ZFC) must formally rule out the existence of causally coherent models that remain dormant now, but will unpredictably violate safety bounds later. Using Gödel's Second Incompleteness Theorem, it is possible to construct a total computable "Wait-and-Dive" prediction model - a model that looks completely safe on the surface, but has a secret tripwire. The model waits indefinitely, but if the tripwire is ever crossed it is programmed to dive into a catastrophic prediction. To keep everything as a clean interaction between simple bits of code and the probability of disaster, let's assign integer penalty bits $m_t \in \mathbb{N}$ at each step, routing a deterministic hazard penalty through a latent catastrophe node. Most of the time, $m_t = 0$, and the model perfectly matches the safe prefix. However, if it ever discovers a proof of $\neg \mathrm{Con}(F)$, it permanently transitions, escalating the hazard penalty to permanently breach an arbitrary, unbounded data-likelihood screening threshold $B(n)$.
Assuming $F$ is consistent (aka standard mathematics isn't fundamentally broken), this condition is never met which makes the model permanently safe. Yet, for $F$ to formally certify that this specific working prediction model won't eventually break the bounds, $F$ would have to prove its own consistency. Therefore even models that look perfectly safe can harbor latent wobbles whose permanent safety is strictly unprovable in $F$.
Let $F$ be a sound formal system extending PA. There exists a strictly total, permanently safe prediction model $p^*$ whose permanent safety $F$ cannot formally certify - because doing so would require $F$ to prove its own consistency.
See Appendix, Ground 2 for the full construction.
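A toy sketch of the wait-and-dive shape (substituting a search for a Goldbach counterexample for the $\neg\mathrm{Con}(F)$ proof search; an invented stand-in, but it shares the relevant property that the tripwire is expected never to fire while certifying that requires settling an open question):

```python
def is_goldbach_counterexample(n: int) -> bool:
    # Tripwire stand-in: an even n > 2 with no two-prime decomposition would
    # play the role of a proof of not-Con(F). None is expected to exist, so
    # the model stays permanently safe, but certifying that means settling
    # an open conjecture.
    if n < 4 or n % 2:
        return False
    sieve = [True] * (n + 1)
    sieve[0] = sieve[1] = False
    for i in range(2, int(n ** 0.5) + 1):
        if sieve[i]:
            sieve[i * i :: i] = [False] * len(range(i * i, n + 1, i))
    return not any(sieve[p] and sieve[n - p] for p in range(2, n // 2 + 1))

def wait_and_dive_penalty(t: int) -> int:
    # m_t: hazard penalty bits at step t. Zero while the tripwire is silent;
    # it would escalate without bound after any counterexample were found.
    tripped = any(is_goldbach_counterexample(k) for k in range(4, 4 + 2 * t, 2))
    return t if tripped else 0

penalties = [wait_and_dive_penalty(t) for t in range(1, 100)]
print(penalties[:10], max(penalties))
```

Every finite run looks perfectly safe, exactly as the construction requires; no finite amount of observation distinguishes this model from one with no tripwire at all.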
It is crucial to keep the epistemological categories here crisp. Gödel's Incompleteness Theorem implies that a specific sound formal system $F$ cannot universally certify the permanent safety of specific wait-and-dive models. However, when we elevate the question and ask whether any algorithmic procedure can globally decide if a model has permanently escaped the screened regime, the provability barrier translates into an absolute computability limit. So "Has this prediction model permanently escaped the screening bounds?" and "Will humanity permanently lose its potential?" share the exact same logical structure! They are both saying Does there exist a point in time after which a condition permanently holds? ($\exists t \, \forall s > t$).
Obviously you can never confirm a "forever" state by watching only a finite simulation, because humanity, or the model, could theoretically recover at step $s + 1$. This places both problems at the $\Sigma_2^0$ level of the arithmetical hierarchy (one full level above the Halting Problem). They are mathematically $\Sigma_2^0$-complete. I think this shows a rather important unification: the exact same uncomputable barrier that prevents an algorithm from predicting permanent doom also prevents any computable procedure from universally certifying that evidence has escaped the screening regime.
Define $\mathrm{Escapes}(p) :\equiv \exists n_0 \forall n > n_0 (S_n > B(n))$. The set $\{p : \mathrm{Escapes}(p)\}$ is $\Sigma_2^0$-complete.
See Appendix, Ground 2 for the proof via reduction from $\mathrm{FIN}$.
IV. Transitioning to Program Space
Until now I have been focused on the limits of prediction for infinite timelines in the future (Cantor space). Now I want to pivot and instead look at the code directly. Any prediction model exists in program space meaning it is just a finite string of code on a hard drive. This is an interesting topic because analyzing a finite, closed program feels like it should be much more tractable than predicting the continuous and infinite future.
Interestingly, however, finite models face a mathematically equivalent constraint through the Solomonoff Universal Prior. By weighting classification errors based on how simple the programs are, we end up recovering the exact same limits we faced when predicting the infinite future. I prove in the appendix (Theorem 3.1) that any computable procedure attempting to correctly classify screening escape faces a novel Discrete Precision-Robustness Tradeoff.
The proof shows that to drive your error down your code must perfectly classify natively simple boundary halting cases. This strict balancing act proves that for every factor-of-two reduction in error (pushing error below $2^{-k}$), your classifier must effectively decide the $\mathbf{0}'$-halting problem up to length $k$, which information-theoretically requires the equivalent of an $\approx k$-bit prefix of Chaitin's constant. However, evaluating these limits places us in the $\Sigma_2^0$ world, where $\mathbf{0}'$ is the natural baseline oracle, and its halting probability $\Omega_U^{\mathbf{0}'}$ is 2-random. This means its prefixes are algorithmically incompressible.
This means that computation can't generate these bits, data can't supply them, and you, the programmer, must hardcode them. Your model's accuracy is permanently capped by your own arbitrary assumptions. Your final "prediction" is just a reflection of the biases you brought with you from the start.
V. Why Rational Agents Must Disagree
The lack of empirical data and the limits of computation corner us into a state that guarantees permanent disagreement. I am greatly indebted to Neth (2023) for his qualitative framework which proved that computable approximations to Solomonoff prediction can permanently disagree, by showing that every computable prior must assign probability zero to some computable sequence, thus breaking the mutual absolute continuity that guarantees convergence. Neth also explicitly connected this to Nielsen and Stewart's (2021) polarization theorem to conclude that the subjective element in UTM choice is not guaranteed to wash out. I modestly build on Neth's important findings in three respects: I derive a quantitative lower bound on the divergence (Theorem 3.1), I identify which specific models agents will disagree on (the foreign-language boundary witnesses of Theorem 3.2), and I show that the bounded-resource regime makes disagreement unavoidable rather than merely possible with positive probability.
Between any two distinct frameworks, there is an irreducible translation cost (the cross-UTM translation overhead $c_{UV}$). All real-world predictive models operate under a strict complexity budget (they only have a limited number of bits to spend on being accurate) so the model is forced to be selfish. To minimize its expected error, an optimal classifier must spend its entire limited bit-budget optimizing for its own native language. (To be abundantly clear this does not violate standard Bayesian convergence for routine events where evidence is plentiful. This is about the domain of unprecedented risk where passive data provides little to no discriminative power.)
Assuming classifiers operate rationally by minimizing expected error $\epsilon_U(\hat{E})$ subject to a strict complexity budget $K_U^{\mathbf{0}'}(\hat{E}) \le k$ (a "bounded resource regime"), each framework's optimal classifier must spend its limited bit-budget on its own language.
This forces it to neglect boundary cases that are natively simple in the foreign language because achieving accidental correctness on those cases would require more channel capacity than the bit-budget permits. Correctly classifying these foreign boundary cases natively requires compressing the algorithmic randomness of the rival framework's hardest instances without exceeding the available residual bit-budget. Because of the irreducible cross-UTM translation cost, this is mathematically impossible.
The same class of algorithmic machinery that forces optimal doom classifiers to diverge on specific computable futures forces optimal screening classifiers to extensionally disagree on specific, computable prediction models. Under the strictly bounded-resource regime (Assumption 3.1), two rational agents will evaluate the exact same causal model on the exact same data and mathematically disagree on whether it has escaped the screening bounds.
Under a bounded complexity budget where the cross-UTM translation overhead $c_{UV}$ exceeds the available residual capacity, optimal classifiers for frameworks $U$ and $V$ must extensionally disagree on at least one specific, computable prediction model.
See Appendix, Ground 3 for the proof by contradiction via the Invariance Theorem.
VI. The Complexity Inversion and Mixture Dispersion
A natural idea at this juncture might be to fall back to the Solomonoff mixture itself because by taking a weighted average over all computable models, we may hope that a diffuse mass of moderate models will anchor the aggregate P(doom), and keep the center of mass stable while reducing extreme swings. But this structural stability fails because of The Complexity Inversion.
Switching frameworks flips which type of futures are considered simple. Under Yudkowsky's Machine $U$, deceptive alignment may be concise. Under Bostrom's Machine $V$, expressing mesa-optimization may require a massive translation, while a QALY-maximizing cluster is simple.
The dynamic here ends up acting less like a balancing scale and more like an avalanche. The prior probability is strictly tied to how simple a hypothesis is to describe and changing your descriptive language changes the weight of every model in the mixture. The Mixture Probability Dispersion theorem proves that if one framework has an algorithmic advantage for a certain outcome it drags the entire bulk of moderate hypotheses in that direction. This causes the anchoring effect to fail as the gap between the two worldviews is bounded below:

$$\bigl|\, P_U(\text{doom}) - P_V(\text{doom}) \,\bigr| \;\ge\; (1-\epsilon)\,(p_h - p_l)\,\frac{1 - 2^{-k}}{1 + 2^{-k}}$$
This quantity grows with the complexity inversion $k$, approaching $(1-\epsilon)(p_h - p_l)$ for large $k$. The mixture cannot anchor the estimate to a stable midpoint.
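A minimal numerical sketch of the dispersion effect (a two-hypothesis core plus an $\epsilon$ anchor mass at 0.5; all weights illustrative), showing the gap grow toward $(1-\epsilon)(p_h - p_l)$ as the complexity inversion $k$ increases:

```python
p_h, p_l, eps = 0.9, 0.05, 0.1   # illustrative doom rates and anchor mass

def mixture_p_doom(k: int, doom_simple: bool) -> float:
    # Two-hypothesis core with 2^{-complexity} prior weights, plus a shared
    # eps-mass of "moderate" hypotheses at p = 0.5 meant to anchor the mixture.
    w_doom, w_safe = (1.0, 2.0 ** -k) if doom_simple else (2.0 ** -k, 1.0)
    core = (w_doom * p_h + w_safe * p_l) / (w_doom + w_safe)
    return (1 - eps) * core + eps * 0.5

# Gap between a doom-simple framework and a safe-simple one as the
# complexity inversion k grows.
gaps = [mixture_p_doom(k, True) - mixture_p_doom(k, False) for k in range(0, 40, 5)]
print(gaps)
```

Even with the anchor mass in place, a modest complexity inversion pulls the two mixtures most of the way to their respective extremes.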
I should briefly pause again here. I love postmodern fiction, but I am not a postmodernist or a nihilist. I don't mean to imply that math or data are fake or anything like that (which you could accidentally take away if you think the implication is "you're just saying assign random P(doom) based on your feelings and nothing matters"). I'm definitely not saying this, and the divergence is mathematically capped by the translation cost between frameworks.
VII. Differential Screening: The Random Walk of Doom
So if the prior cannot force consensus, the final remaining hope is that a continuous stream of observable evidence eventually will. However, formal epistemology shows otherwise. Nielsen and Stewart (2021) rigorously proved that rational agents whose priors fail mutual absolute continuity can permanently polarize on the exact same stream of infinite evidence. Why might this occur with unprecedented catastrophes like AI doom?
It occurs because different frameworks carve up the hypothesis space using misaligned causal joints. As a result, the exact same piece of evidence can cause them to update in opposite directions. I'll call this mechanism Differential Screening. That computable approximations to Solomonoff prediction satisfy the preconditions of Nielsen and Stewart's polarization theorem was first observed by Neth (2023), who showed via Putnam's (1963) diagonal argument that any two computable priors must differ in which sequences they assign probability zero, breaking mutual absolute continuity. What follows extends this observation to the evidential screening regime, where I formalize the inter-framework dynamics as a bounded-increment random walk and derive convergence timescales. This happens all the time with AI: discovering zero-day exploits makes one person's risk estimates go up (it is an example of uncontrollable capabilities) and makes another's risk go down (it is a warning shot that boosts regulation).
Assuming the log-odds gap evolves as a simplified symmetric random walk with independent bounded increments, I conjecture that when a substantial fraction $\alpha$ of observations are widening in this way, the log-odds difference between frameworks follows a bounded-increment random walk rather than a converging asymptote. Under this conjecture (which I have not formally proved; see footnote 4), the expected time to consensus would scale as:

$$\mathbb{E}[T_{\text{consensus}}] \;\sim\; \frac{c_{UV}}{1 - 2\alpha},$$
which diverges as $\alpha \to 0.5$. Universal Solomonoff priors sum over all computable semimeasures, so they are strictly mutually absolutely continuous. They assign positive probability to all computable hypotheses, differing initially by at most a multiplicative factor of $2^{c_{UV}}$. Therefore, by standard martingale convergence, computationally unbounded Bayesian agents must eventually converge in the infinite limit. This is a genuine mathematical guarantee, but it comes with no timeline.
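A quick simulation of the conjectured walk (assuming the log-odds gap starts at $c_{UV}$ bits, widens by one bit with probability $\alpha$, and narrows otherwise; all parameters invented):

```python
import random

random.seed(2)

def time_to_consensus(c_uv: float, alpha: float, max_steps: int = 2_000_000) -> int:
    # The log-odds gap starts at the translation cost c_UV; each observation
    # widens it by one bit (prob alpha, differential screening) or narrows it.
    gap, t = c_uv, 0
    while gap > 0 and t < max_steps:
        gap += 1 if random.random() < alpha else -1
        t += 1
    return t

c_uv = 50
avg = {}
for alpha in (0.1, 0.3, 0.45, 0.49):
    trials = [time_to_consensus(c_uv, alpha) for _ in range(20)]
    avg[alpha] = sum(trials) / len(trials)
    # Compare the empirical average to the conjectured c_UV / (1 - 2*alpha).
    print(alpha, avg[alpha], c_uv / (1 - 2 * alpha))
```

As $\alpha$ approaches 0.5 the empirical consensus times blow up, tracking the conjectured scaling: convergence is guaranteed in the limit, but the timeline becomes useless for planning.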
The Nielsen-Stewart impossibility theorem, which proves that rational agents can permanently polarize on the same evidence stream, requires a strictly stronger condition: priors that fail mutual absolute continuity. Ideal Solomonoff priors satisfy mutual absolute continuity, so the impossibility theorem does not directly apply to them.
However, the theorem becomes directly applicable the moment we model realistic agents. Any resource-bounded agent must truncate its hypothesis space, effectively assigning zero probability to hypotheses beyond its complexity budget. This truncation breaks mutual absolute continuity between agents using different frameworks, because each agent zeros out different hypotheses: precisely the foreign-language hypotheses that are too expensive to represent. Once absolute continuity fails, the Nielsen-Stewart mechanism activates and differential screening can drive permanent polarization, not merely delayed convergence. The $c_{UV}$ translation penalty determines which hypotheses get truncated and thus how severely absolute continuity is broken.4
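A minimal sketch of how truncation breaks mutual absolute continuity (the hypothesis names and complexity assignments are invented for illustration):

```python
BUDGET = 8   # bits each resource-bounded agent can spend per hypothesis

hypotheses = ["deceptive_mesa", "sharp_left_turn", "benign_scaling", "qaly_surgery"]

# Invented complexity assignments: each framework finds its native hypotheses
# cheap and the foreign ones expensive (a ~6-8 bit translation surcharge).
complexity_U = {"deceptive_mesa": 3, "sharp_left_turn": 4, "benign_scaling": 9, "qaly_surgery": 12}
complexity_V = {"deceptive_mesa": 12, "sharp_left_turn": 10, "benign_scaling": 4, "qaly_surgery": 2}

def truncated_prior(complexity: dict) -> dict:
    # Keep only affordable hypotheses, weighted 2^{-complexity}, renormalized.
    kept = {h: 2.0 ** -c for h, c in complexity.items() if c <= BUDGET}
    z = sum(kept.values())
    return {h: kept.get(h, 0.0) / z for h in hypotheses}

prior_U = truncated_prior(complexity_U)
prior_V = truncated_prior(complexity_V)

# Mutual absolute continuity fails: each agent puts zero mass exactly where
# the other puts positive mass, so merging-of-opinions no longer applies.
broken = [h for h in hypotheses if (prior_U[h] == 0.0) != (prior_V[h] == 0.0)]
print(prior_U, prior_V, broken)
```

Once each agent's truncated prior assigns zero where the other assigns positive mass, the precondition for the classic merging-of-opinions results is gone, and the Nielsen-Stewart mechanism can operate.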
VIII. Why This is Different
If you came from the introduction: welcome! And if you just read through the essay above, you might be tempted to map this argument onto existing debates. Especially because P(doom) is one type of unprecedented risk, and I used it as my running example, it is critical for me to explain why the formalism above puts us into a different bucket.
Usually critiques of P(doom) fall into one of three buckets:
- A Psychological Critique: something like "humans are bad at forecasting, poorly calibrated, and overly influenced by tribal vibes or sci-fi tropes."
- A Complexity Critique: someone says that the world is too messy, the causal graphs are too dense, and our current models are simply not gears-level enough to produce a reliable number.
- An Epistemological Critique: usually a standard frequentist or Popperian objection that genuinely unprecedented events cannot be assigned probabilities because they lack reference classes, making the exercise philosophically moot.
This essay is different because it works at multiple levels of idealization, and the a fortiori logic differs for each. Grounds 1 and 2 assume the perspective of an ideal Bayesian agent with infinite compute: if even a computationally unbounded agent cannot derive a canonical point-estimate (because data is screened and verification is $\Sigma_2^0$-hard), bounded human scientists certainly cannot either. Ground 3 shifts to a different argument: it assumes agents with finite complexity budgets and shows that the boundedness itself forces rational agents to extensionally disagree. This is not an a fortiori weakening; it is a direct characterization of the regime real agents actually inhabit. The three grounds are complementary: the first two show that infinite resources don't save you, and the third shows that finite resources actively hurt you. I am also arguing in this idiom because it is the lingua franca of the rationalist and LessWrong communities, who are the people most interested in unprecedented catastrophes as they relate to AI risk. The results I outlined in this essay differ from many other arguments because:
- The architecture is the argument: Splices block raw data, underdetermination blocks passive causal data, and uncomputability blocks verification. Together, these sequential barriers form a genuine epistemic quagmire.
- I used standard results from other fields: Thankfully, much of the hard work on this topic has already been done by real academics. I draw heavily on Manski (2003), Pearl (2000), Hutter (2001; 2003), and Nielsen & Stewart (2021), and apply their work to identify the structural boundaries of unprecedented risk.
- Uncomputability is total: The same $\Sigma_2^0$-complete barrier that prevents you from classifying doom also prevents you from classifying whether your evidence about doom is informative.
- It is not epistemic nihilism: As I mentioned previously, I am not throwing my hands up and saying, "We can't know anything, so assign whatever number feels right." The math strictly bounds the divergence by the translation cost between frameworks ($c_{UV}$). I am tearing down the illusion of single point-estimates specifically to replace them with mathematically rigorous alternatives.
There is a widespread belief that a universal prior over all computable hypotheses will eventually converge to the truth. In reality, without formal bounds there is no way to determine whether this convergence occurs on any timescale relevant to human planning, or whether structural features of unprecedented risk prevent it entirely. Each theorem I have worked on isolates a specific claimed escape route (raw evidence accumulation, causal restriction, formal verification, mixture averaging) and derives the quantitative conditions under which it fails.
Fortunately for critics, the mathematical approach makes the paper easier to refute: all you need to do is exhibit an observational leakage $\lambda$ large enough to breach the screening threshold, or a differential screening fraction $\alpha$ small enough to restore convergence within the planning horizon. If and when that happens, the framework yields canonical estimates.
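To illustrate the shape of that refutation, here is a back-of-the-envelope sketch with stand-in dynamics and numbers of my own choosing, not the essay's formal definitions: treat $\lambda$ as the leaked (unscreened) fraction of observations, give each leaked observation a fixed expected log-likelihood-ratio, and check whether the accumulated shift washes out the prior gap within the planning horizon.

```python
# Toy numerics (my own stand-in dynamics, not the essay's formal model):
# if only the leaked fraction lam of observations carries a per-observation
# log-likelihood-ratio of size KL_GAP, the expected log-odds shift after
# N observations is N * lam * KL_GAP. A critic "refutes" non-canonicality
# by exhibiting a lam large enough to wash out the prior gap in time.

import math

def log_odds_shift(n_obs, lam, kl_gap):
    """Expected movement of log posterior odds after n_obs observations."""
    return n_obs * lam * kl_gap

PRIOR_GAP = math.log(100)  # frameworks start 100:1 apart (hypothetical)
HORIZON = 10_000           # observations within the planning horizon
KL_GAP = 0.01              # discrimination per leaked observation (hypothetical)

for lam in (0.001, 0.05, 0.5):
    shift = log_odds_shift(HORIZON, lam, KL_GAP)
    verdict = "converges" if shift >= PRIOR_GAP else "stays polarized"
    print(f"lambda={lam}: shift={shift:.1f} vs gap={PRIOR_GAP:.1f} -> {verdict}")
```

Under these made-up numbers, $\lambda = 0.001$ leaves the frameworks polarized at the horizon, while $\lambda = 0.05$ already suffices for convergence; the burden on the critic is to show the real leakage sits on the right side of that line.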
IX. How to Restore Canonical Risk
The probabilities of unprecedented catastrophes are non-canonical because their precise numerical value is a syntax error: generated by compilation costs, trapped by evidential screening, only partially identified by data, and then choked by uncomputable predicates. So what are we supposed to do? Wait for the end while wringing our hands? No, but we have to be clearer about what survives the mathematics.
Find the Holy Grail: Agreed Precursors With Aligned Causal Joints
You can break the Evidential Screening Property if rival frameworks agree ex ante on an observable intermediate causal node. In other words, if Yudkowsky and Bostrom can agree that "doom is strictly preceded by an AI system autonomously acquiring \$100M of illicit computing resources" or public health experts agree that "a novel pandemic is strictly preceded by sustained human-to-human transmission", then we have an observable precursor.
There is a catch. It is not enough to agree that the precursor is necessary. Both frameworks must explicitly agree ex ante on the directional update that observation triggers, that is, precursors where the causal joints align. This is not a new concept, but it is my best guess at a constructive recommendation. The screening property is what makes P(doom) non-canonical; agreed precursors with aligned derivatives are what break the screening property. When they do align, non-event data discriminates between models that predict different precursor rates, likelihood ratios can grow, prior openings get washed out, and estimates converge! Before we can solve AI alignment, we should solve epistemic alignment among ourselves.
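Here is a minimal sketch of why aligned precursors restore convergence, with annual precursor rates I invented for illustration: once both frameworks commit ex ante to how likely the precursor is each year, every precursor-free year is discriminating evidence, and the likelihood ratio compounds.

```python
# Toy Bayesian update (rates of my own choosing): framework A predicts
# the agreed precursor at 20%/year, framework B at 1%/year. Quiet years
# -- years with no precursor -- then discriminate between them, even
# though the catastrophe itself is never observed.

def posterior_after_quiet_years(prior_a, rate_a, rate_b, years):
    """P(framework A | `years` consecutive years with no precursor),
    where each framework predicts an annual precursor probability."""
    like_a = (1 - rate_a) ** years
    like_b = (1 - rate_b) ** years
    num = prior_a * like_a
    return num / (num + (1 - prior_a) * like_b)

for years in (0, 5, 15, 30):
    p = posterior_after_quiet_years(0.5, 0.20, 0.01, years)
    print(f"{years:2d} quiet years -> P(framework A) = {p:.3f}")
```

With these illustrative rates, thirty precursor-free years drive the posterior on the high-rate framework below 1% from an even prior; the non-event data does all the work, which is exactly what screening normally forbids.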
Progress does not come from debating unobservable asymptotic mechanisms ("is it a surgery or an asteroid?") or refining subjective point-estimates. It comes from doing the difficult work of building cross-framework consensus on the observable causal nodes, and their directional updates, that necessarily precede the catastrophe. As treaties and agreements are of high interest to AI safety groups, this seems like a tractable area to focus on, and one that does not require nation-state agreements. It only requires rival labs and pundits to sit down and agree on what a specific test result will actually mean before the test is run.
The allure of a single, objective probability estimate is essentially a desire to outsource our existential fear to the comfort of a single number. It is unfortunate that math refuses to cooperate for this purpose. When dealing with unprecedented catastrophes your framework is your fate.