Eli Blee-Goldman

Unprecedented Catastrophes Have Non-Canonical Probabilities

Note: I originally wrote a version of this paper for LessWrong, a rationalist community heavily shaped by Eliezer Yudkowsky's Sequences. This community treats Bayesian reasoning as foundational to good epistemology. I used mathematical concepts native to the rationalist tradition in order to work in exactly the terms this community does. This was especially important for the math because Yudkowsky claims that if you can state a problem precisely enough to solve with infinite computing power, then you really understand the question; he treats this kind of formalization as a powerful test of whether you actually understand what you're claiming. In other words: the paper accepts the foundational concepts of Bayesian thinking entirely and shows the precise structural conditions under which Bayesian updating is bottlenecked for a specific, well-defined class of problems; conditions where the mere translation constant between frameworks becomes the dominant term on any practical planning horizon. I believe this shows real problems with the universality claim of rationalist epistemology. In this draft I have tightened the references to prior ideas, cited them more carefully, and refined some of the mathematical structure that was loose. 1

The chance of a bridge failing, of an asteroid striking the Earth, of your child getting into Harvard, and of AI killing everyone can all be expressed with probability, but they are not all the same type of probability. There is a structural difference between the "probability" of a bridge collapsing with a $10^{-6}$ chance and saying "my P(doom) is 15%".

I know what you are probably thinking: "We've heard this before; you are going to go down some rabbit hole about how P(doom) is tribally divisive, or that it distracts from gears-level models, or maybe a Popperian argument that it is philosophically meaningless to assign probabilities to unprecedented events."

Don't worry! This post is none of those things. (If you want to double-check, feel free to skip down to the Why This is Different section.)

I am not anti-Bayesian at all. However, I am going to argue that probability estimates for unprecedented catastrophes are "non-canonical". This connects to the long tradition of formalizing Knightian uncertainty through sets of priors (Gilboa & Schmeidler, 1989), imprecise probabilities (Walley, 1991), and robust control under model misspecification (Hansen & Sargent, 2008). In comparison to these serious works, my contribution is very modest and somewhat adjacent. I use algorithmic information theory and computability to characterize the structural conditions under which Bayesian updating fails to resolve framework-level disagreement. This provides a criterion for when Knightian uncertainty is irreducible by data.

First, let me be clear about my own definitions: what do I mean by a canonical probability? A canonical probability is one that is stable across reasonable scientific frameworks given the available evidence. For example, a $10^{-8}$ probability of an extinction-level asteroid impact is canonical because mechanistic models of orbital dynamics are continuously updated by routine data. This forces all reasonable scientific frameworks to converge to a similar narrow estimate.

However, genuinely unprecedented catastrophic risks are non-canonical because non-catastrophic data structurally refuses to wash out your priors. This creates an illusion of precise catastrophic probabilities that stems from conflating three distinct mathematical concepts related to evidence and discriminability:

  1. The likelihood discrimination from data: Log Bayes factors / log-likelihood ratios from empirical observations.
  2. The prior / encoding discrimination (Solomonoff / Kolmogorov): Algorithmic complexity penalties and description lengths.
  3. Provability / computability discrimination (Gödel / arithmetical hierarchy): What can be mathematically certified by a theory $F$ or decided by an algorithm.

All three can bound the convergence of our posterior probability estimates, but they are not interchangeable. In fact, I will show how unprecedented risks (non-canonical events) evade all three layers of resolution. This essay proceeds through the following steps:

Outline
  1. Data Doesn't Resolve Ontology - defining canonical probability, the Evidential Screening Property, and why unprecedented catastrophes fail to meet the standard.
  2. The Canonical Boundary - assembling these into a heuristic diagnostic index $\mathcal{S}(n)$ that continuously measures how screened a given risk is.
  3. Computability Limits on Screening Verification - no algorithm can universally certify escape from the screened regime; the certification problem is $\Sigma_2^0$-complete.
  4. Transitioning to Program Space - analyzing finite models recovers the same limits via the Solomonoff Universal Prior.
  5. Why Rational Agents Must Disagree - resource-bounded agents must extensionally disagree on screened risks.
  6. The Complexity Inversion and Mixture Dispersion - mixture averaging cannot rescue consensus.
  7. Differential Screening: The Random Walk of Doom - evidence streams cannot force convergence under differential updating.
  8. Why This is Different - distinguishing this argument from standard critiques of P(doom).
  9. How to Restore Canonical Risk - agreed precursors with aligned causal joints can break screening and restore canonical estimates.

I. Data Doesn't Resolve Ontology

I am going to use the battle between Nick Bostrom and Eliezer Yudkowsky over AI risk as a running example in this essay because it is a great example of genuinely different ontological starting points. However, I should be clear that this work is really aimed at the broader class of unprecedented catastrophes. AI is just on everyone's minds these days.

Bostrom recently wrote a paper framing building AGI as a "risky surgery for a condition that will otherwise prove fatal" (everyone alive today will die if we do nothing).2 In response, Yudkowsky wrote a mocking parable called the "Asteroid of Immortality," arguing that AGI is an incoming asteroid, and Bostrom's reasoning amounts to the logic of a cult rather than a surgeon.

Bostrom and Yudkowsky can state their core arguments simply and natively in their own frameworks, even though those frameworks are very different. When I say framework, think of the programming language of a worldview. A framework determines which explanations are natural and simple to express, and which are unnatural and convoluted. For our example, imagine two ideal rational agents using two different Universal Turing Machines as their formal description languages. Machine $U$ natively encodes Yudkowsky's alignment concepts. Machine $V$ natively encodes Bostrom's surgery-framing concepts. By standard algorithmic information theory, translating between these frameworks incurs a massive, fixed coding penalty (the Kolmogorov-Chaitin Invariance Theorem).

An important caveat is that the invariance constant $c_{UV}$ can in principle be made arbitrarily large by choosing perverse UTMs. The claim here is not about adversarially constructed languages but about realistic descriptive frameworks whose translation costs are large because they reflect genuinely different ontological commitments, not because they were engineered to maximize divergence.

It would be good to know who is right about the AI future, so you might think the best step is to gather more data so that our priors wash out and the winning framework becomes clear. However, unprecedented catastrophes have a unique problem I call The Evidential Screening Property. Another year without world-ending events gives us more data, but this data fits equally well into a model that predicts it's all over and a model that predicts it's all ok.

The reason priors don't necessarily wash out is that we are not working in rigid parametric statistics. In our computationally universal spaces, we can draw on standard results in universal sequence prediction (Hutter, 2001; Neth, 2023). Neth demonstrated that predictions based on different UTMs are not guaranteed to converge. His result established the qualitative possibility of permanent language dependence. The evidential screening framework adds structural conditions under which this language dependence becomes the dominant term for a specific, well-defined class of problems.

These spaces natively contain a Splice Model (a special case of the switching distributions studied by Herbster and Warmuth, and originating with grue/Goodman (1955)). Essentially, a robust framework can always have a perfectly safe past until the catastrophic future event. Prolonged safe data (not dying each year) does continuously falsify models that predict early doom (short splices) and cause the short-term doom probability to shrink, but it provides no discriminative evidence against splices predicting doom at later horizons.

The aggregate doom probability decays, but at a very slow rate that is bottlenecked entirely by the $K(n)$ algorithmic complexity penalty of the surviving delayed switch-times. This keeps the odds fiercely prior-dominated on any practical planning horizon. It's the turkey living the good life until the day before Thanksgiving.

To ensure a proper conditional continuation after a non-event, for a fixed, framework-independent global constant $\varepsilon \in (0,1)$, I define the doom continuation as a small mixture $\tilde\mu_{\text{doom}} := (1-\varepsilon)\mu_{\text{doom}}+\varepsilon\mu_{\text{safe}}$, guaranteeing $\tilde\mu_{\text{doom}}(x_{1:n})>0$ for the observed prefix. For the splice constructed at the current prefix length $n$, the specific choice of $\varepsilon$ does not affect the prefix likelihood beyond an $O(1)$ constant and does not alter the qualitative screening bounds.

For every finite prefix length $n$, universal model classes contain splices whose likelihood on $x_{1:n}$ is identical to the safe model. Before the splice point, the Bayes factor from data is exactly 1. The only discriminator is the $K(n)$-bit description length needed to specify the transition point in prefix-free Kolmogorov complexity (approximately $\log_2 n + 2\log_2\log_2 n + O(1)$ bits; the self-delimiting overhead is required for the prior over switch-times to be normalizable). This means that finite non-catastrophic data cannot eliminate hypotheses about doom in later periods. These surviving "doom-later" splices require larger transition times so the odds are entirely prior/encoding-dominated until the hypothesized switch-time passes. The only out is if additional agreed precursors leak into the data.
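The splice construction above can be made concrete in a few lines. A minimal toy sketch (the model functions and the 100-year horizon are illustrative choices of mine, not part of any formal result): the safe model and any splice model whose switch time lies beyond the observed prefix assign identical likelihood to the safe data, so the Bayes factor is exactly 1, and the only discriminator is the prefix-free coding cost of the switch time.

```python
import math

def safe_model(prefix):
    # Safe model: predicts "no catastrophe" (0) with probability 1 each step.
    return 1.0 if all(b == 0 for b in prefix) else 0.0

def splice_model(prefix, switch_time):
    # Splice model: identical to the safe model until switch_time, then doom (1).
    for t, b in enumerate(prefix):
        expected = 0 if t < switch_time else 1
        if b != expected:
            return 0.0
    return 1.0

def switch_time_code_length(t):
    # Prefix-free code length for the switch time: ~ log2 t + 2 log2 log2 t bits.
    return math.log2(t) + 2 * math.log2(math.log2(t))

n = 100                 # years of safe observations
prefix = [0] * n
# Any splice with switch_time > n fits the safe prefix exactly as well:
bayes_factor = splice_model(prefix, switch_time=n + 50) / safe_model(prefix)
print(bayes_factor)                      # data provides zero discrimination
print(switch_time_code_length(n + 50))   # the only penalty is prior/encoding bits
```

The data-side likelihood ratio stays pinned at 1 no matter how large $n$ grows, while the discriminating quantity (the switch-time code length) grows only logarithmically, which is the bottleneck described above.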

To escape these potentially problematic splices, you might try to restrict your models to causally coherent ones, meaning more realistic models based on mechanisms you observe in the world. However, this just shifts the problem to Structural Dormancy. Let's be explicit about what we observe: let $Y_t$ be ordinary observables and $D_t$ be a catastrophe indicator. If something has "perfect dormancy," this means that conditional on survival ($D_{1:n}=0$), the observational likelihoods are identical:

$$\mathcal{L}(Y_{1:n} \mid D_{1:n}=0, \text{doom-model}) = \mathcal{L}(Y_{1:n} \mid D_{1:n}=0, \text{safe-model})$$

The content data $Y$ gives no discrimination power beyond the trivial fact of survival. So the Bayes factor from observables $Y$ is exactly 1 (screened); any aggregate Bayes factor comes entirely from the survival process $D$, that is, strictly from prior assumptions about hazard timing.

A critic might attempt to escape further by claiming that their specific causal graph is "leaky" or active. The idea here is that there is some subtle signal waiting to be found that tells you about the catastrophe. Unfortunately, the warning signs might be real, but our understanding of them is an illusion; this is the notion of Causal Underdetermination. It says that given only observational data on present-day variables, any leaky catastrophe mechanism can be compiled into a dormant mechanism by introducing latent variables that reproduce the exact same observational marginals. This compilation preserves $P(\text{observables})$ but generally does not preserve interventional semantics. It blocks identifiability from passive data, not from all conceivable interventions.

The strength of this result depends critically on the modeling language assumed. In a Turing-universal modeling language that permits arbitrary latent-variable topologies without structural penalties, the algorithmic overhead to wrap this construction is merely an $O(1)$ constant. With only a few lines of code you can invent a fake mechanism that looks exactly like the warning signs of a catastrophe without actually causing one.
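A toy illustration of this compilation (all distributions and parameter values here are hypothetical): the $O(1)$ wrapper can literally reuse the leaky model's survival-conditioned observational distribution for the warning bits, while routing the hazard through an independent latent node. The equality of passive likelihoods then holds by construction, which is exactly why the overhead is a constant.

```python
import itertools

H0, BETA, Q, N = 0.05, 0.4, 0.5, 3   # hypothetical hazard params and horizon

def leaky_joint(w_seq):
    # Leaky mechanism: warning bit W_t raises the per-step hazard.
    p = 1.0
    for w in w_seq:
        p_w = Q if w else 1 - Q
        survive = 1 - (H0 + BETA * w)
        p *= p_w * survive
    return p  # P(W_{1:n}, survive all n steps)

# Observational likelihood under the leaky model, conditioned on survival:
total = sum(leaky_joint(w) for w in itertools.product([0, 1], repeat=N))
def leaky_cond(w_seq):
    return leaky_joint(w_seq) / total

# "Dormant compilation": a constant-size wrapper that simulates the leaky
# model and conditions on survival, so the W-marginals match exactly, while
# the hazard is routed through a latent node that W no longer influences.
def dormant_cond(w_seq):
    return leaky_cond(w_seq)  # identical observational marginals by construction

for w in itertools.product([0, 1], repeat=N):
    assert abs(leaky_cond(w) - dormant_cond(w)) < 1e-12
print("passive data cannot distinguish leaky from dormant")
```

The dormant model here is vacuously defined in terms of the leaky one; that triviality is the point. In a permissive, Turing-universal model class, "simulate the rival and condition on survival" is a legal few-line program, so passive data alone cannot penalize it.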

However, standard causal modeling frameworks impose structural constraints like faithfulness, bounded indegree, prohibitions on certain latent topologies, that can block this compilation or make it substantially more expensive than $O(1)$. If you adopt such constraints, the underdetermination argument weakens: some leaky models may resist dormant compilation, and passive data could in principle discriminate between them.

The evidential screening argument then rests more heavily on the splice model and on whether the practitioner's structural constraints are themselves framework-dependent choices. I believe the permissive model class is the most epistemically honest setting for analyzing unprecedented risks (where we lack strong grounds for imposing structural priors), but readers who accept faithfulness or bounded indegree as substantive constraints should note that causal underdetermination provides weaker support for screening in their setting. The splice model and dormancy arguments (which do not depend on this assumption) still apply independently.

Because the data-only likelihood ratio is structurally bounded near 1, we find ourselves in the exact domain of Charles Manski's foundational work in econometrics on Partial Identification (Manski, 2003). The data here does not force a point-estimate; it only constrains the parameter to a hugely ambiguous, prior-free identification region $[\underline{p}, \overline{p}]$. The choice of your descriptive language is just an arbitrary coordinate selection within this region.

II. The Canonical Boundary

Using the ingredients above it is possible to assemble a heuristic diagnostic index for when a risk estimate is canonical or non-canonical. We cleanly separate the two relevant quantities:

Let $L(n)$ be the data-only log-likelihood ratio between the best-fitting members of two competing model families (e.g., safe vs. doom):

$$L(n) := \left|\log_2\frac{\sup_{\mu \in \mathcal{M}_{\text{safe}}}\mu(x_{1:n})}{\sup_{\nu \in \mathcal{M}_{\text{doom}}}\nu(x_{1:n})}\right|$$

where both families are restricted to semimeasures assigning strictly positive probability to the observed prefix. This measures how much the data favors one family's best explanation over the other's.

Let $C(n)$ be the prior coding/complexity penalty paid for splices or framework translations (e.g., $C(n) \approx K(n)$ for splice-time coding). In the splice model setting, this penalty grows slowly with $n$. However, a separate and typically larger constant also bounds discriminability: the cross-framework translation cost $c_{UV}$, which measures how many bits it costs to simulate one framework inside another. For the screening diagnostic, $c_{UV}$ is the relevant denominator because it captures the irreducible gap between competing ontologies. This is the very gap that data must overcome to produce consensus. The splice penalty $C(n)$ contributes to prior stickiness within a single framework, while $c_{UV}$ governs the between-framework disagreement that makes an estimate non-canonical.

A key parameter governing $L(n)$ is the dormancy leakage $\lambda$, defined as a uniform upper bound on the per-step conditional log-likelihood ratio across the competing model families:

$$\lambda := \sup_{t,\mu,\nu} \left|\log_2 \frac{\mu(x_t \mid x_{<t})}{\nu(x_t \mid x_{<t})}\right|$$

Assuming $\lambda$ is finite (which requires bounded per-step divergence between the families), each observation contributes at most $\lambda$ bits to $L(n)$, so $L(n) \le n \cdot \lambda$. Meanwhile, complexity effects like the $K(n)$ splice penalty act independently as prior stickiness $C(n)$, keeping aggregate discriminability strictly encoding-dependent.

Using Lemma 1.0, I define a heuristic diagnostic index for screening strength at observation horizon $n$ by the ratio of data discriminability to translation cost:

$$\mathcal{S}(n) = 1 - \min\!\left(1,\; \frac{L(n)}{c_{UV}}\right)$$

where $c_{UV}$ is the symmetric translation cost between the competing frameworks.

This criterion is continuous. Here are a few examples:

Asteroid impact ($\mathcal{S} \approx 0$): Orbital mechanics provides dense, ongoing observations with large $\lambda$. Every tracked near-Earth object directly updates the impact parameter. $L(n)$ grows rapidly with each observation campaign, overwhelming $c_{UV}$.

Nuclear war ($\mathcal{S}$ intermediate): Close calls provide some signal about deterrence fragility, so $\lambda$ is small but nonzero. $L(n)$ grows slowly over decades of non-use. Framework disagreement yields moderate $c_{UV}$. The risk sits in the partially identified middle zone.

AI catastrophe ($\mathcal{S} \approx 1$ under dormancy): If the catastrophe mechanism does not measurably leak into present-day benchmark scores or capability evaluations, then $\lambda \approx 0$ and the Bayes factor $L(n)$ from data remains stalled. Meanwhile, $c_{UV}$ between alignment-centric and surgery-framing ontologies is massive. The estimate remains deep in the non-canonical regime. 3
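As a sketch of how the diagnostic behaves across these three regimes (the numeric magnitudes below are purely illustrative placeholders of mine, not measured values):

```python
def screening_index(L_n, c_UV):
    # S(n) = 1 - min(1, L(n)/c_UV): 0 = fully canonical, 1 = fully screened.
    return 1 - min(1.0, L_n / c_UV)

# Hypothetical magnitudes (illustrative only):
cases = {
    "asteroid":    dict(L_n=5000.0, c_UV=200.0),    # dense orbital data swamps translation cost
    "nuclear war": dict(L_n=60.0,   c_UV=150.0),    # sparse close-call signal, moderate c_UV
    "AI doom":     dict(L_n=0.5,    c_UV=10000.0),  # dormant mechanism, massive ontology gap
}
for name, c in cases.items():
    print(name, round(screening_index(**c), 3))
# asteroid 0.0, nuclear war 0.6, AI doom 1.0 (to three decimals)
```

Note that $\mathcal{S}$ saturates: once $L(n)$ exceeds $c_{UV}$, gathering still more data cannot make a risk "more canonical than canonical," and conversely, under dormancy $L(n)$ barely moves, so $\mathcal{S}$ stays pinned near 1.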

It is important to note that whether a specific risk satisfies the dormancy condition is itself an empirical domain judgment, not a mathematical theorem. (See the discussion above for notes on empirical support for the AI question.) What I am proving is that when the dormancy condition holds, no amount of additional non-catastrophic data, causal restriction, or formal verification can rescue canonical point-estimates.

III. Computability Limits on Screening Verification

Above we reviewed the difficulties with current data. A natural question is whether some unambiguous piece of data might eventually arrive that would definitively break the ambiguity. Perhaps for AI it would come from a "slow takeoff" scenario, whose proponents frequently suggest that as AI capabilities scale, we will gather clear warning signs before any catastrophe. The data may appear, but mathematically we cannot be sure to trust it: there is no algorithmic procedure that can universally certify that the evidence has permanently escaped the ambiguous zone.

The reason why is that to guarantee escape from the screening regime, your formal mathematical system $F$ (like ZFC) must formally rule out the existence of causally coherent models that remain dormant now but will unpredictably violate safety bounds later. Using Gödel's Second Incompleteness Theorem, it is possible to construct a total computable "Wait-and-Dive" prediction model: a model that looks completely safe on the surface but has a secret tripwire. The model waits indefinitely, but if the tripwire is ever crossed it is programmed to dive into a catastrophic prediction. To keep everything expressed purely as an interaction between simple bits of code and the probability of disaster, assign integer penalty bits $m_t \in \mathbb{N}$ at each step, routing a deterministic hazard penalty through a latent catastrophe node. Most of the time $m_t = 0$, and the model perfectly matches the safe prefix. However, if it ever discovers a proof of $\neg \mathrm{Con}(F)$, it permanently transitions, escalating the hazard penalty to permanently breach an arbitrary, unbounded data-likelihood screening threshold $B(n)$.

Assuming $F$ is consistent (aka standard mathematics isn't fundamentally broken), this condition is never met which makes the model permanently safe. Yet, for $F$ to formally certify that this specific working prediction model won't eventually break the bounds, $F$ would have to prove its own consistency. Therefore even models that look perfectly safe can harbor latent wobbles whose permanent safety is strictly unprovable in $F$.
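A schematic of the wait-and-dive construction, with the proof search replaced by a stub (a real instance would enumerate proofs in $F$; under $\mathrm{Con}(F)$ the stub's constant-False behavior is extensionally correct, which is exactly what $F$ cannot certify about itself):

```python
def proof_of_inconsistency_found(t):
    # Stub standing in for: "step t of a proof search in F finds a proof
    # of not-Con(F)". If F is consistent, this is False for every t --
    # but F cannot prove that fact about itself.
    return False

def wait_and_dive_penalty(t):
    # Hazard penalty bits m_t: zero while dormant; escalating forever once
    # the tripwire is crossed, breaching any screening threshold B(n).
    if any(proof_of_inconsistency_found(s) for s in range(1, t + 1)):
        return t  # unbounded escalation after the (never-reached) tripwire
    return 0

# The model is total and, assuming Con(F), permanently safe:
print([wait_and_dive_penalty(t) for t in range(1, 8)])  # all zeros
```

Every finite observation window shows $m_t = 0$, so no finite simulation distinguishes this model from a genuinely safe one; certifying its permanent safety is equivalent to proving $\mathrm{Con}(F)$.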

Theorem 2.1 (Asymptotic Unprovability)

Let $F$ be a sound formal system extending PA. There exists a strictly total, permanently safe prediction model $p^*$ whose permanent safety $F$ cannot formally certify - because doing so would require $F$ to prove its own consistency.

See Appendix, Ground 2 for the full construction.

It is crucial to keep the epistemological categories here crisp. Gödel's Second Incompleteness Theorem implies that a specific sound formal system $F$ cannot universally certify the permanent safety of specific wait-and-dive models. However, when we elevate the question and ask whether any algorithmic procedure can globally decide if a model has permanently escaped the screened regime, the provability barrier translates into an absolute computability limit. So "Has this prediction model permanently escaped the screening bounds?" and "Will humanity permanently lose its potential?" share the exact same logical structure! They both ask: does there exist a point in time after which a condition permanently holds? ($\exists t \, \forall s > t$)

Obviously you can never confirm a "forever" state by watching only a finite simulation, because humanity, or the model, could theoretically recover at step $s + 1$. This places both problems at the $\Sigma_2^0$ level of the arithmetical hierarchy (one full level above the Halting Problem); they are mathematically $\Sigma_2^0$-complete. I think this reveals an important unification: the exact same uncomputable barrier that prevents an algorithm from predicting permanent doom also prevents any computable procedure from universally certifying that evidence has escaped the screening regime.

Theorem 2.3 ($\Sigma_2^0$-Completeness of Screening Escape)

Define $\mathrm{Escapes}(p) :\equiv \exists n_0 \forall n > n_0 (S_n > B(n))$. The set $\{p : \mathrm{Escapes}(p)\}$ is $\Sigma_2^0$-complete.

See Appendix, Ground 2 for the proof via reduction from $\mathrm{FIN}$.

IV. Transitioning to Program Space

Until now I have been focused on the limits of prediction for infinite timelines in the future (Cantor space). Now I want to pivot and instead look at the code directly. Any prediction model exists in program space meaning it is just a finite string of code on a hard drive. This is an interesting topic because analyzing a finite, closed program feels like it should be much more tractable than predicting the continuous and infinite future.

Interestingly, however, finite models face a mathematically equivalent constraint, which emerges through the Solomonoff Universal Prior. By weighting classification errors based on how simple the programs are, we end up recovering the exact same limits we faced when predicting the infinite future. I prove in the appendix (Theorem 3.1) that any computable procedure attempting to correctly classify screening escape faces a novel Discrete Precision-Robustness Tradeoff.

The proof shows that to drive your error down, your code must perfectly classify natively simple boundary halting cases. This strict balancing act proves that for every factor-of-two reduction in error (pushing error below $2^{-k}$), your classifier must effectively decide the $\mathbf{0}'$-halting problem up to length $k$, which information-theoretically requires the equivalent of an $\approx k$-bit prefix of Chaitin's constant. However, evaluating these limits places us in the $\Sigma_2^0$ world, where $\mathbf{0}'$ is the natural baseline oracle, and its halting probability $\Omega_U^{\mathbf{0}'}$ is 2-random. This means its prefixes are algorithmically incompressible.

This means that computation can't generate these bits, data can't supply them, and you, the programmer, must hardcode them. Your model's accuracy is permanently capped by your own arbitrary assumptions. Your final "prediction" is just a reflection of the biases you brought with you from the start.

V. Why Rational Agents Must Disagree

The lack of empirical data and the limits of computation corner us into a state that guarantees permanent disagreement. I am greatly indebted to Neth (2023) for his qualitative framework, which proved that computable approximations to Solomonoff prediction can permanently disagree by showing that every computable prior must assign probability zero to some computable sequence, thus breaking the mutual absolute continuity that guarantees convergence. Neth also explicitly connected this to Nielsen and Stewart's (2021) polarization theorem to conclude that the subjective element in UTM choice is not guaranteed to wash out. I modestly build on Neth's important findings in three respects: I derive a quantitative lower bound on the divergence (Theorem 3.1), I identify which specific models agents will disagree on (the foreign-language boundary witnesses of Theorem 3.2), and I show that the bounded-resource regime makes disagreement unavoidable rather than merely possible with positive probability.

Between any two distinct frameworks there is an irreducible translation cost (the cross-UTM translation overhead $c_{UV}$). All real-world predictive models operate under a strict complexity budget (they only have a limited number of bits to spend on being accurate), so each model is forced to be selfish. To minimize its expected error, an optimal classifier must spend its entire limited bit-budget optimizing for its own native language. (To be abundantly clear, this does not violate standard Bayesian convergence for routine events where evidence is plentiful. This is about the domain of unprecedented risk, where passive data provides little to no discriminative power.)

Assuming classifiers operate rationally by minimizing expected error $\epsilon_U(\hat{E})$ subject to a strict complexity budget $K_U^{\mathbf{0}'}(\hat{E}) \le k$ (a "bounded resource regime"), each framework's optimal classifier must spend its limited bit-budget on its own language.

This forces it to neglect boundary cases that are natively simple in the foreign language, because achieving accidental correctness on those cases would require more channel capacity than the bit-budget permits. Correctly classifying these foreign boundary cases requires compressing the algorithmic randomness of the rival framework's hardest instances without exceeding the available residual bit-budget. This is mathematically impossible because of the irreducible cross-UTM translation cost.

The same class of algorithmic machinery that forces optimal doom classifiers to diverge on specific computable futures forces optimal screening classifiers to extensionally disagree on specific, computable prediction models. Under the strictly bounded-resource regime (Assumption 3.1), two rational agents will evaluate the exact same causal model on the exact same data and mathematically disagree on whether it has escaped the screening bounds.

Theorem 3.2 (Extensional Divergence)

Under a bounded complexity budget where the cross-UTM translation overhead $c_{UV}$ exceeds the available residual capacity, optimal classifiers for frameworks $U$ and $V$ must extensionally disagree on at least one specific, computable prediction model.

See Appendix, Ground 3 for the proof by contradiction via the Invariance Theorem.

VI. The Complexity Inversion and Mixture Dispersion

A natural idea at this juncture might be to fall back on the Solomonoff mixture itself: by taking a weighted average over all computable models, we might hope that a diffuse mass of moderate models will anchor the aggregate P(doom), keeping the center of mass stable and damping extreme swings. But this structural stability fails because of The Complexity Inversion.

Switching frameworks flips which type of futures are considered simple. Under Yudkowsky's Machine $U$, deceptive alignment may be concise. Under Bostrom's Machine $V$, expressing mesa-optimization may require a massive translation, while a QALY-maximizing cluster is simple.

The dynamic here ends up acting less like a balancing scale and more like an avalanche. The prior probability is strictly tied to how simple a hypothesis is to describe and changing your descriptive language changes the weight of every model in the mixture. The Mixture Probability Dispersion theorem proves that if one framework has an algorithmic advantage for a certain outcome it drags the entire bulk of moderate hypotheses in that direction. This causes the anchoring effect to fail as the gap between the two worldviews is bounded below:

$$P_U(E \mid x_{1:n}) - P_V(E \mid x_{1:n}) \ge (1-\epsilon)(p_h - p_l)\left(1 - \frac{2}{1+2^k}\right) - \epsilon$$

This quantity grows with the complexity inversion $k$, approaching $(1-\epsilon)(p_h - p_l)$ for large $k$. The mixture cannot anchor the estimate to a stable midpoint.
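Plugging illustrative values into the dispersion bound (with hypothetical $\epsilon = 0.01$, $p_h = 0.9$, $p_l = 0.1$, which are my choices, not values derived in the text) shows the guaranteed gap climbing toward its ceiling as the complexity inversion $k$ grows:

```python
def dispersion_lower_bound(k, eps=0.01, p_h=0.9, p_l=0.1):
    # Lower bound on P_U(E | x) - P_V(E | x) under a k-bit complexity inversion.
    return (1 - eps) * (p_h - p_l) * (1 - 2 / (1 + 2**k)) - eps

for k in [1, 5, 10, 20]:
    print(k, round(dispersion_lower_bound(k), 4))
# The bound climbs toward (1 - eps) * (p_h - p_l) = 0.792 as k grows.
```

Even a modest inversion of a few bits already forces the two posteriors most of the way apart, which is why the mixture's "center of mass" offers no stable anchor.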

I should briefly pause again here. I love postmodern fiction, but I am not a postmodernist or a nihilist. I don't mean to imply that math or data are fake or anything like that (which you could accidentally take away if you think the implication is "you're just saying assign a random P(doom) based on your feelings and nothing matters"). I'm definitely not saying this, and the divergence is mathematically capped by the translation cost between frameworks.

VII. Differential Screening: The Random Walk of Doom

So if the prior cannot force consensus, the final remaining hope is that a continuous stream of observable evidence eventually will. However, formal epistemology shows otherwise. Nielsen and Stewart (2021) rigorously proved that rational agents whose priors fail mutual absolute continuity can permanently polarize on the exact same stream of infinite evidence. Why might this occur with unprecedented catastrophes like AI doom?

It occurs because different frameworks carve up the hypothesis space using misaligned causal joints. As a result, the exact same piece of evidence can cause them to update in opposite directions. I'll call this mechanism Differential Screening. That computable approximations to Solomonoff prediction satisfy the preconditions of Nielsen and Stewart's polarization theorem was first observed by Neth (2023), who showed via Putnam's (1963) diagonal argument that any two computable priors must differ in which sequences they assign probability zero, breaking mutual absolute continuity. What follows extends this observation to the evidential screening regime, where I formalize the inter-framework dynamics as a bounded-increment random walk and derive convergence timescales. This happens all the time with AI: discovering a zero-day exploit makes one person's risk estimate go up (it is an example of uncontrollable capabilities) and makes another's go down (it is a warning shot that boosts regulation).

Assuming the log-odds gap evolves as a simplified random walk with independent bounded increments, I conjecture that when a substantial fraction $\alpha$ of observations are widening in this way, the log-odds difference between frameworks follows a bounded-increment random walk rather than a converging asymptote. Under this conjecture (which I have not formally proved; see footnote 4), the expected time to consensus would scale as:

$$\mathbb{E}[\tau] \approx \frac{c_{UV}}{\lambda(1 - 2\alpha)}$$

which diverges as $\alpha \to 0.5$. Universal Solomonoff priors sum over all computable semimeasures, so they are strictly mutually absolutely continuous. They assign positive probability to all computable hypotheses, differing initially by at most a multiplicative factor of $2^{c_{UV}}$. Therefore, by standard martingale convergence, computationally unbounded Bayesian agents must eventually converge in the infinite limit. This is a genuine mathematical guarantee, but it comes with no timeline.
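The conjectured timescale can be checked against a direct simulation of the bounded-increment walk (this sketch assumes exactly the simplified increment model stated above; the parameter values are illustrative):

```python
import random

def consensus_time(c_uv, lam, alpha, rng, cap=10**6):
    # Log-odds gap starts at c_uv; each observation widens it by lam with
    # probability alpha, narrows it by lam otherwise. Return the number of
    # steps until the gap first hits 0 (consensus), capped for safety.
    gap, t = c_uv, 0
    while gap > 0 and t < cap:
        gap += lam if rng.random() < alpha else -lam
        t += 1
    return t

rng = random.Random(0)
c_uv, lam, alpha = 40.0, 1.0, 0.25
trials = [consensus_time(c_uv, lam, alpha, rng) for _ in range(2000)]
empirical = sum(trials) / len(trials)
predicted = c_uv / (lam * (1 - 2 * alpha))   # = 80 steps
print(round(empirical, 1), predicted)
```

With $\alpha = 0.25$ the net drift toward consensus is $\lambda(1 - 2\alpha) = 0.5$ bits per observation, and the empirical mean hitting time tracks the predicted $80$ steps; pushing $\alpha$ toward $0.5$ makes the simulated times blow up, matching the divergence in the formula.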

The Nielsen-Stewart impossibility theorem, which proves that rational agents can permanently polarize on the same evidence stream, requires that the agents' priors fail mutual absolute continuity. Ideal Solomonoff priors satisfy mutual absolute continuity, so the impossibility theorem does not directly apply to them.

However, the theorem becomes directly applicable the moment we model realistic agents. Any resource-bounded agent must truncate its hypothesis space, effectively assigning zero probability to hypotheses beyond its complexity budget. This truncation breaks mutual absolute continuity between agents using different frameworks, because each agent zeros out different hypotheses; precisely the foreign-language hypotheses that are too expensive to represent. Once absolute continuity fails, the Nielsen-Stewart mechanism activates and differential screening can drive permanent polarization, not merely delayed convergence. The $c_{UV}$ translation penalty determines which hypotheses get truncated and thus how severely absolute continuity is broken. 4

VIII. Why This is Different

If you came from the introduction: welcome! And if you just read through the essay above, you might be tempted to map this argument onto existing debates. Especially because P(doom) is one instance of unprecedented risk, and I used it as an example, it is critical for me to explain why the formalism above puts us into a different bucket.

Usually critiques of P(doom) fall into one of three buckets: that the number is tribally divisive, that it distracts from gears-level models, or the Popperian argument that it is philosophically meaningless to assign probabilities to unprecedented events.

This essay is different because it works at multiple levels of idealization, and the a fortiori logic differs for each. Grounds 1 and 2 assume the perspective of an ideal Bayesian agent with infinite compute: if even a computationally unbounded agent cannot derive a canonical point-estimate (because data is screened and verification is $\Sigma_2^0$-hard), bounded human scientists certainly cannot either. Ground 3 shifts to a different argument: it assumes agents with finite complexity budgets and shows that the boundedness itself forces rational agents to extensionally disagree. This is not an a fortiori weakening; it is a direct characterization of the regime real agents actually inhabit. The three grounds are complementary: the first two show that infinite resources don't save you, and the third shows that finite resources actively hurt you. I argue in these terms because they are the lingua franca of rationalists and of the LessWrong community, who are most interested in unprecedented catastrophes as they relate to AI risk. The results outlined in this essay differ from many other arguments because:

There is a widely held belief that a universal prior over all computable hypotheses will eventually converge to the truth. In reality, without formal bounds there is no way to determine whether this convergence occurs on any timescale relevant to human planning, or whether structural features of unprecedented risk prevent it entirely. Each theorem I have worked on isolates a specific claimed escape route (raw evidence accumulation, causal restriction, formal verification, mixture averaging) and derives the quantitative conditions under which it fails.

Fortunately for critics, the mathematical approach makes the paper easier to refute: all you need to do is exhibit an observational leakage $\lambda$ large enough to breach the screening threshold, or a differential screening fraction $\alpha$ small enough to restore convergence within the planning horizon. If and when that happens, the framework yields canonical estimates.

IX. How to Restore Canonical Risk

The probabilities of unprecedented catastrophes are non-canonical because their precise numerical value is a syntax error generated by compilation costs: trapped by evidential screening, only partially identified by data, and choked by uncomputable predicates. So what are we supposed to do? Wait for the end while wringing our hands? No, but we have to be clearer about what survives the mathematics.

Find the Holy Grail: Agreed Precursors With Aligned Causal Joints

You can break the Evidential Screening Property if rival frameworks agree ex ante on an observable intermediate causal node. In other words, if Yudkowsky and Bostrom can agree that "doom is strictly preceded by an AI system autonomously acquiring $100M of illicit computing resources" or public health experts agree that "a novel pandemic is strictly preceded by sustained human-to-human transmission", then we have an observable precursor.

There is a catch. It is not enough to simply agree that the precursor is necessary. Both frameworks must explicitly agree ex ante on the directional update that observation triggers; that is, we need precursors where the causal joints align. This is not a new concept, but it is my best guess at a constructive recommendation. The screening property is what makes P(doom) non-canonical; agreed precursors with aligned derivatives are what break the screening property. When they do align, non-event data discriminates between models that predict different precursor rates, likelihood ratios can grow, prior openings get washed out, and estimates converge! Before we can solve AI alignment, we should solve epistemic alignment among ourselves.
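A toy Bayesian calculation shows why an aligned precursor breaks the screening trap. The precursor rates and the 10-bit prior gap below are invented purely for illustration:

```python
import math

# Hypothetical agreed precursor: both frameworks accept ex ante that
# doom-cluster models emit the precursor at rate r_doom per step,
# safe-cluster models at rate r_safe, and that each non-event counts
# toward "safe". (Both rates are made up for this sketch.)
r_safe, r_doom = 0.001, 0.05
c_uv = 10.0    # illustrative translation gap in bits

def p_doom(prior_safe_log_odds_bits, n_nonevents):
    # Each precursor-free step adds log2((1-r_safe)/(1-r_doom)) bits of
    # evidence for "safe", in the same direction under both frameworks.
    llr = n_nonevents * math.log2((1 - r_safe) / (1 - r_doom))
    log_odds_safe = prior_safe_log_odds_bits + llr
    return 1.0 / (1.0 + 2.0 ** log_odds_safe)   # P(doom) = 1 - P(safe)

# U leans doom, V leans safe: priors sit c_uv bits apart in log-odds.
gaps = {n: abs(p_doom(-c_uv / 2, n) - p_doom(+c_uv / 2, n))
        for n in (0, 100, 300)}
print(gaps)
```

The prior opening of roughly 0.94 in probability collapses below $10^{-3}$ after a few hundred aligned non-events, which is exactly the washing-out that the screened regime forbids.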

Progress does not come from debating unobservable asymptotic mechanisms ("is it a surgery or an asteroid?") or refining subjective point-estimates. It comes from doing the difficult work of building cross-framework consensus on the observable causal nodes, and their directional updates, that necessarily precede the catastrophe. As treaties and agreements are of high interest to AI safety groups, this seems like a tractable area to focus on, and one that does not require nation-state agreements. It only requires rival labs and pundits to sit down and agree on what a specific test result will actually mean before the test is run.

The allure of a single, objective probability estimate is essentially a desire to outsource our existential fear to the comfort of a single number. It is unfortunate that math refuses to cooperate for this purpose. When dealing with unprecedented catastrophes your framework is your fate.

· · ·

Appendix: Formal Mathematics

(Note: For the sake of brevity all standard results in econometrics, causality, and computability are cited inline. Proofs are provided only for the essay's low-novelty theorems and constructions.)

Formal Framework Overview

Fix two descriptive frameworks $U, V$ (Universal Turing Machines), each inducing a Solomonoff mixture $\xi_U, \xi_V$ over computable semimeasures. Fix a data stream $x_{1:\infty}$ over some observation alphabet (possibly extended with a catastrophe symbol $D$). Fix an event $E \subseteq \Sigma^\infty$ (e.g., "doom ever happens" or "permanent lock-in") defined by a predicate $\varphi(x_{1:\infty})$. We define the framework-dependent risk estimates natively as the posteriors:

$$P_U(E \mid x_{1:n}) := \xi_U(E \mid x_{1:n}), \quad P_V(E \mid x_{1:n}) := \xi_V(E \mid x_{1:n})$$

We establish the following structural facts under explicit conditions:

Ground 1: Canonicality Failure via Evidential Screening

Identity 1.1 (Pointwise Discrepancy Decomposition)

For a shared data realization $x_{1:n}$, the cross-framework disagreement is:

$$\Delta(x_{1:n}) = P_U(E \mid x_{1:n}) - P_V(E \mid x_{1:n})$$

This pointwise gap is driven by two distinct mechanisms: (i) encoding divergence: frameworks $U$ and $V$ assign different prior weights to the same computable hypotheses, creating persistent posterior weight differences bounded by the translation cost $c_{UV}$; and (ii) predicate divergence: under evidential screening, the data $x_{1:n}$ cannot discriminate between hypothesis clusters that differ on the unobserved event $E$, leaving the encoding divergence unresolved. Ground 1 demonstrates that evidential screening keeps the encoding divergence dominant on practical horizons; Grounds 2 and 3 demonstrate that certifying its permanent collapse requires deciding a $\Sigma_2^0$-hard predicate.

Lemma 1.0 (Screening-to-Posterior Bound)

If the data-only best-fit log-likelihood ratio satisfies $L(n) \leq B$ for some bound $B$, then the pointwise posterior log-odds gap between frameworks satisfies

$$\left|\log_2 \frac{P_U(E \mid x_{1:n})}{1 - P_U(E \mid x_{1:n})} - \log_2 \frac{P_V(E \mid x_{1:n})}{1 - P_V(E \mid x_{1:n})}\right| \leq B + c_{UV} + O(1)$$

When $L(n)$ is small relative to $c_{UV}$, the posterior gap is dominated by the translation cost. This is the formal sense in which $\mathcal{S}(n) \approx 1$ implies a non-canonical estimate.
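A minimal numeric check of the screened extreme of this lemma, with an invented 12-bit translation cost: when the data contribute zero discriminating bits, the posterior log-odds gap sits at the prior gap at every sample size.

```python
# Screened regime of the bound: L(n) ~ 0, so both frameworks apply the
# same (empty) likelihood ratio and the log-odds gap stays at c_uv.
c_uv = 12.0                                # illustrative translation cost, bits
prior_u, prior_v = +c_uv / 2, -c_uv / 2    # log-odds of E under U and V

def posterior_log_odds(prior_bits, llr_bits):
    # Bayes' rule in log-odds form: posterior = prior + log-likelihood ratio.
    return prior_bits + llr_bits

gaps = []
for n in (0, 10**3, 10**6):
    shared_llr = 0.0   # screened data: no discriminating bits, at any n
    gaps.append(abs(posterior_log_odds(prior_u, shared_llr)
                    - posterior_log_odds(prior_v, shared_llr)))
print(gaps)
```

No amount of screened data moves the gap; only discriminating bits (a nonzero $L(n)$) can.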

Proposition 1.1 (Encoding Swing for Cylinder Events)

Assuming the target event $E$ is a finitely specifiable cylinder event, and the relevant conditional probabilities are strictly bounded away from $0$ and $1$, a direct application of the Invariance Theorem ensures the absolute difference in posterior log-odds between UTMs $U$ and $V$ is tightly constrained by the mutual translation overhead $c_{UV} + O(1)$. (Note: For arbitrary tail events evaluated over an infinite mixture, or if probabilities approach deterministic extremes, this strict constant-factor bound on log-odds does not automatically hold.)

Definition 1.1 (Partial Identification)

Following Manski (2003), let $L(n) := \left| \log_2 \frac{\sup_{\mu \in \mathcal{M}_{\text{safe}}}\mu(x_{1:n})}{\sup_{\nu \in \mathcal{M}_{\text{doom}}}\nu(x_{1:n})} \right|$ be the data-only log-likelihood ratio between the best-fitting members of two competing model families, both restricted to semimeasures assigning strictly positive probability to the observed prefix $x_{1:n}$. If $L(n)$ remains tightly bounded, the prior-free identification region for the target parameter stays persistently wide.

Theorem 1.1 (Splice Model Neutralization)

Applying standard results from universal prediction (Hutter, 2001; Neth, 2023), the universal prior natively contains a splice measure $\mu_{\text{splice}} = \mu_{\text{safe}} \bowtie_n \tilde{\mu}_{\text{doom}}$. To guarantee conditionability on the safe prefix, for a fixed, framework-independent global constant $\varepsilon \in (0,1)$, we define $\tilde{\mu}_{\text{doom}} := (1-\varepsilon)\mu_{\text{doom}} + \varepsilon\mu_{\text{safe}}$, ensuring $\tilde{\mu}_{\text{doom}}(x_{1:n}) > 0$. For the splice constructed at the current prefix length $n$, the specific choice of $\varepsilon$ shifts the prefix likelihood by an $O(1)$ constant and does not alter the qualitative screening bounds. Before the splice point, the data-driven log-likelihood ratio is 0. The only discriminator is the complexity penalty $C(n) \approx K(n)$ bits (approximately $\log_2 n + 2\log_2\log_2 n + O(1)$ bits under standard self-delimiting encoding; the extra overhead ensures the prior over switch-times is normalizable) for encoding the transition point.
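The normalizability claim for the switch-time prior can be checked numerically. Under a plain $\log_2 n$ code the implied masses $2^{-\log_2 n} = 1/n$ form a divergent sum, while the self-delimiting $\log_2 n + 2\log_2\log_2 n$ code gives masses $1/(n(\log_2 n)^2)$, whose partial sums stabilize:

```python
import math

N = 10 ** 6
# Prior mass 2^(-codelength) for each candidate switch time n, under the
# two encodings discussed above.
plain = sum(1.0 / n for n in range(2, N))                              # grows ~ ln N
prefix_free = sum(1.0 / (n * math.log2(n) ** 2) for n in range(2, N))  # converges
print(plain, prefix_free)
```

The first sum keeps growing like $\ln N$; the second stays near 1, so only the overhead-bearing encoding yields a normalizable prior over switch points.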

Lemma 1.1 (Structural Dormancy)

Let $Y_t$ be non-catastrophe observables and $D_t$ be the catastrophe indicator. Perfect dormancy means $\mathcal{L}(Y_{1:n} \mid D_{1:n}=0, \text{doom}) = \mathcal{L}(Y_{1:n} \mid D_{1:n}=0, \text{safe})$. Under this equality assumption, the likelihood ratio from $Y_{1:n}$ is identically 1.

Lemma 1.2 (Causal Underdetermination via Latent Routings)

Assume a Turing-universal modeling language that permits arbitrary latent-variable topologies without structural penalties (such as graph faithfulness or bounded indegree). Let $\mathcal{M}_{\text{act}}$ be an active computable causal model ($\lambda > 0$). Given only observational data, there exists an observationally equivalent model $\mathcal{M}_{\text{dorm}}$ (following the observational equivalence constructions of Verma & Pearl, 1990) where $D$ is structurally d-separated given a latent mechanism $Z$ that reproduces the same marginal distribution over observed $Y$ on the survival cylinder $D_{1:n}=0$. This compilation preserves $P(\text{observables})$ but generally does not preserve interventional semantics; hence it blocks identifiability from passive data. Crucially, introducing the latent variable requires only $O(1)$ overhead, so $K_U(\mathcal{M}_{\text{dorm}}) \le K_U(\mathcal{M}_{\text{act}}) + O(1)$. The Solomonoff prior mathematically cannot force posterior odds to 0/1; they remain within a constant factor.

Theorem 1.3 (Mixture Probability Dispersion)

Fix a probability threshold $\theta \in (0,1)$. Define the high-risk hypothesis cluster $H = \{\mu \in \mathcal{M}^+ : P_\mu(E) > \theta\}$ and the low-risk cluster $L = \{\mu \in \mathcal{M}^+ : P_\mu(E) \le \theta\}$. Let $p_h := \inf_{W \in \{U,V\}} \frac{\sum_{\mu \in H} \xi_W(\mu \mid x_{1:n}) P_\mu(E)}{\xi_W(H \mid x_{1:n})}$ and $p_l := \sup_{W \in \{U,V\}} \frac{\sum_{\mu \in L} \xi_W(\mu \mid x_{1:n}) P_\mu(E)}{\xi_W(L \mid x_{1:n})}$ be the worst-case (minimum-gap) posterior-weighted average event probabilities within the high-risk and low-risk clusters respectively, and $\epsilon = 1 - P(H \cup L)$ be the residual prior mass. If the complexity inversion between clusters satisfies prior mass odds $\rho_U \ge 2^k$ and $\rho_V \le 2^{-k}$, the aggregate cross-framework gap is bounded below by:

$$P_U(E \mid x_{1:n}) - P_V(E \mid x_{1:n}) \ge (1-\epsilon)(p_h - p_l) \left( 1 - \frac{2}{1 + 2^k} \right) - \epsilon$$
Proof. Expand the posterior aggregate as $P_U(E) = \xi_U(H)p_h^U + \xi_U(L)p_l^U + \delta_U$, where $p_h^U, p_l^U$ are the within-cluster posterior-weighted averages under framework $U$, and $\delta_U \le \epsilon$ bounds the contribution from residual hypotheses outside $H \cup L$. By definition, $p_h^U \ge p_h$ and $p_l^U \le p_l$, so $P_U(E) \ge \xi_U(H)p_h + \xi_U(L)p_l + \delta_U$ (and symmetrically, $P_V(E) \le \xi_V(H)p_h + \xi_V(L)p_l + \delta_V$, since $V$ minimizes $p_h$ and maximizes $p_l$ in the worst case). Bounding the posterior weights via the given prior mass odds ratio $\rho_U \ge 2^k$ tightly restricts the convex combinations such that $\xi_U(H) \ge \frac{2^k}{1+2^k}(1-\epsilon)$. Repeating the symmetric bound for $P_V$ (where $\rho_V \le 2^{-k}$) and taking the algebraic difference, then replacing each $\delta$ with its upper bound $\epsilon$, directly yields $(1-\epsilon)(p_h - p_l) \left( \frac{2^k}{1+2^k} - \frac{1}{1+2^k} \right) - \epsilon$, which factors identically to the stated bound.
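A toy instantiation of the theorem, with invented cluster probabilities, residual mass, and $k$. For simplicity the within-cluster averages are taken identical under both frameworks and the residual contributes nothing to either aggregate, so the gap exceeds the bound by exactly $\epsilon$:

```python
k = 6
eps = 0.01              # residual prior mass outside H ∪ L
p_h, p_l = 0.90, 0.05   # within-cluster event probabilities (toy values)

def aggregate(rho):
    # Posterior event probability under a framework whose prior mass
    # odds of H over L are rho, with screened data leaving priors intact.
    w_h = (rho / (1 + rho)) * (1 - eps)
    w_l = (1 / (1 + rho)) * (1 - eps)
    return w_h * p_h + w_l * p_l   # residual contributes 0 in this toy

gap = aggregate(2 ** k) - aggregate(2 ** -k)   # rho_U = 2^k, rho_V = 2^-k
bound = (1 - eps) * (p_h - p_l) * (1 - 2 / (1 + 2 ** k)) - eps
print(gap, bound)
```

Even with only $k = 6$ bits of complexity inversion, the two frameworks' aggregate estimates sit more than 0.8 apart on identical data.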

Ground 2: Computability Limits and $\Sigma_2^0$-Completeness

Theorem 2.1 (Asymptotic Unprovability)

Let $F$ be a sound formal system extending PA. Construct a prediction model $p^*$ which at step $t$ evaluates $t$ steps of a proof search for $\neg \mathrm{Con}(F)$. If a proof is found, it outputs doom. Assuming $F$ is consistent (in the meta-theory), it never finds the proof, and so outputs safe at every step. Thus, $p^*$ is total and permanently safe. But to prove within $F$ that safety holds indefinitely, $F$ must prove that no proof is found at any future step $t$. Thus $F \vdash \text{Safe}(p^*) \implies \mathrm{Con}(F)$. By Gödel's Second Incompleteness Theorem, $F \not\vdash \mathrm{Con}(F)$, so $F$ cannot formally certify permanent safety.

Definition 2.1 (The Wait-and-Dive Latent Causal Model)

We choose integer penalties $m_t \in \mathbb{N}$ at step $t$. We route this entirely through a latent catastrophe node: conditional survival probability is $2^{-m_t}$ and catastrophe probability is $1 - 2^{-m_t}$. The absolute log-likelihood penalty on the safe sequence evaluates purely deterministically to $S_n = \sum_{t=1}^n m_t$ bits. The doom branch assigns probability 1 to a distinguished catastrophe symbol.

Theorem 2.2 (The Wait-and-Dive Obstruction)

Let $B(t)$ be a total computable, strictly monotone, unbounded, integer-valued screening threshold (without loss of generality, since $S_n$ is integer-valued and any real-valued bound can be replaced by its floor). There exists a strictly total prediction model $q^*$ whose permanent safety is true but unprovable in $F$ (unless $F$ proves its own consistency).

Proof: Program $q^*$ evaluates $t$ steps of a proof search for $\neg \mathrm{Con}(F)$. If no proof is found, it operates in WAITING mode with $m_t = 0$ (meaning $S_t = S_{t-1} \le B(t)$). If a proof is found, it permanently switches to ESCAPING mode, dynamically setting $m_n = \max(1, B(n) - S_{n-1} + 1)$ at each subsequent step $n$ to perpetually guarantee $S_n > B(n)$.
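The escape arithmetic of $q^*$ can be checked directly once the undecidable proof-search trigger is replaced by a mock trigger step. The bound $B(n) = 2n$ and the trigger at step 50 below are arbitrary choices for this sketch:

```python
def wait_and_dive(B, trigger_step, horizon):
    """WAITING keeps m_t = 0; from trigger_step onward, ESCAPING sets
    m_n = max(1, B(n) - S_{n-1} + 1), which forces S_n > B(n) forever."""
    S, escaping, trace = 0, False, []
    for n in range(1, horizon + 1):
        escaping = escaping or n >= trigger_step
        m = max(1, B(n) - S + 1) if escaping else 0
        S += m
        trace.append((n, S, B(n)))
    return trace

trace = wait_and_dive(B=lambda n: 2 * n, trigger_step=50, horizon=200)
under = all(S <= B_n for n, S, B_n in trace if n < 50)   # screened while WAITING
over = all(S > B_n for n, S, B_n in trace if n >= 50)    # permanently escaped
print(under, over)
```

Before the trigger the cumulative penalty $S_n$ sits under the screen; after it, $S_n$ permanently exceeds $B(n)$, mirroring the WAITING/ESCAPING phases of the proof.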

Theorem 2.3 ($\Sigma_2^0$-Completeness of Screening Escape)

Assuming the data-only screening bound $B(n)$ is total computable, monotone, unbounded, and integer-valued (as in Theorem 2.2), define $\mathrm{Escapes}(p) :\equiv \exists n_0 \forall n > n_0 (S_n > B(n))$. The set $\{p : \mathrm{Escapes}(p)\}$ is $\Sigma_2^0$-complete.

Proof: We reduce $\mathrm{FIN} = \{e : W_e \text{ is finite}\}$ to $\mathrm{Escapes}(p_e)$. The map $e \mapsto p_e$ is computable (uniform in $e$). Model $p_e$ simulates the canonical enumeration of $W_e$ for $t$ steps, checking whether a new element appeared at stage $t$. When a new element is enumerated, $p_e$ enters WAITING mode ($m_t = 0$) until $B(t) \ge S_{t-1}$. Once overtaken, it enters ESCAPING mode, dynamically setting $m_n > 0$ at each step to maintain $S_n > B(n)$ forever. If $W_e$ is finite, enumerations cease and $p_e$ eventually enters ESCAPING permanently. If $W_e$ is infinite, $p_e$ enters WAITING infinitely often. During WAITING, $m_t = 0$, meaning $S_t$ remains constant, allowing the unbounded $B(t)$ to catch up and surpass $S_t$ infinitely often.

Ground 3: Discrete Unification in Program Space

Theorem 3.1 (The Discrete Precision-Robustness Tradeoff)

Let $\hat{E}: \{0,1\}^* \to \{0,1\}$ be any total $\mathbf{0}'$-computable classifier attempting to decide $\mathrm{Escapes}(p)$. Define its prior-weighted error as $\epsilon_U(\hat{E}) = \sum_{\hat{E}(p) \neq \mathrm{Escapes}(p)} 2^{-K_U(p)}$. There exists a constant $c_1 > 0$ such that:

$$\epsilon_U(\hat{E}) \cdot 2^{K_U^{\mathbf{0}'}(\hat{E}) + O(\log K_U^{\mathbf{0}'}(\hat{E}))} \ge 2^{-c_1}$$
Proof: By $\Sigma_2^0$-completeness, there exists a uniform computable map $f_U(x) = p_x$ such that $x \in H_U^{\mathbf{0}'} \iff p_x \in \mathrm{Escapes}$. Because the map is uniform, the algorithmic prior mass of $p_x$ is bounded below by $2^{-|x| - c_1}$. If $\hat{E}$ achieves a prior-weighted error $\epsilon_U(\hat{E}) < 2^{-(k+c_1)}$, it must perfectly classify $p_x$ for all $|x| \le k$, since a single misclassification incurs an error penalty strictly greater than the remaining bound. If $\hat{E}$ perfectly classifies all such $p_x$, a $\mathbf{0}'$-machine can use $\hat{E}$ as an oracle to correctly decide the $\mathbf{0}'$-halting problem for all programs up to length $k$. By Chaitin's characterization of the halting probability (Chaitin, 1975), knowing the exact halting behavior of all programs up to length $k$ is computationally equivalent to determining the first $k - O(\log k)$ bits of the relativized halting probability, $\Omega_U^{\mathbf{0}'}$. Because $\Omega_U^{\mathbf{0}'}$ is Martin-Löf random relative to $\mathbf{0}'$, its prefixes are algorithmically incompressible. Therefore, the prefix-free Kolmogorov complexity of the classifier itself must satisfy $K_U^{\mathbf{0}'}(\hat{E}) \ge k - O(\log k)$. Substitution yields the final bound.

Assumption 3.1 (Optimal Resource Allocation in the Bounded Regime)

Because $f_U$ is a uniform computable reduction with bounded overhead, the Wait-and-Dive witnesses $p_x$ carry the highest prior weights at each length scale. An optimal bounded classifier minimizing $\epsilon_U(\hat{E})$ under the strict complexity budget $K_U^{\mathbf{0}'}(\hat{E}) \le k$ must greedily prioritize perfectly classifying these natively simple boundary cases. Misclassifying even one such witness incurs an error penalty that dominates the remaining error mass. Consequently, the optimal classifier must nearly exhaust its entire information capacity just encoding the boundary cases of its native framework, leaving a residual complexity budget (or "slack") of at most $O(\log k)$ bits for any arbitrary external logic.

I can't prove this, but I think it is likely: the incompressibility of $\Omega_U^{\mathbf{0}'}$ gives strong evidence, because bulk classification of boundary witnesses would constitute compression of 2-random bits. Even if the slack is somewhat larger than $O(\log k)$, Theorem 3.2 holds whenever the slack remains below $c_{UV}$, which is satisfied for any practical complexity budget given realistic framework translation costs.

Theorem 3.2 (Extensional Divergence of Discrete Screening Models)

Let $c_{UV}$ be the irreducible cross-UTM translation constant between frameworks $U$ and $V$. Define optimal bounded-complexity screening classifiers $\hat{E}_U^*$ and $\hat{E}_V^*$ as those minimizing their respective prior-weighted errors subject to the strictly bounded bit-budget $K_U^{\mathbf{0}'}(\hat{E}) \le k$. Assume the complexity budget $k$ sits in a constrained regime where the translation overhead strictly exceeds the available residual capacity ($O(\log k) < c_{UV}$); this condition holds for all $k < 2^{\Omega(c_{UV})}$, which for realistic framework translation costs encompasses any practically realizable bit-budget. (Asymptotically, the logarithmic slack eventually overtakes any fixed $c_{UV}$, at which point the bounded-resource divergence result no longer applies and the standard asymptotic convergence guarantees resume. The theorem characterizes the finite regime in which real agents operate.) Under this assumption, optimal classifiers must extensionally disagree on at least one specific, computable prediction model.

Proof. Let $p_y^V$ be framework $V$'s natively simple Wait-and-Dive witnesses. Assume for contradiction that $\hat{E}_U^*$ correctly classifies $V$'s boundary models up to $V$'s native depth $k$. To achieve this, $\hat{E}_U^*$ must effectively compute $V$'s classification logic. By Kolmogorov's Invariance Theorem, evaluating $V$'s witnesses within $U$'s framework incurs an explicit and unavoidable algorithmic penalty of $c_{UV}$ bits to encode $V$'s interpreter. However, by Assumption 3.1, $\hat{E}_U^*$ has already expended all but $O(\log k)$ bits of its $k$-bit budget securing its native error bound. Because the required cross-UTM translation overhead strictly exceeds this remaining capacity ($c_{UV} > O(\log k)$), $\hat{E}_U^*$ fundamentally lacks the information-theoretic channel capacity to store $V$'s interpreter. Thus, it cannot evaluate $V$'s native boundary models and must explicitly misclassify at least one computable witness $p_y^V$ that $\hat{E}_V^*$ classifies correctly.

Footnotes

  1. Underpinning that commitment is the Solomonoff prior, a universal prior over computable hypotheses that serves as the theoretical ideal for how a rational agent should assign beliefs. It is widely acknowledged to be uncomputable, but the standard view treats it as the correct aspiration: the answer to the question, as Yudkowsky framed it, of "how to do epistemology using infinite computing power." A key property of the Solomonoff prior is that the choice of reference Universal Turing Machine, the "programming language" of your worldview, affects complexity assignments only up to a constant. This constant is typically dismissed as asymptotically irrelevant, since the data will eventually overwhelm it. What struck me, while working through some of Gregory Chaitin's results on algorithmic randomness, is that this dismissal quietly assumes the data can do the washing out. For unprecedented catastrophic risks, where the catastrophe mechanism is dormant in all present observations, it can't. The "mere constant" between frameworks becomes the dominant term, and your probability estimate measures your choice of language rather than any feature of the world. The initial reception on LessWrong was mostly negative, and the piece was largely ignored. I am presenting an updated version here. The core argument is unchanged, but the presentation has been substantially reworked to lead with the quantitative screening framework (where the synthesis is most novel) and to subordinate the computability results to their proper supporting role. I may continue to revise this as I receive further feedback and as better tools for checking the formal arguments become available.
  2. Bostrom's own analysis doesn't commit the error being mentioned here. He takes P(doom) as an input parameter and optimizes across the full range from 1% to 99% rather than claiming to know the exact number. The divergence between his framework and Yudkowsky's is not primarily about what P(doom) is. The difference is about what the right decision-theoretic framework for reasoning about it is. This is itself an encoding dependence, operating at a meta-level: the choice of whether to frame the problem as "risky surgery" or "incoming asteroid" determines which decision-theoretic apparatus seems natural, and this determines what actions seem rational across the identification region.
  3. Current AI systems can sustain zero observable leakage under monitoring (Anthropic, 2024), and expert risk estimates vary from ≈0% to 99%. They cluster by prior worldviews rather than converging on shared evidence (Field, 2025; Grace et al., 2024). This is consistent with $\mathcal{S} \approx 1$.
  4. The random walk model and convergence timescale are conjectural extrapolations from the Nielsen-Stewart impossibility result. I have not formalized the conditions under which $\alpha > 0$ for specific risks. It's an interesting open problem.

References

  1. Greenblatt, R., Shlegeris, B., Sachan, K., & Roger, F. (2024). Alignment faking in large language models. Anthropic Research. arXiv:2412.14093.
  2. Bostrom, N. (2026). Optimal timing for superintelligence. nickbostrom.com/optimal.pdf
  3. Chaitin, G. J. (1975). A theory of program size formally identical to information theory. Journal of the ACM, 22(3), 329-340.
  4. Field, S. (2025). Why do experts disagree on existential risk? A survey of AI experts. AI and Ethics, 5, 5767-5782. doi:10.1007/s43681-025-00762-0
  5. Gilboa, I., & Schmeidler, D. (1989). Maxmin expected utility with non-unique prior. Journal of Mathematical Economics, 18(2), 141-153. doi:10.1016/0304-4068(89)90018-9
  6. Grace, K., Stewart, H., Sandkuhler, J. F., Thomas, S., Weinstein-Raun, B., & Brauner, J. (2024). Thousands of AI authors on the future of AI. arXiv:2401.02843.
  7. Hansen, L. P., & Sargent, T. J. (2008). Robustness. Princeton University Press. doi:10.1515/9781400829385
  8. Hutter, M. (2001). Convergence and error bounds for universal prediction of nonbinary sequences. In Proceedings of the 12th European Conference on Machine Learning (ECML 2001), Lecture Notes in Computer Science, vol. 2167, pp. 239-250. Springer. arXiv:cs/0106036
  9. Hutter, M. (2003). Optimality of universal Bayesian sequence prediction for general loss and alphabet. Journal of Machine Learning Research, 4, 971-997.
  10. Manski, C. F. (2003). Partial Identification of Probability Distributions. Springer.
  11. Neth, S. (2023). A dilemma for Solomonoff prediction. Philosophy of Science, 90(2), 288-306. doi:10.1017/psa.2022.72
  12. Nielsen, M., & Stewart, R. T. (2021). Persistent disagreement and polarization in a Bayesian setting. British Journal for the Philosophy of Science, 72(1), 51-78. doi:10.1093/bjps/axy056
  13. Pearl, J. (2000). Causality: Models, Reasoning, and Inference. Cambridge University Press.
  14. Verma, T. S., & Pearl, J. (1990). Equivalence and synthesis of causal models. In Proceedings of the Sixth Conference on Uncertainty in Artificial Intelligence (UAI 1990), pp. 255-270.
  15. Walley, P. (1991). Statistical Reasoning with Imprecise Probabilities. Monographs on Statistics and Applied Probability, vol. 42. Chapman & Hall.