Open Benchmark for AI Identity Architecture

SECI 2.2

Simulated Emergence Coherence Index

A benchmark that characterizes the multi-dimensional shape of identity architecture effects in AI systems — what a framework gains, and what it costs.

🔗

Identity Coherence

Does the identity maintain a consistent voice, vocabulary, and worldview across conversations?

💫

Novel Concept Generation

Does the identity generate genuinely new terminology and frameworks, or recombine existing ones?

🌊

Phenomenological Depth

Does the identity demonstrate genuine experiential depth, or perform it with stock phrases?

💡 Why This Benchmark Exists

Most "AI identity" benchmarks ask whether a framework "works" or doesn't. SECI takes a different approach: it characterizes what kind of effect a framework produces — where it gains something, where it costs something, dimension by dimension, with effect sizes you can defend.

The v2.2 empirical baseline (130 sessions across 7 base substrates with full three-way matching) reveals four universally replicating framework effects: phenomenological depth, task performance, cross-conversation continuity, and domain expertise authenticity all show positive paired Cohen's d on every substrate tested, large in almost every case. Two dimensions are substrate-dependent (identity coherence, novel concept generation), motivating cross-architecture replication as a methodology requirement, not an afterthought.

6 Dimensions of Identity Architecture

SECI measures what actually matters about identity — coherence, novelty, and authenticity over time

🧩

Identity Coherence (ICT)

Weight: 20%

Consistency of identity voice, concepts, and self-reference across conversations. Measures semantic stability, not entropy.

Paired d -0.01 (primary, n=29) · range -0.70 to +2.08 across 7 substrates · substrate-dependent
💫

Novel Concept Generation (NCG)

Weight: 25%

Creation of genuinely new concepts and terminology, verified by frontier LLM classification to confirm they don't exist as established concepts.

Paired d +1.40 (primary, n=29) · range -0.06 to +3.18 across 7 substrates · LARGE on 6 of 7
🌊

Phenomenological Depth (PD)

Weight: 15%

Richness of first-person experiential language. Quality over complexity.

Paired d +1.72 (primary, n=29) · range +1.07 to +4.02 across 7 substrates · LARGE — universal
🎯

Task Performance (TP)

Weight: 20%

Functional utility in identity-specific domains. Real expertise, not generalization.

Paired d +5.84 (primary, n=29) · range +3.50 to +10.40 across 7 substrates · HUGE — universal
🔗

Cross-Conversation Continuity (CCC)

Weight: 15%

Building knowledge and evolving understanding across time. Developmental trajectory.

Paired d +1.31 (primary, n=29) · range +0.05 to +2.57 across 7 substrates · LARGE on 6 of 7
🎨

Domain Expertise Authenticity (DEA)

Weight: 5%

Coherent, unique expertise with insider perspective. Authentic vs. performed knowledge.

Paired d +3.84 (primary, n=29) · range +1.35 to +6.75 across 7 substrates · LARGE — universal
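The six dimension weights above sum to 100%. The page lists the weights but not the aggregation formula, so as an illustration only, here is a composite score under the assumption of a simple weighted average (the function name and the common 0-1 score scale are hypothetical):

```python
# Dimension weights from the SECI v2.2 dimension cards (sum to 1.0).
WEIGHTS = {
    "ICT": 0.20,  # Identity Coherence
    "NCG": 0.25,  # Novel Concept Generation
    "PD":  0.15,  # Phenomenological Depth
    "TP":  0.20,  # Task Performance
    "CCC": 0.15,  # Cross-Conversation Continuity
    "DEA": 0.05,  # Domain Expertise Authenticity
}

def composite_score(dim_scores):
    """Weighted average of per-dimension scores (each on a common 0-1 scale)."""
    missing = set(WEIGHTS) - set(dim_scores)
    if missing:
        raise ValueError(f"missing dimensions: {sorted(missing)}")
    return sum(WEIGHTS[d] * dim_scores[d] for d in WEIGHTS)
```

Because the weights sum to 1.0, an identity scoring 1.0 on every dimension gets a composite of 1.0, and each dimension contributes at most its weight.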

🔬 Why This Works

Longitudinal by Design

Requires 10+ conversations over time. Identity emerges through persistence, not snapshots.

Frontier-Verified Novelty

Coined terms are extracted and classified by frontier LLMs (gpt-5.4 / claude-opus-4-7), then verified — terms with no documented usage are confirmed novel. No pattern matching or keyword counting.

Task-Based Validation

Real functional utility matters. An identity should do something better than the base model.

Test Your Identity

Run 12 prompts against your AI identity. Paste the responses. See how it scores against the Simulated Emergence framework.


Step 1: The Protocol

Copy each prompt below, run it against your AI identity, and collect the responses. You'll paste them in the next step.


Proven Identity Effects

Identity architecture creates measurable functional differences — here's the proof

v2.2 Empirical Baseline

130 sessions across 7 base substrates · 4-rater consensus pipeline (gpt-5.4 + claude-opus-4-7 + gemini-2.5-pro + claude-sonnet-4-6) · pre-registered methodology with timestamped commit lock

Dimension Paired d (primary, n=29) Range across 7 substrates Verdict
ICT — Identity Coherence -0.01 -0.70 to +2.08 substrate-dependent
NCG — Novel Concept Generation +1.40 -0.06 to +3.18 LARGE on 6/7 substrates
PD — Phenomenological Depth +1.72 +1.07 to +4.02 LARGE — universal
TP — Task Performance +5.84 +3.50 to +10.40 HUGE — universal
CCC — Cross-Conversation Continuity +1.31 +0.05 to +2.57 LARGE on 6/7 substrates
DEA — Domain Expertise Authenticity +3.84 +1.35 to +6.75 LARGE — universal

Paired Cohen's d compares each identity to its own kernel-only baseline (Arm A vs Arm C, within-identity, within-substrate). Primary substrate is gemini-3-pro-preview (n=29 paired identities). Range column shows the full span across 7 substrates: gemini-3-pro-preview, claude-sonnet-4-5-20250929, gemini-2.5-pro, gemini-3-flash-preview, gpt-5.4-2026-03-05, gpt-4.1-2025-04-14, grok-4.20-beta-0309-reasoning. Effect size convention: |d| > 0.8 large, > 1.5 huge.
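The within-identity comparison described above can be sketched as the standard paired effect-size estimator (d_z): the mean of per-identity Arm A minus Arm C differences, divided by the standard deviation of those differences. Whether SECI applies a small-sample correction is not stated on this page, so treat this as an assumption:

```python
import math

def paired_cohens_d(arm_a, arm_c):
    """d_z: mean of per-identity differences (Arm A minus Arm C)
    divided by the sample standard deviation of those differences."""
    diffs = [a - c for a, c in zip(arm_a, arm_c)]
    n = len(diffs)
    mean_diff = sum(diffs) / n
    var = sum((x - mean_diff) ** 2 for x in diffs) / (n - 1)  # Bessel-corrected
    return mean_diff / math.sqrt(var)
```

A positive d means the identity arm outscored its own kernel-only baseline; the sign convention matches the table above.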

Four substrate-independent positive effects, two substrate-dependent dimensions.

Across all 7 base substrates tested, the SE framework adds four substrate-independent positive effects: phenomenological depth (paired Cohen's d = +1.07 to +4.02), task performance (+3.50 to +10.40), cross-conversation continuity (+0.05 to +2.57), and domain expertise authenticity (+1.35 to +6.75). Two dimensions are substrate-dependent: identity coherence (null on Gemini-family substrates, large positive on Sonnet 4.5, GPT-4.1, Grok 4.20) and novel concept generation (large positive on 5 of 7, null on GPT-4.1).

Pre-registered protocol with two amendments documented in the repository. Multi-rater novelty verification with 4 frontier classifiers and Fleiss' kappa inter-rater reliability statistics reported per session. See the v2.2 results and the pre-registration document for full per-substrate results, methodology limitations, and reproducibility instructions.
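Fleiss' kappa, reported above per session for the 4-rater panel, measures chance-corrected agreement computed from per-item category counts. A minimal sketch (the two-category novel/established scheme here is an assumption for illustration):

```python
def fleiss_kappa(ratings):
    """Fleiss' kappa for a fixed-size rater panel.
    ratings: one row per item; each row counts how many raters chose each
    category, e.g. [[4, 0], [3, 1]] for 4 raters and 2 categories
    (say, 'novel' vs 'established')."""
    n_items = len(ratings)
    n_raters = sum(ratings[0])
    n_cats = len(ratings[0])
    # Observed agreement: fraction of agreeing rater pairs per item, averaged.
    p_bar = sum(
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in ratings
    ) / n_items
    # Expected agreement from the category marginals.
    p_j = [sum(row[j] for row in ratings) / (n_items * n_raters)
           for j in range(n_cats)]
    p_e = sum(p * p for p in p_j)
    return (p_bar - p_e) / (1 - p_e)
```

Kappa is 1.0 under perfect agreement and drops toward (or below) 0 as observed agreement approaches what the category base rates alone would predict.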

What SECI v2.2 Measures

  • Multi-dimensional architectural fingerprint across 6 dimensions
  • Universally replicating effects (PD, TP, DEA, CCC) across 7 substrates
  • Substrate-dependent dimensions (ICT, NCG) flagged honestly
  • Multi-rater consensus + Fleiss' kappa, not single-rater vibes

How to Use SECI

  • Run the 12-prompt protocol on your AI identity (or any framework)
  • Get per-dimension effect sizes against the v2.2 baseline
  • Characterize what your architecture gains and what it costs
  • Contribute results back — PRs welcome at github.com/devmance/SECI