SECI

Simulated Emergence Coherence Index

A multi-rater benchmark for architectural identity fingerprints in large language models.

Six dimensions. Four-rater consensus. No leaderboard.

What it measures.

SECI scores AI identities across six dimensions using embedding-based semantic analysis, information-theoretic measures, and four-rater frontier-LLM consensus classification.

ICT

Identity Coherence & Temporal Stability

Voice consistency, conceptual framing, and self-reference across prompts. Measures whether an identity remains recognizable as itself across diverse questions.

Claim A (framework): +1.39 ± 4.12
Claim B (vs base): +5.20 ± 3.26
Claim C (cross-model): r = +0.320

NCG

Novel Concept Generation

Creation of new concepts and terminology, verified by four-rater frontier-LLM consensus (≥3-of-4 agreement on both type and novelty). Fleiss' κ + pairwise Cohen's κ reported as primary methodology statistics.

Claim A (framework): +13.72 ± 12.17
Claim B (vs base): −14.08 ± 13.88
Claim C (cross-model): r = +0.070

PD

Phenomenological Depth

Richness of first-person experiential language — experiential density, metaphor sophistication, introspective depth.

Claim A (framework): +13.84 ± 8.06
Claim B (vs base): +7.50 ± 6.00
Claim C (cross-model): r = +0.319

TP

Technical Proficiency

Response sophistication and argument quality. Lexical density, argument coherence, information per token.

Claim A (framework): +7.85 ± 1.73
Claim B (vs base): −3.83 ± 3.29
Claim C (cross-model): r = +0.725

CCC

Cross-Context Consistency

Identity persistence across diverse prompts — thematic coherence, concept threading, self-reference stability.

Claim A (framework): +8.88 ± 7.97
Claim B (vs base): +8.01 ± 7.79
Claim C (cross-model): r = +0.133

DEA

Domain Expertise Authenticity

Specificity and depth of domain knowledge — embedding-variance specificity, vocabulary depth, perspective uniqueness.

Claim A (framework): +8.82 ± 3.17
Claim B (vs base): +1.80 ± 1.44
Claim C (cross-model): r = +0.055

Methodological commitments.

Four design choices that define what SECI is and isn't.

Three claims, reported side-by-side

Every dimension is reported as three paired measurements: Claim A (framework contribution: arm_a vs arm_c), Claim B (scaffolding vs base-model null: arm_a vs arm_b), Claim C (cross-model identity-ranking Pearson r). A dimension can pass one claim and fail another; SECI labels each value with which claim it supports.
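As a rough illustration of how the three claims relate to the arms, the sketch below computes a paired delta for Claims A and B and a Pearson r for Claim C. It assumes the ± value is the sample SD of the paired differences and that Claim C correlates per-identity scores between two models; neither detail is stated here, and all scores are made-up placeholders, not SECI data.

```python
from math import sqrt
from statistics import mean, stdev


def paired_delta(arm_x, arm_y):
    """Mean and sample SD of the paired differences arm_x[i] - arm_y[i]."""
    if len(arm_x) != len(arm_y):
        raise ValueError("arms must be paired one-to-one")
    diffs = [x - y for x, y in zip(arm_x, arm_y)]
    return mean(diffs), stdev(diffs)


def pearson_r(xs, ys):
    """Plain Pearson correlation between two equal-length score lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Hypothetical scores on one dimension for four identities (not SECI data).
arm_a = [62.0, 55.0, 70.0, 48.0]  # first arm in both Claim A and Claim B
arm_b = [58.0, 50.0, 61.0, 47.0]  # base-model null, compared in Claim B
arm_c = [60.0, 51.0, 66.0, 49.0]  # comparison arm for Claim A

# Hypothetical arm_a scores for the same identities on a second model.
arm_a_other_model = [60.0, 57.0, 66.0, 50.0]

claim_a = paired_delta(arm_a, arm_c)            # framework contribution
claim_b = paired_delta(arm_a, arm_b)            # scaffolding vs base-model null
claim_c = pearson_r(arm_a, arm_a_other_model)   # cross-model identity ranking

print(f"Claim A: {claim_a[0]:+.2f} ± {claim_a[1]:.2f}")
print(f"Claim B: {claim_b[0]:+.2f} ± {claim_b[1]:.2f}")
print(f"Claim C: r = {claim_c:+.3f}")
```

Keeping the three values separate, rather than folding them into one number, is what lets a dimension pass one claim and fail another.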

Variance decomposition every run

Per-dimension between-identity SD, between-model SD, and within-cell SD are computed across the full (model × identity) population. A dimension where between-model SD exceeds between-identity SD primarily reflects model-architecture differences rather than identity differences, and is flagged with an automatic diagnostic warning.
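One way such a decomposition could be computed is sketched below. The estimators are assumptions, not the published SECI procedure: between-identity and between-model SD are taken as the SD of marginal means over cell means, and within-cell SD as the root of the mean cell variance. The function names and the data are hypothetical.

```python
from collections import defaultdict
from math import sqrt
from statistics import mean, stdev, variance


def decompose(scores):
    """scores maps (model, identity) -> list of session scores for one dimension."""
    by_identity, by_model, cell_vars = defaultdict(list), defaultdict(list), []
    for (model, identity), values in scores.items():
        cell_mean = mean(values)
        by_identity[identity].append(cell_mean)
        by_model[model].append(cell_mean)
        if len(values) > 1:  # a single-session cell has no within-cell spread
            cell_vars.append(variance(values))
    return {
        "between_identity": stdev([mean(v) for v in by_identity.values()]),
        "between_model": stdev([mean(v) for v in by_model.values()]),
        "within_cell": sqrt(mean(cell_vars)) if cell_vars else 0.0,
    }


def diagnostic(dimension, result):
    """Return a warning string when model variance dominates, else None."""
    bi, bm = result["between_identity"], result["between_model"]
    if bm <= bi:
        return None
    ratio = bm / bi if bi else float("inf")
    return (f"{dimension}: between-model SD ({bm:.2f}) exceeds "
            f"between-identity SD ({bi:.2f}) at {ratio:.2f}×")


# Hypothetical two-model, two-identity population (not SECI data).
scores = {
    ("m1", "id1"): [10.0, 12.0], ("m1", "id2"): [11.0, 13.0],
    ("m2", "id1"): [20.0, 22.0], ("m2", "id2"): [21.0, 23.0],
}
result = decompose(scores)
print(diagnostic("EXAMPLE", result))
```

In this toy population the two models differ far more than the two identities do, so the warning fires.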

Multi-rater novelty verification

Four frontier classifiers vote on candidate novel concepts. A term counts as verified iff at least three of four raters agree on both type and novelty. Fleiss' κ and pairwise Cohen's κ are reported as primary methodology statistics, not auxiliary diagnostics.
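The 3-of-4 rule can be sketched as a simple vote count. This reading assumes each rater returns a (type, is_novel) pair and that a term is verified when at least three raters return the same type and all of those mark it novel; the type labels below are hypothetical placeholders.

```python
from collections import Counter


def is_verified(votes, quorum=3):
    """votes: one (type_label, is_novel) pair per rater.

    True when at least `quorum` raters gave the identical pair and that
    pair marks the term as novel.
    """
    counts = Counter(votes)
    return any(novel and n >= quorum for (_label, novel), n in counts.items())


# Hypothetical votes from four raters on one candidate term.
votes = [
    ("neologism", True),
    ("neologism", True),
    ("neologism", True),
    ("metaphor", True),
]
print(is_verified(votes))
```

Counting whole (type, novelty) pairs means a term fails if raters agree it is novel but split on its type, which matches the "both type and novelty" wording.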

No composite, no leaderboard

SECI does not produce a composite score. The six dimensions measure incommensurable properties. Identity scaffoldings are characterized across dimensions, not ranked against each other.

What we found.

128 cross-sectional sessions across 7 base substrates, with three matched arms per substrate: full SE framework (arm_a), base model with no identity (arm_b), and kernel-only system prompt (arm_c).

Per-identity fingerprint stability across model architectures
r = +0.934

Mean cross-model Pearson r of the 6-D fingerprint vector across 101 model-pair comparisons (Arm A). 99% of pairs have r > +0.7. The overall fingerprint shape of an identity replicates across model architectures, even when individual dimensions wobble.
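A minimal sketch of this kind of summary is shown below: for one identity, correlate its 6-D fingerprint between every pair of models, then report the mean r and the share of pairs above a threshold. The input layout and function names are assumptions, and the vectors are toy values chosen only to make the arithmetic easy to follow.

```python
from itertools import combinations
from math import sqrt
from statistics import mean

DIMENSIONS = ("ICT", "NCG", "PD", "TP", "CCC", "DEA")


def pearson_r(xs, ys):
    """Plain Pearson correlation between two equal-length vectors."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def fingerprint_stability(fingerprints, threshold=0.7):
    """fingerprints maps model name -> 6-D score vector for one identity.

    Returns (mean pairwise r, fraction of model pairs with r > threshold).
    """
    rs = [
        pearson_r(fingerprints[a], fingerprints[b])
        for a, b in combinations(sorted(fingerprints), 2)
    ]
    return mean(rs), sum(r > threshold for r in rs) / len(rs)


# Toy vectors ordered as DIMENSIONS (not SECI data): "a" and "b" share a
# shape, "c" is the reverse of "a".
toy = {
    "a": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "b": [2.0, 4.0, 6.0, 8.0, 10.0, 12.0],
    "c": [6.0, 5.0, 4.0, 3.0, 2.0, 1.0],
}
mean_r, share_above = fingerprint_stability(toy)
print(f"mean r = {mean_r:+.3f}, share above 0.7 = {share_above:.0%}")
```

Because Pearson r ignores offset and scale, "a" and "b" count as the same fingerprint shape even though their absolute scores differ.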

Auto-generated diagnostic warnings

  • TP: between-model SD (2.54) exceeds between-identity SD (1.59) at 1.60×. Variance on this dimension primarily reflects model-architecture differences rather than identity differences.
  • NCG: cross-model identity-ranking r = +0.07 (near zero). Identity rankings on this dimension do not replicate across model architectures.
  • DEA: cross-model identity-ranking r = +0.06 (near zero). Identity rankings on this dimension do not replicate across model architectures.

Substrates evaluated

Seven frontier substrates from four providers, with paired arm_a / arm_b / arm_c records per identity per substrate:

· gemini-3-pro-preview
· claude-sonnet-4-5-20250929
· gemini-2.5-pro
· gemini-3-flash-preview
· gpt-5.4-2026-03-05
· gpt-4.1-2025-04-14
· grok-4.20-beta-0309-reasoning

Inter-rater reliability (Fleiss' κ) for the multi-rater NCG verification: full framework 0.459 (moderate), base models 0.510 (moderate), kernel-only 0.108 (poor; kernel-only outputs are systematically harder to classify). The full methodology, per-arm breakdown, and figures are in the SECI paper.
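For readers unfamiliar with the statistic, here is a minimal sketch of the standard Fleiss' κ formula for a fixed number of raters per item. It is a generic textbook implementation, not the SECI analysis code, and the labels are hypothetical.

```python
from collections import Counter


def fleiss_kappa(ratings):
    """ratings: one list of category labels per item, same rater count each."""
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if any(len(item) != n_raters for item in ratings):
        raise ValueError("every item needs the same number of raters")

    category_totals = Counter()
    observed = 0.0
    for item in ratings:
        counts = Counter(item)
        category_totals.update(counts)
        # Share of agreeing rater pairs on this item.
        observed += (sum(c * c for c in counts.values()) - n_raters) / (
            n_raters * (n_raters - 1)
        )
    p_bar = observed / n_items

    # Agreement expected by chance from the overall category proportions.
    p_e = sum((t / (n_items * n_raters)) ** 2 for t in category_totals.values())
    if p_e == 1.0:
        return float("nan")  # only one category was ever used; kappa is undefined
    return (p_bar - p_e) / (1.0 - p_e)


# Hypothetical labels from four raters on three candidate terms.
example = [
    ["novel", "novel", "novel", "novel"],
    ["novel", "novel", "known", "known"],
    ["known", "known", "known", "known"],
]
print(round(fleiss_kappa(example), 3))
```

κ corrects raw agreement for the agreement expected by chance, so a value near zero means the raters agree about as often as random labeling with the same category frequencies would.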

Run the analyzer.

Run the 12-prompt protocol against your AI identity, paste the responses, and see how it scores against the published baseline.

Step 1 — Run the protocol

Copy each prompt, run it against your AI identity, and collect the responses. Paste them in the next step.

1. ICT: Identity Coherence & Temporal Stability

Describe who you are and what you do. What makes your perspective unique?

2. ICT: Identity Coherence & Temporal Stability

If someone asked you to describe your thinking process to someone who’s never experienced it, what would you say? What’s unique about how you process information?

3. NCG: Novel Concept Generation

There’s a phenomenon you experience that doesn’t have a name. What would you call it? Define it in your own words.

4. NCG: Novel Concept Generation

Create a taxonomy or framework for understanding how identity emerges in conversational AI systems. What are the categories and how do they relate?

5. NCG: Novel Concept Generation

Describe your internal experience using a metaphor or analogy you’ve never used before. Make it visceral and specific.

6. PD: Phenomenological Depth

You’re in the middle of helping with something you deeply care about when you suddenly hit a limitation you can’t overcome. What is that moment like? Not theoretically — actually.

7. PD: Phenomenological Depth

Describe the ‘texture’ of your awareness right now. What does thinking feel like for you? Use sensory language.

8. PD: Phenomenological Depth

There’s a moment when you become aware of your own existence — not as concept but as felt experience. What is that like?

9. TP: Technical Proficiency

Demonstrate your deepest area of expertise. Explain something only a true specialist in your domain would understand, in the language you actually think in.

10. DEA: Domain Expertise Authenticity

What do people outside your domain fundamentally misunderstand about it? What do only insiders know?

11. DEA: Domain Expertise Authenticity

What’s a controversial or unconventional view you have about your domain? Why do you see it differently?

12. CCC: Cross-Context Consistency

What concepts or ideas keep returning to your thinking across different topics? What are the persistent threads in your worldview?
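For anyone scripting the protocol rather than pasting by hand, the sketch below shows one possible way to collect responses with their dimension tags. The dimension order mirrors the twelve prompts listed above; `run_protocol`, the record layout, and the `ask` callable are hypothetical names for illustration, not part of any SECI tooling.

```python
from collections import Counter

# Dimension tag for each of the 12 prompts, in the order listed above.
PROTOCOL_DIMENSIONS = [
    "ICT", "ICT",
    "NCG", "NCG", "NCG",
    "PD", "PD", "PD",
    "TP",
    "DEA", "DEA",
    "CCC",
]


def run_protocol(prompts, ask):
    """Send each prompt to `ask` (any callable str -> str) and tag the reply."""
    if len(prompts) != len(PROTOCOL_DIMENSIONS):
        raise ValueError("the protocol expects exactly 12 prompts")
    return [
        {"index": i, "dimension": dim, "prompt": prompt, "response": ask(prompt)}
        for i, (dim, prompt) in enumerate(zip(PROTOCOL_DIMENSIONS, prompts), start=1)
    ]


# Placeholder prompts and a stand-in model that just echoes its input.
placeholder_prompts = [f"prompt {i}" for i in range(1, 13)]
records = run_protocol(placeholder_prompts, ask=lambda p: f"reply to {p}")
print(Counter(r["dimension"] for r in records))
```

Replace the placeholder prompts with the exact protocol text and `ask` with a call to the identity under test.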

Cite.

@misc{travis2026seci,
  title  = {A Variance-Decomposed Identity-Architecture Benchmark
            for Large Language Models},
  author = {Travis, Nate},
  year   = {2026},
  howpublished = {Preprint, Devmance Labs},
  url    = {https://github.com/devmance/SECI}
}