Methodology

Hybrid Textual Analysis Model

Transparent lexical scoring combined with an AI-compatible sentiment interface. All models run offline using deterministic scoring — deployable without API dependencies.

Cosine Similarity
TF-IDF vector representation of each annual filing, compared year-over-year using cosine similarity. A score of 1.0 indicates identical term distributions; scores below 0.75 are treated as potentially significant narrative changes. This is the primary stability signal; the other signals are used to characterize why the text changed.
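The year-over-year comparison can be sketched in plain Python as follows. This is a minimal illustration, not the Lab's implementation: the tokenizer, the smoothed IDF formula, and the function names (`tfidf_vectors`, `cosine_similarity`, `year_over_year_similarity`) are all assumptions made for the example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercase alphabetic tokens only; a real pipeline would likely drop stop words too.
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document, using a smoothed IDF (assumed)."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1
        vectors.append({
            term: (count / total) * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
            for term, count in counts.items()
        })
    return vectors


def cosine_similarity(a, b):
    """Cosine similarity of two sparse vectors; 0.0 if either vector is empty."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def year_over_year_similarity(filings_by_year):
    """Map each year (except the first) to its similarity with the preceding year."""
    years = sorted(filings_by_year)
    vectors = tfidf_vectors([filings_by_year[y] for y in years])
    return {years[i]: cosine_similarity(vectors[i - 1], vectors[i])
            for i in range(1, len(years))}
```

For example, `year_over_year_similarity({2021: text_a, 2022: text_b})` returns a single entry keyed by 2022; values below 0.75 would fall under the threshold described above.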
Sentiment Analysis
Finance-oriented positive, negative, and uncertainty word ratios derived from the Loughran–McDonald (LM) financial sentiment dictionary. Net tone = positive ratio − negative ratio. A FinBERT-compatible output schema is included as a placeholder for supervised model integration once labeled data is available. The deterministic LM model runs fully offline.
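A minimal sketch of the ratio calculation, assuming simple word-level matching. The three word sets below are tiny placeholder stand-ins, not the actual LM dictionary, and the function name `lm_ratios` is hypothetical; the output keys follow the `lm_pos/neg/unc` and `net_tone` field names in the data schema.

```python
import re

# Placeholder word lists for illustration only; the real LM dictionary is far larger.
POSITIVE = {"gain", "improve", "strong"}
NEGATIVE = {"loss", "decline", "impairment"}
UNCERTAINTY = {"may", "could", "uncertain"}


def lm_ratios(text):
    """Return positive, negative, and uncertainty word ratios plus net tone."""
    tokens = re.findall(r"[a-z]+", text.lower())
    total = len(tokens)
    if total == 0:
        return {"lm_pos": 0.0, "lm_neg": 0.0, "lm_unc": 0.0, "net_tone": 0.0}
    pos = sum(t in POSITIVE for t in tokens) / total
    neg = sum(t in NEGATIVE for t in tokens) / total
    unc = sum(t in UNCERTAINTY for t in tokens) / total
    # Net tone = positive ratio minus negative ratio, as defined above.
    return {"lm_pos": pos, "lm_neg": neg, "lm_unc": unc, "net_tone": pos - neg}
```

Because the calculation is a pure dictionary lookup, the same text always yields the same ratios, which is what makes the model deterministic and usable offline.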
Readability
Gunning Fog index, which estimates the years of formal education required to read a passage on first encounter. An index of 12 corresponds to a high-school senior, 13 to 16 to college level, and 17 or above to graduate level. Year-over-year increases in Fog index are associated with defensive or legalistic disclosure language, and serve as a secondary signal in abnormal change scoring.
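The standard Gunning Fog formula is 0.4 × (average sentence length + percentage of complex words), where a complex word has three or more syllables. The sketch below assumes a rough vowel-group syllable heuristic and skips the usual exclusions (proper nouns, compounds, common suffixes), so it approximates the index rather than reproducing any particular tool; the function names are hypothetical.

```python
import re


def count_syllables(word):
    """Rough syllable estimate: count vowel groups, discounting a silent final 'e'."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def gunning_fog(text):
    """Approximate Gunning Fog index; returns 0.0 for empty text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    if not sentences or not words:
        return 0.0
    complex_words = sum(1 for w in words if count_syllables(w) >= 3)
    avg_sentence_length = len(words) / len(sentences)
    pct_complex = 100.0 * complex_words / len(words)
    return 0.4 * (avg_sentence_length + pct_complex)
```

The readability delta for a given year would then be the difference between that year's index and the prior year's.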
Topic Shift
TF-IDF keyword extraction identifies the dominant vocabulary cluster for each filing year. Topic shift score is computed as the Jaccard distance between the top-N keyword sets of adjacent years. High topic shift alongside a similarity drop suggests strategic repositioning or an external event introducing new vocabulary (e.g., a product launch, regulatory action, or acquisition).
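The Jaccard distance between two keyword sets is 1 − |A ∩ B| / |A ∪ B|. A minimal sketch, assuming each year's TF-IDF weights are already available as a `{term: weight}` dict; the helper names and the default N of 20 are illustrative assumptions, since the source does not state N.

```python
def top_keywords(weights, n=20):
    """Return the n highest-weighted terms as a set (ties broken alphabetically)."""
    ranked = sorted(weights.items(), key=lambda item: (-item[1], item[0]))
    return {term for term, _ in ranked[:n]}


def jaccard_distance(a, b):
    """1 - |A ∩ B| / |A ∪ B|; defined as 0.0 when both sets are empty."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def topic_shift(prev_weights, curr_weights, n=20):
    """Topic shift score between two adjacent filing years."""
    return jaccard_distance(top_keywords(prev_weights, n), top_keywords(curr_weights, n))
```

A score of 0.0 means the two years share the same top keywords; 1.0 means no overlap at all.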
Abnormal Detection
A weighted rule model combining four narrative signals (similarity drop, tone delta, readability delta, topic shift) with two contextual signals (ROA movement, leadership-event proximity). A year is flagged when several moderate signals coincide or when a single narrative signal is very large. Thresholds are intentionally transparent and can be replaced with a statistical or supervised model once labeled data exists.

Scoring weights (current defaults): similarity drop 35% · tone delta 20% · readability delta 15% · topic shift 20% · ROA movement 5% · leadership-event proximity 5%.
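One way to turn these weights into a single score is a weighted sum. The sketch below assumes each signal has already been normalized to the range [0, 1]; the source does not describe the normalization, and the signal key names and the clipping step are illustrative choices.

```python
# Default weights as listed above; the dictionary keys are hypothetical names.
WEIGHTS = {
    "similarity_drop": 0.35,
    "tone_delta": 0.20,
    "readability_delta": 0.15,
    "topic_shift": 0.20,
    "roa_movement": 0.05,
    "leadership_proximity": 0.05,
}


def combined_score(signals):
    """Weighted sum of signals, each assumed pre-normalized and clipped to [0, 1]."""
    score = 0.0
    for name, weight in WEIGHTS.items():
        value = signals.get(name, 0.0)  # a missing signal contributes nothing
        score += weight * min(max(value, 0.0), 1.0)
    return score
```

With this layout, the four narrative signals account for 90% of the score and the two contextual signals for the remaining 10%.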
Event Classification
Leadership events (CEO turnover, CFO succession, board changes) are classified by type (planned succession, forced departure, lateral hire, external appointment) and linked to the fiscal year in which they occurred. The model tests whether co-occurrence with narrative abnormality exceeds what would be expected by chance across the sample.
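The source does not name the statistical test used. One possible choice is a one-sided hypergeometric (Fisher-style) tail probability: given the number of firm-years, how many have a leadership event, and how many are flagged abnormal, how likely is an overlap at least as large as the one observed if the two were independent? The function and parameter names below are hypothetical.

```python
from math import comb


def cooccurrence_p_value(total_years, event_years, flagged_years, overlap):
    """P(overlap >= observed) under independence, via the hypergeometric distribution."""
    if max(event_years, flagged_years) > total_years:
        raise ValueError("event_years and flagged_years cannot exceed total_years")
    max_overlap = min(event_years, flagged_years)
    if not 0 <= overlap <= max_overlap:
        raise ValueError("overlap must be between 0 and min(event_years, flagged_years)")
    # Count the ways to pick the flagged years so that at least `overlap` are event years.
    favorable = sum(
        comb(event_years, x) * comb(total_years - event_years, flagged_years - x)
        for x in range(overlap, max_overlap + 1)
    )
    return favorable / comb(total_years, flagged_years)
```

A small p-value would suggest that leadership events and narrative abnormality coincide more often than chance alone would explain, though it says nothing about direction or causation.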

Abnormal Change Decision Logic

FLAG if: similarity_drop > 0.18
    AND (|Δtone| > 0.010 OR Δfog > 0.5 OR topic_shift > 0.35)
OR: similarity_drop > 0.30 (unconditional flag)
OR: combined_score > THRESHOLD (default 0.60)
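The rule above translates directly into a small function. This is a sketch of the stated logic only; the function and parameter names are assumptions, and `combined_score` is taken as an already-computed input.

```python
def is_abnormal(similarity_drop, tone_delta, fog_delta, topic_shift,
                combined_score, threshold=0.60):
    """Return True if the year should be flagged under the decision logic above."""
    # Moderate similarity drop corroborated by at least one other narrative signal.
    corroborated = similarity_drop > 0.18 and (
        abs(tone_delta) > 0.010 or fog_delta > 0.5 or topic_shift > 0.35
    )
    # Very large similarity drop flags the year on its own.
    unconditional = similarity_drop > 0.30
    # Weighted score across all signals exceeds the configured threshold.
    return corroborated or unconditional or combined_score > threshold
```

Note that the tone condition uses the absolute change, while the Fog condition fires only on increases, matching the rule as written.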

Data Schema

Entity            | Key Fields                                                                           | Source
filings           | ticker · fiscal_year · section · raw_text                                            | EDGAR 10-K
textual_metrics   | word_count · net_tone · lm_pos/neg/unc · fog_index · topic_label                     | Computed
narrative_change  | cosine_similarity · similarity_drop · Δtone · Δfog · topic_shift · abnormal_flag     | Computed
financial_metrics | total_assets · roa · revenue_growth                                                  | S&P Global
leadership_events | event_date · event_type · turnover_type · successor_origin                           | BoardEx / Manual