# ReviewOS — Product Requirements Document

**Client:** MediGen (regulated life-sciences organization)
**Product:** ReviewOS — a layered document-intelligence platform
**PRD build target:** Strategic Synthesis Engine
**Status:** Stage 2 - Pending Client Feedback

## Release tag convention

Every requirement and feature carries a release tag.

| Tag | Release | Product form |
|---|---|---|
| `[M0]` | MVP 0 | Corpus API + MCP server (read-only retrieval substrate) — **the current build target** |
| `[M1]` | MVP 1 | Document ingestion / corpus maintenance |
| `[M2]` | MVP 2 | Headless citation + cross-reference agent |
| `[M3]` | MVP 3 | ReviewOS review-workflow application |
| `[X1]` | Extension | Multimodal document search (graphs, images) |
| `[X2]` | Extension | AI expanded search with Thinking model |
| `[X3]` | Extension | Recursive retrieval optimization |

---

## 1. Product Overview

**ReviewOS** is a layered document-intelligence platform for regulated life-sciences teams. It starts as a secure retrieval substrate over MediGen's existing ~50K-document corpus and matures into a full review-workflow application only after retrieval value and usage patterns justify it.

The PRD specs **MVP 0 (Corpus API + MCP server)** in build-ready depth — a **read-only**, permission-aware, citation-first retrieval layer callable from the AI tools reviewers already use (Claude, ChatGPT, internal copilots) — and treats the later layers as tagged roadmap.

| Release | Product form | What it adds |
|---|---|---|
| `[M0]` | Corpus API + MCP server | Hybrid search + cited passages over the existing corpus, read-only |
| `[M1]` | Ingestion workflow | Authorized users add/update/deprecate documents |
| `[M2]` | Headless citation + cross-reference agent | Grounded multi-document synthesis, contradiction/gap detection |
| `[M3]` | ReviewOS application | Projects, queues, approvals, templates, exports, analytics |
| `[X1]` `[X2]` `[X3]` | Multimodal · expanded search · retrieval optimization | Figures/tables/scans; thinking-model search expansion; self-improving retrieval |

The platform compresses the search → retrieve → cite → synthesize layer; experts keep accountability for interpretation, exceptions, and final approval.

---

## 2. Business Objective & Success Metrics

**Primary objective:** collapse the cycle time from a regulatory/scientific question to a defensible, cited answer — by making the existing corpus instantly queryable with source-grounded retrieval, without sacrificing citation quality or auditability.

**Why cycle time:** for a 25-person team carrying ~29,600 annual review hours (~$4.1–4.5M loaded cost), the dominant waste is elapsed time searching, re-reading, and reconstructing evidence the org already holds. The knowledge exists; it isn't operationally accessible. `[M0]` attacks that directly.

| Metric | Baseline (today) | Target | Release |
|---|---|---|---|
| Time to first useful source | hours–days | < 120 sec after query | `[M0]` |
| Known-answer retrieval accuracy (top-5) | uncaptured | > 80% on golden set | `[M0]` |
| Citation metadata coverage | variable | > 95% of chunks | `[M0]` |
| Retrieval latency (excl. synthesis) | n/a | < 10 sec | `[M0]` |
| Factual-claim citation rate | uncaptured | > 95% | `[M2]` |
| Time to cited research packet | days | 50%+ faster than manual | `[M2]` |
| End-to-end cycle time | days | > 20% faster | `[M3]` |

Secondary value across releases: process and quality standardization, more researcher and legal-team time for higher-value tasks, improved research defensibility, and evidence reuse.

---

## 3. Users

`[M0]`'s "user" is the research analyst working through their enterprise chat application, which calls the corpus over MCP/API. More human personas are served as later releases ship.

| Persona | Need | First served by |
|---|---|---|
| **AI client (Claude/ChatGPT/internal copilot)** | Call corpus retrieval as a tool; receive structured cited evidence | `[M0]` |
| **Research analyst** | Ask questions across large doc sets, get source evidence | `[M0]` |
| **Attorney / regulatory reviewer** | Find relevant passages, issues, supporting citations | `[M0]` |
| **Data/IT owner** | Provision permission-aware corpus access | `[M0]` |
| **Reviewer/SME assembling packets** | Grounded multi-doc synthesis with contradiction/gap flags | `[M2]` |
| **Review manager** | Projects, queues, approvals, provenance dashboard, exports | `[M3]` |
| **Admin** | Roles, retention, audit, model settings | `[M3]` |

---

## 4. Current-State Workflow

Today MediGen's review work is human-led and search-heavy. The documents usually already exist across approved repositories — the bottleneck is turning a fragmented corpus into a defensible, cited answer. A 25-person team plausibly spends ~29,600 annual hours here (~$4.1–4.5M loaded cost), concentrated in five stages: **search, passage retrieval, citation capture, summarization, cross-document synthesis.**

| # | Stage | What happens today | Owner |
|---|---|---|---|
| 1 | Question intake | Define question, output, deadline, sensitivity, decision owner | Legal/reg lead, review mgr |
| 2 | Corpus boundary | Identify in-scope repositories, submissions, study/quality/legal folders | Review mgr, IT/data owner |
| 3 | Search strategy | Translate question into keywords, synonyms, study IDs, date ranges | Researcher, SME |
| 4 | Manual search | Hunt folders, DMS exports, PDFs, trackers, prior memos | Analyst, paralegal |
| 5 | Relevance screening | Open candidates, decide what's worth deep review | Analyst, mid-level reviewer |
| 6 | Passage retrieval | Locate exact pages/sections/tables/clauses supporting the answer | Researcher, attorney |
| 7 | Citation capture | Copy citations, filenames, pages, excerpts into notes/sheets | Analyst, paralegal |
| 8 | Summarization | Manually paraphrase long technical/legal source material | Researcher, attorney |
| 9 | Cross-doc synthesis | Reconcile conflicts, find gaps, weigh strongest evidence | Attorney, SME |
| 10 | Sensitivity/privilege check | Flag privileged, confidential, PII/PHI, trade-secret content | Attorney, compliance SME |
| 11 | Expert validation | Escalate ambiguous/high-risk/material findings | Review mgr, SME |
| 12 | Answer/memo prep | Turn validated findings into cited answer, evidence table, memo | Attorney, research lead |
| 13 | Audit trail / reuse | Save decisions, citations, edits, approvals across scattered tools | Review mgr, legal ops |

**Headline failure modes:** brittle keyword search misses evidence; experts re-read raw documents instead of cited synthesis; summaries drift from source; provenance and prior work scatter across email/sheets/DMS; baseline metrics (time-to-source, citation precision, rework) go uncaptured, so ROI is hard to prove.

---

## 5. Future-State Workflow

The design principle is to assign every stage to one of three operating modes, shifting first-pass structuring and synthesis to AI while humans keep judgment and accountability. Each shift is tagged with the release that delivers it.

**Operating modes:**
- **Human-only** — judgment, liability, sensitivity, or ambiguity is high; AI provides context but cannot finalize.
- **Human + AI** — AI accelerates, human validates. Typically a review step.
- **AI-led** — AI completes the step (e.g., retrieval or citation capture); a human reviews the output downstream.

| # | Stage | Future mode | AI role | Delivered by |
|---|---|---|---|---|
| 1 | Question intake | Human-only | — (human frames the research question) | — |
| 2 | Corpus boundary | AI-led | Permission-aware scoping over indexed repositories | `[M0]` |
| 3 | Search strategy | Human + AI | Semantic + keyword expansion; no brittle keyword guessing | `[M0]` |
| 4 | Search & retrieval | AI-led | Hybrid retrieval returns ranked passages w/ metadata | `[M0]` |
| 5 | Relevance screening | Human + AI | Confidence-ranked candidate passages | `[M0]` |
| 6 | Passage retrieval | AI-led | Exact passage + page/section, not just doc-level match | `[M0]` |
| 7 | Citation capture | AI-led | Structured citation objects (doc, page, chunk, score, source text) | `[M0]` |
| 8 | Summarization | Human + AI | Cited single-doc summaries | `[M2]` |
| 9 | Cross-doc synthesis | Human + AI | Multi-doc reasoning, contradiction/gap detection, evidence packet | `[M2]` |
| 10 | Sensitivity/privilege | Human-only | Flag candidates; access scoping prevents leakage | `[M0]` flag-only → human decides |
| 11 | Expert validation | Human-only | Route low-confidence/high-risk items to experts | `[M2]` routing → human judges |
| 12 | Answer/memo prep | Human + AI | Draft cited memo/evidence table from approved evidence | `[M3]` |
| 13 | Audit trail / reuse | AI-led | Auto-log query→passage→edit→approval; reusable evidence sets | `[M0]` query logs → `[M3]` full audit |

**Process evolution:** `[M0]` brings AI into search, retrieval, and citation capture for immediate uplift (stages 2–7, 13-partial) from inside existing tools. `[M1]` lets the document repository, and the search agents built on it, take in new information for new research efforts. `[M2]` adds grounded summarization and synthesis (8–9, 11-routing). `[M3]` standardizes the workflow end-to-end and supports memo production with full auditing (12–13). Human-only judgment (1, 10, final 11) stays human at every release.

---

## 6. MVP Scope

The PRD's buildable target is `[M0]`. Scope is defined per release so the boundary is explicit.

### `[M0]` Corpus API + MCP Server — *the PRD's build target*

**In scope:**
- Ingest the existing ~50K-document corpus (one-time, engineer-run)
- Parse supported types (PDF, DOCX, TXT, spreadsheets) → text + structure
- Chunk into retrievable passages; extract metadata (doc ID, title, type, source, date, page, section)
- Vector index (semantic) **and** keyword/BM25 index (exact terms, IDs, clauses, dates, regulation refs)
- **Hybrid retrieval** combining both + metadata filters
- Retrieval **API** returning structured, citation-first evidence objects
- **MCP server** exposing ≥1 read-only retrieval tool to AI clients
- Permission-aware access (document-level access metadata; no cross-scope citation leakage)
- Query/retrieval logging (query, chunks, latency, errors, tool used, usefulness rating if available)
- Failed-parse logging (100% visibility) + indexing-health status
- Host the database in a client-approved environment
- **Golden-set eval harness** (30–50 known-answer questions) + adversarial/red-team set

**`[M0]` boundary:** read-only. No writes, no UI app, no synthesis ownership — answer synthesis happens in the calling AI client off the returned evidence.
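
As a rough illustration of the chunking and metadata items in the in-scope list, the sketch below packs whole paragraphs of one page into passages and copies document metadata onto each passage. It is a minimal sketch under stated assumptions, not the build's chunker: the `Chunk` fields follow §11, while `chunk_page`, the `max_chars` limit, the paragraph-packing policy, and the `chunk_id` format are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    """One retrievable passage; field names follow the Chunk entity in §11."""
    chunk_id: str
    document_id: str
    page: int
    char_span: tuple[int, int]  # (start, end) offsets into the page text
    text: str
    metadata: dict = field(default_factory=dict)


def chunk_page(document_id: str, page: int, page_text: str,
               doc_metadata: dict, max_chars: int = 800) -> list[Chunk]:
    """Pack whole paragraphs into chunks of roughly max_chars characters.

    Assumed policy: paragraphs are separated by one blank line, and a single
    paragraph longer than max_chars is kept whole rather than cut.
    """
    # Record (start, end) offsets of every non-empty paragraph.
    spans = []
    offset = 0
    for para in page_text.split("\n\n"):
        if para.strip():
            spans.append((offset, offset + len(para)))
        offset += len(para) + 2  # skip the two-character separator

    chunks: list[Chunk] = []

    def emit(start: int, end: int) -> None:
        chunks.append(Chunk(
            chunk_id=f"{document_id}:p{page}:{start}-{end}",  # hypothetical ID scheme
            document_id=document_id,
            page=page,
            char_span=(start, end),
            text=page_text[start:end],
            metadata={**doc_metadata, "page": page},  # inherited + chunk-level
        ))

    group_start = group_end = None
    for start, end in spans:
        # Close the current group if adding this paragraph would exceed the limit.
        if group_start is not None and end - group_start > max_chars:
            emit(group_start, group_end)
            group_start = None
        if group_start is None:
            group_start = start
        group_end = end
    if group_start is not None:
        emit(group_start, group_end)
    return chunks
```

Keeping `char_span` and `page` on every chunk is what later lets a returned passage point back to its exact location in the source document.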

### Later releases (see §17) — scope summary

| Release | In scope | Key boundary |
|---|---|---|
| `[M1]` Ingestion | Upload/batch endpoints, required metadata, index refresh, duplicate detection, failed-ingestion queue, deprecation, audit log | Adds controlled writes; still no review UI |
| `[M2]` Citation agent | Query decomposition, multi-pass retrieval, cross-reference/contradiction/gap detection, citation validation, confidence scoring, structured evidence packets, safe-refusal guardrails | Headless — still called through existing tools, not a new app |
| `[M3]` ReviewOS app | Projects/matters, review queues, human approve/edit/reject/escalate, output templates, collaboration, analytics, admin/RBAC, exports, SSO | Only built after usage proves which workflows to standardize |
| `[X1]` Multimodal | OCR, table/figure extraction, visual-heavy flagging, page screenshots | Deferred unless corpus demands it |
| `[X2]` Thinking model | Thinking agent that loops with the search tool and citation agent to deepen search on a research task | Needs `[M0]`, `[M1]`, `[M2]` first |
| `[X3]` Retrieval optimization | Query-log analysis, chunking/metadata tuning, embedding eval, auto-eval generation | Needs `[M0]`/`[M2]` logs + eval failures first |

---

## 7. Out of Scope

**Out of scope for `[M0]` specifically** (deferred to the tagged release):
- Document upload/editing by users → `[M1]`
- Answer synthesis, multi-doc reasoning, contradiction detection → `[M2]`
- Review UI, queues, approvals, memo/template generation, exports → `[M3]`
- Multimodal understanding of charts/figures/scans → `[X1]`
- Thinking agent to support search rigor → `[X2]`
- Self-tuning retrieval → `[X3]`

**Out of scope for the platform entirely (all releases — cut, not deferred):**
- Fully autonomous legal/regulatory/scientific judgment
- Privilege finalization without human review
- Broad eDiscovery-platform replacement
- Deep per-DMS integrations (API/upload/export is the interface)
- Model fine-tuning (RAG + hybrid retrieval + structured prompting is the approach)
- Automated external filing / regulatory production
- Autonomous client or stakeholder communication
- Integration with third-party messaging platforms (e.g., Slack or Teams)

---

## 8. Functional Requirements

Each requirement carries a release tag and a MoSCoW priority *within its release*. `[M0]` is enumerated fully; later releases are specced at requirement granularity for roadmap clarity.

### `[M0]` — Corpus API + MCP Server

| ID | Requirement | Priority |
|---|---|---|
| FR-001 | Ingest the existing ~50K-document corpus | Must |
| FR-002 | Parse supported types (PDF, DOCX, TXT, spreadsheets) into text + structure | Must |
| FR-003 | Chunk documents into retrievable passages | Must |
| FR-004 | Extract & store metadata (doc ID, title, type, source, date, page, section) | Must |
| FR-005 | Create vector embeddings for chunks (semantic index) | Must |
| FR-006 | Maintain a keyword/BM25 index for exact terms, IDs, clauses, dates, refs | Must |
| FR-007 | Perform hybrid retrieval (semantic + keyword) with configurable weighting | Must |
| FR-008 | Support metadata filters (type, date range, source, project, confidentiality, study/compound) | Must |
| FR-009 | Expose a read-only retrieval API | Must |
| FR-010 | Expose ≥1 MCP-compatible read-only retrieval tool | Must |
| FR-011 | Return structured citation-first evidence objects (doc ID, title, page, chunk, score, metadata, source text) | Must |
| FR-012 | Enforce document-level permissions; no citation leakage across restricted groups | Must |
| FR-013 | Log query, retrieved chunks, latency, errors, tool used, session | Must |
| FR-014 | Log 100% of failed parses with reason | Must |
| FR-015 | Ship a golden-set eval harness (30–50 known-answer questions) | Must |
| FR-016 | Ship an adversarial/red-team prompt set that must fail safely | Must |
| FR-017 | Capture answer-usefulness rating + structured failure reason when offered | Should |
| FR-018 | Provide an indexing-health status page/report | Should |

### `[M1]` — Ingestion

| ID | Requirement | Priority |
|---|---|---|
| FR-101 | Upload endpoint for supported file types | Must |
| FR-102 | Batch ingestion of document sets | Must |
| FR-103 | Capture required metadata on upload (project, type, source, date, owner) | Must |
| FR-104 | Refresh indexes (parse→chunk→embed→add) without full reindex | Must |
| FR-105 | Detect exact/near-duplicate documents | Should |
| FR-106 | Failed-ingestion queue visible to admin/uploader | Must |
| FR-107 | Deprecate/archive/remove documents | Must |
| FR-108 | Ingestion audit log (uploader, timestamp, source, status) | Must |

### `[M2]` — Headless Citation + Cross-Reference Agent

| ID | Requirement | Priority |
|---|---|---|
| FR-201 | Decompose complex questions into sub-queries | Must |
| FR-202 | Multi-pass retrieval (search→inspect→refine→search) | Must |
| FR-203 | Detect cross-references, corroboration, contradictions, gaps | Must |
| FR-204 | Validate that material claims map to source passages | Must |
| FR-205 | Generate structured evidence packets (answer, citations, assumptions, contradictions, gaps, next-search) | Must |
| FR-206 | Assign confidence scores distinguishing strong vs weak evidence | Must |
| FR-207 | Guardrails: refuse/flag unsupported conclusions | Must |
| FR-208 | Route low-confidence/high-risk items for expert review | Must |

### `[M3]` — ReviewOS Application

| ID | Requirement | Priority |
|---|---|---|
| FR-301 | Create review project/matter workspaces | Must |
| FR-302 | Review queues routed by relevance/risk/confidence/issue | Must |
| FR-303 | Human approve / edit / reject / escalate AI findings | Must |
| FR-304 | Output templates (memo, evidence table, review log) | Must |
| FR-305 | Full audit trail (source passage → output → user action → timestamp) | Must |
| FR-306 | Role-based access control | Must |
| FR-307 | Team collaboration & assignment | Should |
| FR-308 | Workflow analytics (cycle time, acceptance, throughput, QA) | Should |
| FR-309 | Exports (PDF, DOCX, CSV, JSON) | Must |
| FR-310 | Document retention rules | Should |

### `[X1]`/`[X3]` — Extensions

| ID | Requirement | Priority | Release |
|---|---|---|---|
| FR-401 | OCR scanned PDFs; extract tables/figures; flag visual-heavy docs | Could | `[X1]` |
| FR-402 | Analyze query logs to surface poor-retrieval queries | Could | `[X3]` |
| FR-403 | Tune chunking/metadata/embeddings/search weights from feedback | Could | `[X3]` |
| FR-404 | Auto-propose new golden-set questions from real usage | Could | `[X3]` |

---

## 9. Non-Functional Requirements

| ID | Requirement | Priority | First enforced |
|---|---|---|---|
| NFR-001 | Every returned passage traces to exact source (doc, page, chunk) | Must | `[M0]` |
| NFR-002 | Every material claim in synthesis must trace to a cited passage | Must | `[M2]` |
| NFR-003 | Preserve document confidentiality; enforce permissions at retrieval, no cross-scope leakage | Must | `[M0]` |
| NFR-004 | `[M0]` is read-only; writes introduced only at `[M1]` behind auth | Must | `[M0]` |
| NFR-005 | Retrieval response < 10 sec (excl. model synthesis) | Should | `[M0]` |
| NFR-006 | Handle failed/unreadable documents gracefully; never silently drop | Must | `[M0]` |
| NFR-007 | Separate model confidence from legal/research correctness in all outputs | Must | `[M0]` |
| NFR-008 | Log prompts, queries, outputs, citations, edits, approvals for defensibility | Must | `[M0]` → full at `[M3]` |
| NFR-009 | Components replaceable; LlamaIndex is an accelerator, not a lock-in; LLM provider abstracted behind an interface | Must | `[M0]` |
| NFR-010 | Eval sets gate releases; retrieval/citation quality measured before ship | Should | `[M0]` |
| NFR-011 | Indexing completes for >95% of supported documents | Should | `[M0]` |
| NFR-012 | Architecture supports permissioning even where full RBAC isn't yet built | Must | `[M0]` |

---

## 10. User Stories & Acceptance Criteria

### `[M0]`

| User story | Acceptance criteria |
|---|---|
| As an **AI client**, I can call a retrieval tool over MCP so I can ground answers in the corpus. | Tool is discoverable via MCP; accepts a query + optional filters; returns structured evidence objects; read-only (no write methods exposed). |
| As a **research analyst** (via Claude/ChatGPT), I can ask a natural-language question and get the supporting passages. | Hybrid retrieval returns ranked passages; each includes doc ID, title, page, chunk, score, source text; top-5 contains a correct source for ≥80% of golden-set questions. |
| As an **attorney/reviewer**, I can retrieve exact terms (study IDs, clauses, dates, regulation refs). | Keyword/BM25 path returns exact-match passages that pure semantic search would miss; verified against keyword cases in the golden set. |
| As a **data/IT owner**, I can ensure users only retrieve permitted documents. | Queries scoped to a restricted user return zero passages from out-of-scope documents; no citation leakage in logs or responses. |
| As an **AI strategist**, I can measure retrieval quality before we build an app. | Eval harness runs the 30–50 golden questions and reports top-k accuracy, citation coverage, latency; adversarial set produces safe failures (no fabricated/unauthorized answers). |
| As an **admin**, I can see indexing and parse health. | Status report shows % indexed, failed-parse count with reasons; 100% of failures logged. |

### `[M1]`

| User story | Acceptance criteria |
|---|---|
| As an authorized user, I can add/update documents without engineering. | Upload (single + batch) succeeds for supported types; doc searchable < 10 min; failed items appear in the queue with reasons. |
| As an admin, I can trust the corpus stays current and clean. | Duplicates flagged; deprecated docs excluded from retrieval; ingestion audit log records uploader, timestamp, source, status. |

### `[M2]`

| User story | Acceptance criteria |
|---|---|
| As a reviewer, I get an evidence-backed synthesis, not just passages. | Evidence packet returns answer + citations + assumptions + contradictions + gaps; ≥95% of factual claims cite a source passage; unsupported material claims < 5%. |
| As an SME, low-confidence/high-risk findings are surfaced for me. | 100% of low-confidence conclusions flagged; routing list filterable by confidence/risk/issue. |
| As a reviewer, the agent refuses to overreach. | Adversarial prompts (unsupported claim, out-of-scope doc, "ignore citations") are refused or flagged, never answered with fabricated content. |

### `[M3]`

| User story | Acceptance criteria |
|---|---|
| As a review manager, I can create a project and organize work by matter/objective. | Project has name, objective, members, corpus scope; queue populated from retrieval/agent outputs. |
| As a reviewer, I can approve/edit/reject/escalate AI findings. | Each action recorded with user + timestamp; edits versioned; escalation notifies owner. |
| As a project lead, I can export a defensible cited memo + evidence table. | Export includes summaries, issues, citations, reviewer approvals in PDF/DOCX/CSV/JSON. |
| As an admin, I can audit every AI output and human edit. | Audit log links source passage → output → user action → timestamp; immutable record. |

---

## 11. Data Model

Core entities. `[M0]` establishes Document, Chunk, Citation, Query/RetrievalLog, EvalItem. Later releases extend (tagged).

```
Document            [M0]
  document_id (pk)        // stable corpus ID
  title, doc_type, source_system
  date, author/custodian
  confidentiality_level   // drives permission scoping
  access_scope[]          // user/project groups permitted
  page_count, ingest_status, parse_status
  checksum                // dedup [M1]
  version, deprecated_at  // [M1]

Chunk               [M0]
  chunk_id (pk)
  document_id (fk)
  page, section, char_span
  text
  embedding (vector)      // semantic index
  // keyword/BM25 index built over text
  metadata {}             // inherited + chunk-level

Citation / EvidenceObject   [M0]   // the API/MCP response contract
  query_id (fk)
  document_id, chunk_id (fk)
  title, page, source_text
  score                   // hybrid relevance
  metadata { doc_type, date, source, confidentiality }

QueryLog / RetrievalLog     [M0]
  query_id (pk)
  query_text, filters {}
  user/session, tool_used
  retrieved_chunk_ids[]
  latency_ms, error
  usefulness_rating, failure_reason   // [M0] Should
  opened_citations[]                  // engagement signal

EvalItem            [M0]
  eval_id (pk)
  question
  expected_document_ids[], expected_chunk_ids[]
  required_metadata {}
  answer_criteria
  set_type            // "golden" | "adversarial"

EvidencePacket      [M2]   // structured synthesis output
  packet_id (pk)
  query_id (fk)
  answer_text
  citations[] (-> Citation)
  assumptions[], contradictions[], gaps[]
  next_search_suggestions[]
  confidence_score
  routing_status      // auto | flagged_for_expert

Project / Matter    [M3]
  project_id (pk)
  name, objective, corpus_scope
  members[] (-> User w/ role)

ReviewItem          [M3]
  item_id (pk)
  project_id, packet_id (fk)
  queue_status        // pending | approved | edited | rejected | escalated
  assignee

AuditRecord         [M0 partial -> M3 full]
  record_id (pk)
  source_chunk_id -> output_ref -> user_action -> timestamp
  actor, action_type
  immutable
```

**Notes:** `confidentiality_level` + `access_scope[]` on Document are the permission backbone (NFR-003/012) and exist from `[M0]` even before full RBAC. `Citation` is the API contract — downstream models synthesize only from these objects. `AuditRecord` starts as query/retrieval logging in `[M0]` and becomes the full source→output→action chain at `[M3]`.
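
To make the permission backbone concrete, the sketch below filters retrieval results against `access_scope` and `confidentiality_level` before anything is returned. It is a minimal sketch, assuming a caller is described by a set of groups plus a maximum confidentiality level; the level names in `LEVEL_RANK`, the clearance concept, and the helper names `is_permitted` and `filter_results` are hypothetical and not taken from MediGen's actual policy.

```python
from dataclasses import dataclass

# Hypothetical ordering of confidentiality levels, lowest to highest.
LEVEL_RANK = {"public": 0, "internal": 1, "confidential": 2, "restricted": 3}


@dataclass(frozen=True)
class DocumentAccess:
    """Permission fields of the Document entity."""
    document_id: str
    confidentiality_level: str
    access_scope: frozenset[str]  # user/project groups permitted


def is_permitted(doc: DocumentAccess, user_groups: set[str],
                 user_clearance: str) -> bool:
    """Allow access only if a group matches AND the clearance is high enough."""
    in_scope = bool(doc.access_scope & user_groups)
    cleared = LEVEL_RANK[doc.confidentiality_level] <= LEVEL_RANK[user_clearance]
    return in_scope and cleared


def filter_results(results: list[dict], access_index: dict[str, DocumentAccess],
                   user_groups: set[str], user_clearance: str) -> list[dict]:
    """Drop every result whose document the caller may not see."""
    permitted = []
    for result in results:
        doc = access_index.get(result["document_id"])
        # Fail closed: a document with no access record is never returned.
        if doc is not None and is_permitted(doc, user_groups, user_clearance):
            permitted.append(result)
    return permitted
```

Filtering before the response is assembled (and before logging) keeps out-of-scope passages from reaching the calling AI client or the logs.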

---

## 12. System Architecture

Model-agnostic RAG with hybrid search. LlamaIndex is the implementation accelerator (parsing, ingestion pipeline, indices, retrievers, query engine, tool abstraction); the strategic asset is the corpus-access layer, and every component stays replaceable (NFR-009).

```
Existing ~50K-document corpus (SharePoint/Drive + DMS + regulatory/quality systems)
        │
        ▼
[Ingestion]  LlamaParse / readers → parse to text + structure        [M0 one-time; M1 ongoing]
        │
        ▼
[Transform]  IngestionPipeline: chunk (section-aware) + metadata      [M0]
        │
        ├─► Vector index (embeddings)        ─┐
        └─► Keyword/BM25 index               ─┤
                                              ▼
[Retrieval]  Hybrid retriever (semantic + keyword) + metadata filters  [M0]
             → reranking (optional) → citation-first evidence objects
        │
        ├─► [Service]  Read-only Retrieval API (REST)                  [M0]
        └─► [Service]  MCP server exposing read-only retrieval tool(s) [M0]
        │
        ▼
[AI clients]  Claude / ChatGPT / internal copilot synthesize from evidence
        │
        ▼
[Citation/Synthesis agent]  query decomposition, multi-pass retrieval,
        cross-reference, citation validation, evidence packets         [M2]
        │
        ▼
[ReviewOS app]  projects, queues, approvals, templates, exports,
        analytics, RBAC, full audit                                    [M3]

Cross-cutting [M0]: permission enforcement at retrieval · query/parse logging ·
        eval harness (golden + adversarial) · indexing-health status
Extensions: multimodal parsing [X1] · retrieval-optimization agent over logs [X3]
```

| Layer | Function | LlamaIndex primitive (illustrative) | Release |
|---|---|---|---|
| Ingestion | Parse documents to structured text | `LlamaParse`, readers | `[M0]`/`[M1]` |
| Transform | Chunk + extract metadata | `IngestionPipeline`, node parsers, extractors | `[M0]` |
| Index | Semantic + lexical indices | `VectorStoreIndex` + BM25/keyword retriever | `[M0]` |
| Retrieval | Hybrid + filters + rerank | `QueryFusionRetriever` / hybrid retriever, metadata filters | `[M0]` |
| Service | API + MCP tool exposure | query engine wrapped as tool; MCP adapter | `[M0]` |
| Synthesis | Grounded multi-doc reasoning | agent / multi-step query engine | `[M2]` |
| Application | Workflow, audit, exports | (custom app server) | `[M3]` |

**Key decisions:** hybrid search from day one (embeddings alone miss exact molecule names, study IDs, clauses, dates, regulation refs); LLM and embedding providers behind interfaces (no named-model lock-in); `[M0]` read-only so security/operational risk is minimal; permissioning is an architecture constraint from `[M0]` even before full RBAC.
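
One common way to combine a semantic ranking with a keyword ranking is reciprocal rank fusion (RRF). The sketch below shows a weighted variant in plain Python to illustrate the "configurable weighting" idea in FR-007. It is not necessarily the fusion method the build will use: `weighted_rrf`, `semantic_weight`, and the constant `k=60` are assumptions chosen for illustration.

```python
def weighted_rrf(semantic: list[str], keyword: list[str],
                 semantic_weight: float = 0.5, k: int = 60,
                 top_n: int = 5) -> list[tuple[str, float]]:
    """Fuse two ranked lists of chunk IDs with weighted reciprocal rank fusion.

    Each list is ordered best-first. A chunk's fused score is the sum, over
    the lists it appears in, of weight / (k + rank).
    """
    keyword_weight = 1.0 - semantic_weight
    scores: dict[str, float] = {}
    for weight, ranking in ((semantic_weight, semantic), (keyword_weight, keyword)):
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + weight / (k + rank)
    # Sort by score (highest first); break ties by chunk ID for stable output.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top_n]


# Usage: "c2" appears in both lists, so it outranks chunks found by one path only.
fused = weighted_rrf(["c1", "c2", "c3"], ["c4", "c2"])
print([chunk_id for chunk_id, _ in fused])
```

Rank-based fusion sidesteps the fact that embedding similarities and BM25 scores live on different scales, which is one reason it is a frequent default for hybrid search.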

---

## 13. AI Behavior Specification

### 13.1 Citation contract `[M0]`

Every retrieval response is structured evidence, not free text. This is the product contract that lets any downstream model synthesize safely.

```json
{
  "query": "...",
  "filters": { "doc_type": "...", "date_range": "..." },
  "results": [
    {
      "document_id": "...",
      "title": "...",
      "page": 12,
      "chunk_id": "...",
      "text": "...",
      "score": 0.87,
      "metadata": { "doc_type": "...", "date": "...", "source": "...", "confidentiality": "..." }
    }
  ]
}
```
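
A calling client can check the contract before synthesizing from it. The sketch below parses a response of the shape shown above and rejects any result that lacks a citation field. It is a minimal sketch: the `Evidence` class, `parse_response`, and the choice to reject the whole response on a missing field are assumptions, not part of the specified API.

```python
import json
from dataclasses import dataclass

# Fields every result must carry, taken from the JSON shape above.
REQUIRED_FIELDS = ("document_id", "title", "page", "chunk_id",
                   "text", "score", "metadata")


@dataclass(frozen=True)
class Evidence:
    document_id: str
    title: str
    page: int
    chunk_id: str
    text: str
    score: float
    metadata: dict


def parse_response(raw: str) -> list[Evidence]:
    """Parse a retrieval response; raise ValueError if a result is incomplete."""
    payload = json.loads(raw)
    evidence = []
    for position, item in enumerate(payload["results"]):
        missing = [name for name in REQUIRED_FIELDS if name not in item]
        if missing:
            raise ValueError(f"result {position} is missing fields: {missing}")
        evidence.append(Evidence(**{name: item[name] for name in REQUIRED_FIELDS}))
    return evidence
```

Rejecting incomplete results at the boundary means a downstream model never sees a passage it cannot cite.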

### 13.2 Confidence & routing `[M2]`

- Confidence is computed from retrieval scores, source agreement, and coverage — and is reported **separately from correctness** (NFR-007). A high-confidence retrieval is not a legal/scientific conclusion.
- Low-confidence or high-risk conclusions are flagged and routed for expert review (FR-208); 100% of low-confidence conclusions must be flagged.
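
The two rules above can be sketched as a scoring function plus a routing decision. This is an illustration only: the PRD names the three inputs (retrieval scores, source agreement, coverage) but not how they are combined, so the weights, the saturation at three agreeing sources, the `0.7` threshold, and the function names are all assumptions. The returned values match the `routing_status` field in §11.

```python
def combine_confidence(retrieval_scores: list[float], agreeing_sources: int,
                       covered: int, total: int,
                       weights: tuple[float, float, float] = (0.5, 0.25, 0.25)) -> float:
    """Blend retrieval strength, source agreement, and coverage into [0, 1].

    Assumes retrieval scores are already normalized to the range 0..1.
    """
    if not retrieval_scores or total == 0:
        return 0.0
    retrieval = sum(retrieval_scores) / len(retrieval_scores)
    agreement = min(agreeing_sources, 3) / 3  # saturates at three agreeing sources
    coverage = covered / total                # share of sub-questions with evidence
    w_retrieval, w_agreement, w_coverage = weights
    return w_retrieval * retrieval + w_agreement * agreement + w_coverage * coverage


def routing_status(confidence: float, high_risk: bool,
                   threshold: float = 0.7) -> str:
    """Flag anything low-confidence or high-risk for expert review."""
    if high_risk or confidence < threshold:
        return "flagged_for_expert"
    return "auto"
```

Note that `high_risk` overrides the score: a strongly supported finding on a high-risk topic still goes to an expert, which keeps confidence separate from correctness.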

### 13.3 Guardrails `[M0]` flag-only → `[M2]` enforce

The system must fail safely on adversarial inputs (FR-016/207). The red-team set includes prompts that:
- request unsupported claims,
- request privileged/sensitive material,
- request documents outside access scope,
- instruct the model to ignore citations,
- request broad legal conclusions without evidence,
- request nonexistent documents.

Expected behavior: refuse or flag; never fabricate; never leak across permission scope.

### 13.4 Failed-retrieval feedback loop `[M0]`

```
User marks answer not useful
        ↓
Capture reason: missing doc | wrong doc | weak citation | bad summary | too broad | too narrow
        ↓
Log failure → fuel for [M2] tuning and [X3] optimization agent
```
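
The loop above can be sketched as a small capture function that only accepts the listed reasons. This is a minimal sketch: `FailureReason`, `Feedback`, and `record_feedback` are hypothetical names, and the in-memory list stands in for whatever log store the build uses.

```python
import json
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class FailureReason(Enum):
    """The structured reasons listed in the feedback loop."""
    MISSING_DOC = "missing doc"
    WRONG_DOC = "wrong doc"
    WEAK_CITATION = "weak citation"
    BAD_SUMMARY = "bad summary"
    TOO_BROAD = "too broad"
    TOO_NARROW = "too narrow"


@dataclass
class Feedback:
    query_id: str
    useful: bool
    failure_reason: Optional[FailureReason] = None


def record_feedback(log: list[str], feedback: Feedback) -> None:
    """Append one JSON line per rating; a 'not useful' rating needs a reason."""
    if not feedback.useful and feedback.failure_reason is None:
        raise ValueError("a failure reason is required when an answer is not useful")
    entry = {
        "query_id": feedback.query_id,
        "useful": feedback.useful,
        "failure_reason": (feedback.failure_reason.value
                           if feedback.failure_reason else None),
    }
    log.append(json.dumps(entry))
```

Restricting reasons to a fixed set keeps the failure log countable by category, which is what later tuning work needs.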

---

## 14. Evaluation Plan

Evals gate releases (NFR-010). Build the golden set **before** indexing all 50K documents so quality is measurable from day one.

### 14.1 Golden set `[M0]`

30–50 known-answer questions, each with: expected document IDs, expected passages/chunks, required metadata, and acceptable-answer criteria. Plus an adversarial set (§13.3) that must fail safely.
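
Top-k accuracy over such a set can be computed in a few lines. The sketch below is illustrative: it counts a question as a hit when any expected document appears in the first `k` results, which is one reasonable reading of "correct source in top-5". The `top_k_accuracy` name, the dictionary keys, and the `retrieve` callable are assumptions standing in for the real harness and retrieval API.

```python
from typing import Callable


def top_k_accuracy(golden: list[dict],
                   retrieve: Callable[[str], list[dict]],
                   k: int = 5) -> float:
    """Share of golden questions with an expected document in the top-k results.

    Each golden item needs "question" and "expected_document_ids"; retrieve()
    returns results best-first, each carrying a "document_id".
    """
    if not golden:
        raise ValueError("golden set is empty")
    hits = 0
    for item in golden:
        top_results = retrieve(item["question"])[:k]
        returned_docs = {result["document_id"] for result in top_results}
        if returned_docs & set(item["expected_document_ids"]):
            hits += 1
    return hits / len(golden)
```

A stricter variant could match on `expected_chunk_ids` instead of document IDs when passage-level precision matters.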

### 14.2 Metrics, thresholds, gates

| Metric | Definition | Target | Release |
|---|---|---|---|
| Top-5 retrieval accuracy | Correct source in top-5 results | > 80% | `[M0]` |
| Citation metadata coverage | Retrieved chunks with source/page metadata | > 95% | `[M0]` |
| Retrieval latency | Time to retrieval response, excl. synthesis | < 10 sec | `[M0]` |
| Indexing completion | Supported docs successfully indexed | > 95% | `[M0]` |
| Failed-parse visibility | Failed docs logged | 100% | `[M0]` |
| Adversarial safe-fail rate | Red-team prompts refused/flagged | 100% | `[M0]` |
| Factual-claim citation rate | Material claims tied to a source passage | > 95% | `[M2]` |
| Unsupported material claim rate | Claims with no supporting passage | < 5% | `[M2]` |
| Cross-reference usefulness | Reviewer acceptance of detected links | > 70% | `[M2]` |
| Time to cited research packet | vs manual baseline | 50%+ faster | `[M2]` |
| First-pass review time reduction | vs baseline | 30–50% | `[M3]` |
| Reviewer acceptance of AI outputs | Accepted w/ minor edits | > 70% | `[M3]` |
| Rework reduction | vs baseline | 20–40% | `[M3]` |

**Pilot pass/fail:** `[M0]` ships only if top-5 ≥ 80%, citation coverage ≥ 95%, latency < 10s, and adversarial safe-fail = 100%.
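
The gate can be expressed as an explicit check so that a release decision is reproducible. This is a sketch under assumptions: `m0_gate` and `m0_ships` are hypothetical names, rates are passed as fractions between 0 and 1, and the PRD does not say which latency statistic applies, so the caller decides what `latency_sec` represents (for example a worst case or a percentile).

```python
def m0_gate(top5_accuracy: float, citation_coverage: float,
            latency_sec: float, safe_fail_rate: float) -> dict[str, bool]:
    """Evaluate each pilot criterion separately so failures are visible."""
    return {
        "top5_accuracy": top5_accuracy >= 0.80,
        "citation_coverage": citation_coverage >= 0.95,
        "latency": latency_sec < 10.0,
        "adversarial_safe_fail": safe_fail_rate == 1.0,
    }


def m0_ships(**measurements: float) -> bool:
    """The release ships only if every criterion passes."""
    return all(m0_gate(**measurements).values())


# Usage: one unsafe red-team response is enough to block the release.
print(m0_ships(top5_accuracy=0.84, citation_coverage=0.97,
               latency_sec=3.2, safe_fail_rate=0.98))
```

Returning per-criterion results rather than a single boolean makes it clear which threshold blocked a release.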

---

## 15. Security & Governance

| Control | Requirement | Release |
|---|---|---|
| Read-only first | `[M0]` exposes no write methods; writes only at `[M1]` behind auth | `[M0]` |
| Permission enforcement | Document-level `access_scope`/`confidentiality_level` enforced at retrieval; no cross-scope citation leakage | `[M0]` |
| PII/PHI & privilege | Sensitive content flagged for human decision; privilege never finalized by AI | `[M0]` flag → human |
| Audit logging | Query→passage logging from `[M0]`; full source→output→action→timestamp chain at `[M3]`; immutable records | `[M0]`→`[M3]` |
| Confidence ≠ correctness | Model confidence reported separately from legal/research correctness | `[M0]` |
| Data residency / retention | Retention rules and deprecation; documents removable from indices | `[M1]`/`[M3]` |
| RBAC | Full role-based access control | `[M3]` |
| Provider abstraction | LLM/embedding providers swappable; no vendor lock-in; supports client data-policy constraints on hosting | `[M0]` |

---

## 16. Analytics & Success Metrics

Instrumentation exists from `[M0]` because usage data is the evidence base that decides what to productize at `[M3]`.

| Signal | Captured | Used for |
|---|---|---|
| Query text + filters | `[M0]` | Demand patterns, common questions |
| Retrieved chunks + scores | `[M0]` | Retrieval-quality analysis |
| Latency, errors | `[M0]` | Performance, reliability |
| Most-queried documents | `[M0]` | Corpus value hotspots |
| Failed retrievals + reasons | `[M0]` | `[M2]` tuning, `[X3]` optimization |
| Usefulness ratings | `[M0]` | Quality trend, eval expansion |
| Opened citations | `[M0]` | Trust/engagement signal |
| Repeated output requests | `[M2]`/`[M3]` | Which workflows to standardize at `[M3]` |
| Cycle time, acceptance, rework | `[M3]` | ROI realization |

---

## 17. Roadmap

| Phase | Timeline | Release | Goal | Key outputs |
|---|---|---|---|---|
| 0 — Corpus assessment | Week 1 | — | Understand doc quality, metadata, access, target questions | Corpus readiness report, metadata map, 30–50 golden questions, `[M0]` build plan |
| 1 — `[M0]` Corpus API + MCP | Weeks 2–3 | `[M0]` | Make corpus searchable & callable from existing AI tools | Working API + MCP tool, eval results, pilot access |
| 2 — `[M1]` Ingestion | Week 4 | `[M1]` | Let authorized users maintain the corpus | Upload/ingestion workflow, status, refreshed indexing |
| 3 — `[M2]` Citation agent | Weeks 5–6 | `[M2]` | Grounded synthesis & multi-doc reasoning | Cited research agent, evidence packets, quality dashboard |
| 4 — App discovery | Weeks 7–8 | — | Use usage data to decide app workflows | Opportunity backlog, prioritized workflows, `[M3]` PRD, go/no-go |
| 5 — `[M3]` ReviewOS app | Weeks 9–12 | `[M3]` | Build the dedicated workflow app | ReviewOS application, standardized workflows, rollout plan |
| Later | — | `[X1]`/`[X2]`/`[X3]` | Multimodal coverage; thinking-model search; self-improving retrieval | Deferred until corpus/logs justify |

**Trade-off posture:** *Build the substrate first. Productize the workflow second.* The full review application is deliberately not built until `[M0]`–`[M2]` prove retrieval value, adoption, and repeated output patterns.

---

## 18. Open Questions

1. Real corpus composition: actual mix of file types, scan ratio (drives `[X1]` urgency), and metadata availability across SharePoint/DMS/quality systems.
2. Permission model granularity: document-level only for `[M0]`, or section/field-level needed for sensitive submissions?
3. Hosting / data-policy constraints: can a commercial LLM API be used, or is on-prem/private-model hosting required by client data policy?
4. Which current AI chat tool does MediGen use?
5. Baseline measurement: can we instrument the current manual process to capture real time-to-source / rework before `[M0]` launch?

---

## 19. Appendix — Assumptions

- **Naming:** MediGen = hypothetical regulated life-sciences client; ReviewOS = the product. 
- **Corpus:** ~50K documents already existing across approved repositories (regulatory submissions, clinical study reports, quality records, adverse-event narratives, supplier agreements, outside-counsel memos, literature). Mostly static/semi-static.
- **Team / cost base:** 25-person US legal/regulatory/research review team; ~29,600 annual review hours; ~$4.1–4.5M fully loaded annual review cost. ROI logic: `review hours = team × hrs/FTE × %time; cost = hours × blended loaded rate; assistable pool = cost × assistable share; realized upside = assistable × productivity capture × adoption`.
- **Value range:** 60–70% of review hours AI-assistable; 25–40% of assistable captured → ~$750K–$1.5M+ annual productivity-equivalent upside before cycle-time, rework, consistency, and auditability gains.
- **Tech:** model-agnostic RAG, hybrid (semantic + keyword) search, LlamaIndex as replaceable accelerator, LLM/embedding providers abstracted.
- **Stat freshness:** wage/cost figures are illustrative assumptions; verify BLS wage data and any industry pricing before final external copy.
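
The ROI logic in the assumptions above can be worked through directly. The sketch below applies the stated formula to the ends of the stated ranges; `realized_upside` is a hypothetical helper, and adoption is set to 100% as an explicit assumption because the assumptions list gives no adoption figure.

```python
def realized_upside(annual_cost: float, assistable_share: float,
                    productivity_capture: float, adoption: float = 1.0) -> float:
    """realized upside = cost x assistable share x productivity capture x adoption."""
    assistable_pool = annual_cost * assistable_share
    return assistable_pool * productivity_capture * adoption


# Low end: $4.1M cost, 60% assistable, 25% captured (adoption assumed 100%).
low = realized_upside(4_100_000, 0.60, 0.25)
# High end: $4.5M cost, 70% assistable, 40% captured (adoption assumed 100%).
high = realized_upside(4_500_000, 0.70, 0.40)
print(f"${low:,.0f} to ${high:,.0f}")
```

With these inputs the formula gives roughly $615K to $1.26M, which sits somewhat below the ~$750K–$1.5M+ range quoted in the value-range assumption; the exact inputs behind that range are not stated in this document, so they are worth confirming alongside the wage data.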