This document outlines the design principles and justifications for the recommended JSON schemas. The schemas are built to be scalable, traceable, and easy to process downstream by machine learning models and data pipelines, particularly Retrieval-Augmented Generation (RAG) systems such as those built on Amazon Bedrock.

Schema 1: Interim Extraction Schema

This is the direct output of the Extractor. It is deliberately permissive and detailed: it captures everything possible from the source file before any cleaning, so its content is expected to be messy.

Structure:

{
  "fileMetadata": { /* ... basic file info ... */ },
  "contentChunks": [
    {
      "chunkId": "string",
      "chunkType": "prose" | "table",
      "rawContent": "string", // <-- Holds messy text or stringified table
      "contextualMetadata": {
        "pageNumber": "number"
      },
      "confidenceScore": "float" // <-- Crucial for validation
    }
  ]
}
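
As a rough illustration of how a downstream stage might check an Interim chunk before using it, the Python sketch below parses one contentChunks entry into a typed object. The names InterimChunk and parse_interim_chunk are hypothetical, and the [0, 1] range check on confidenceScore is an assumption; the schema itself only says the field is a float.

```python
from dataclasses import dataclass
from typing import Any, Dict


@dataclass
class InterimChunk:
    """Typed view of one contentChunks entry (hypothetical helper class)."""
    chunkId: str
    chunkType: str
    rawContent: str
    pageNumber: int
    confidenceScore: float


def parse_interim_chunk(raw: Dict[str, Any]) -> InterimChunk:
    """Check one contentChunks entry and return it as a typed object."""
    chunk_type = raw["chunkType"]
    if chunk_type not in ("prose", "table"):
        raise ValueError(f"unknown chunkType: {chunk_type!r}")

    score = float(raw["confidenceScore"])
    # Assumption: scores are treated as values in [0, 1].
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"confidenceScore out of range: {score}")

    return InterimChunk(
        chunkId=str(raw["chunkId"]),
        chunkType=chunk_type,
        rawContent=str(raw["rawContent"]),
        pageNumber=int(raw["contextualMetadata"]["pageNumber"]),
        confidenceScore=score,
    )


# Example input with made-up values, shaped like the structure above.
example = {
    "chunkId": "chunk-001",
    "chunkType": "prose",
    "rawContent": "Some   raw text\nfrom a PDF page.",
    "contextualMetadata": {"pageNumber": 3},
    "confidenceScore": 0.95,
}
chunk = parse_interim_chunk(example)
```

Failing fast on an unknown chunkType or an out-of-range score keeps malformed Extractor output from reaching later stages silently.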

Schema 2: Normalized Schema

This is the clean, ideal representation of the document.

The Normalizer's entire job is to convert the Interim Schema into this structure.

<aside>

Recommended Normalized Data Schema

Update as needed.

{
  "document_id": "string", // Unique ID for the document
  "source_file": "string", // Original file name or S3 URI
  "metadata": {
    "author": "string | null", // May not always be available
    "keywords": ["string"],
    "status": "normalized" | "needs_review", // Set by the Validator
    "quality_score": "float" // Final score from the Validator
  },
  "content_blocks": [
    {
      "block_id": "string", // Unique ID for this chunk
      "type": "prose" | "table",
      "page_number": "number", // Added for context
      // The Normalizer decides which field to populate:
      "content": "string | null", // For prose, or a sentence describing a table/figure
      "data": [["string"]] | null // The raw structured data of a table, for reference
    }
  ]
}

</aside>
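
To make the mapping between the two schemas concrete, here is a minimal Python sketch of how a Normalizer might turn one Interim contentChunks entry into one Normalized content_blocks entry. It is not the actual Normalizer. The function name normalize_chunk is hypothetical, reusing chunkId as block_id is an assumption, and so is the idea that a stringified table arrives as a JSON array of rows (the Interim Schema does not specify the table encoding). The sketch leaves content empty for tables, although the schema also allows a descriptive sentence there.

```python
import json
import re
from typing import Any, Dict, List, Optional


def normalize_chunk(chunk: Dict[str, Any]) -> Dict[str, Any]:
    """Map one Interim contentChunks entry to one content_blocks entry."""
    content: Optional[str] = None
    data: Optional[List[List[str]]] = None

    if chunk["chunkType"] == "prose":
        # Collapse runs of whitespace left over from extraction.
        content = re.sub(r"\s+", " ", chunk["rawContent"]).strip()
    elif chunk["chunkType"] == "table":
        # Assumption: the stringified table is a JSON array of rows.
        rows = json.loads(chunk["rawContent"])
        data = [[str(cell).strip() for cell in row] for row in rows]
    else:
        raise ValueError(f"unknown chunkType: {chunk['chunkType']!r}")

    return {
        "block_id": chunk["chunkId"],  # assumption: the ID is carried over
        "type": chunk["chunkType"],
        "page_number": chunk["contextualMetadata"]["pageNumber"],
        "content": content,
        "data": data,
    }


# Example inputs with made-up values.
prose_block = normalize_chunk({
    "chunkId": "chunk-001",
    "chunkType": "prose",
    "rawContent": "Some   raw text\nfrom a PDF page. ",
    "contextualMetadata": {"pageNumber": 3},
    "confidenceScore": 0.95,
})
table_block = normalize_chunk({
    "chunkId": "chunk-002",
    "chunkType": "table",
    "rawContent": '[["Name", " Qty "], ["Widget", 4]]',
    "contextualMetadata": {"pageNumber": 4},
    "confidenceScore": 0.80,
})
```

In this sketch exactly one of content and data is populated per block, which mirrors the comment in the schema that the Normalizer decides which field to fill.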

How They Work Together

  1. The Extractor pulls data from a PDF and creates a raw, uncleaned Interim Schema object. For example, it puts the raw text of a paragraph in rawContent and assigns a confidenceScore of 0.95.