This document outlines the design principles and justifications for the recommended JSON schemas. The schemas are purpose-built to be scalable, traceable, and optimized for downstream processing by machine learning models and data pipelines, particularly Retrieval-Augmented Generation (RAG) systems such as Amazon Bedrock.
This is the direct output of the Extractor. It is deliberately permissive and detailed, capturing everything possible from the source file before any cleaning.
```json
{
  "fileMetadata": { /* ... basic file info ... */ },
  "contentChunks": [
    {
      "chunkId": "string",
      "chunkType": "prose" | "table",
      "rawContent": "string",        // <-- Holds messy text or a stringified table
      "contextualMetadata": {
        "pageNumber": "number"
      },
      "confidenceScore": "float"     // <-- Crucial for validation
    }
  ]
}
```
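Assuming a Python-based pipeline (the source does not name an implementation language), the interim schema could be modeled with dataclasses like these; the class and field names are illustrative, not part of the spec:

```python
from dataclasses import dataclass, field
from typing import Literal

@dataclass
class ContentChunk:
    """One raw chunk as emitted by the Extractor."""
    chunk_id: str
    chunk_type: Literal["prose", "table"]
    raw_content: str         # messy text, or a stringified table
    page_number: int         # lifted from contextualMetadata
    confidence_score: float  # consumed later by the Validator

@dataclass
class InterimDocument:
    """Direct Extractor output: detailed and deliberately uncleaned."""
    file_metadata: dict
    content_chunks: list[ContentChunk] = field(default_factory=list)
```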
This is the clean, ideal representation of the document. The Normalizer's entire job is to convert the Interim Schema into this structure.
```json
{
  "document_id": "string",                    // Unique ID for the document
  "source_file": "string",                    // Original file name or S3 URI
  "metadata": {
    "author": "string | null",                // May not always be available
    "keywords": ["string"],
    "status": "normalized" | "needs_review",  // Set by the Validator
    "quality_score": "float"                  // Final score from the Validator
  },
  "content_blocks": [
    {
      "block_id": "string",                   // Unique ID for this chunk
      "type": "prose" | "table",
      "page_number": "number",                // Added for context
      // The Normalizer decides which field to populate:
      "content": "string | null",             // For prose, or a sentence describing a table/figure
      "data": [["string"]] | null             // The raw structured data of a table, for reference
    }
  ]
}
```
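To make the mapping concrete, here is a minimal sketch of a Normalizer built on the dataclasses above. The function name, the UUID-based document_id, and the tab-separated table parsing are all assumptions for illustration, not part of the spec:

```python
import uuid

def normalize(interim: InterimDocument, source_file: str) -> dict:
    """Sketch: convert the Interim Schema into the Normalized Schema."""
    blocks = []
    for chunk in interim.content_chunks:
        block = {
            "block_id": chunk.chunk_id,
            "type": chunk.chunk_type,
            "page_number": chunk.page_number,
            "content": None,
            "data": None,
        }
        if chunk.chunk_type == "prose":
            block["content"] = chunk.raw_content.strip()
        else:
            # Assumption: stringified tables arrive as tab-separated rows.
            block["data"] = [row.split("\t") for row in chunk.raw_content.splitlines()]
            block["content"] = f"Table with {len(block['data'])} rows on page {chunk.page_number}"
        blocks.append(block)

    return {
        "document_id": str(uuid.uuid4()),
        "source_file": source_file,
        "metadata": {
            "author": None,          # populated when available
            "keywords": [],
            "status": "normalized",  # the Validator may change this
            "quality_score": 0.0,    # set by the Validator
        },
        "content_blocks": blocks,
    }
```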
For example, a cleanly parsed chunk has its text stored in `rawContent` and is assigned a `confidenceScore` of 0.95.
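Because the `confidenceScore` is what the Validator keys on, that step might look like the following minimal sketch; the averaging strategy and the 0.8 threshold are illustrative assumptions, and `InterimDocument` comes from the earlier sketch:

```python
REVIEW_THRESHOLD = 0.8  # illustrative cutoff, not specified by the schema

def validate(interim: InterimDocument, normalized: dict) -> dict:
    """Sketch: aggregate chunk confidence into the final quality_score and status."""
    scores = [chunk.confidence_score for chunk in interim.content_chunks]
    quality = sum(scores) / len(scores) if scores else 0.0
    normalized["metadata"]["quality_score"] = quality
    normalized["metadata"]["status"] = (
        "normalized" if quality >= REVIEW_THRESHOLD else "needs_review"
    )
    return normalized
```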