Data Normalization Formats

Option A: Standardized Data Package (MD + CSV)

This approach prioritizes human readability and the use of universal file formats. The Normalizer would produce a folder or .zip file for each source document, containing:

One Markdown file (content.md): Contains all prose, headings, lists, and other text content.
Multiple CSV files (table_1.csv, table_2.csv, etc.): Each file holds one table extracted from the source document. The Markdown file marks each table's position with a placeholder (e.g., [TABLE: table_1.csv]).
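As a rough illustration of what the Normalizer would write under Option A, here is a minimal Python sketch. The function name `write_data_package` and the intermediate `blocks` list of `("prose", text)` / `("table", rows)` tuples are assumptions made for this example; only the file names and the `[TABLE: ...]` placeholder come from the description above.

```python
import csv
from pathlib import Path


def write_data_package(out_dir, blocks):
    """Write content.md plus one CSV per table into out_dir.

    `blocks` is a hypothetical intermediate form: a list of
    ("prose", str) or ("table", list_of_rows) tuples in document order.
    Returns the number of tables written.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    md_parts = []
    table_count = 0
    for kind, payload in blocks:
        if kind == "table":
            table_count += 1
            name = f"table_{table_count}.csv"
            # newline="" lets the csv module control line endings itself.
            with open(out_dir / name, "w", newline="", encoding="utf-8") as f:
                csv.writer(f).writerows(payload)
            # Leave a placeholder where the table appeared in the source.
            md_parts.append(f"[TABLE: {name}]")
        else:
            md_parts.append(payload)
    content = "\n\n".join(md_parts) + "\n"
    (out_dir / "content.md").write_text(content, encoding="utf-8")
    return table_count
```

Note that the only link between a paragraph and its table in this output is the placeholder string, which is the root of the context fragmentation concern discussed for this option.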

Implications of the Data Package Approach

<aside>

Pro - Human Readability: This format is very easy for non-technical team members to validate. The .md file can be opened in any text editor, and the .csv files can be reviewed in standard spreadsheet software like Google Sheets or Excel.
Pro - Portability: The output uses universal, open formats, ensuring the data is accessible and usable by a wide range of simple tools and scripts without requiring special parsers. </aside>

<aside>

Con - Context Fragmentation: This is the most significant drawback. The only link between a paragraph in the Markdown file and the table it refers to is a filename placeholder pointing at a separate .csv file. This "context gap" can reduce the ability of the RAG system (ADAM) to pull all relevant information to answer a complex question.
Con - Complex File Management: Each source document is split into multiple files. This increases the complexity of file handling and raises the risk of files becoming mismatched or separated during data transit. </aside>
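The mismatch risk could be partly mitigated with an integrity check on each package. The sketch below assumes the `[TABLE: table_1.csv]` placeholder convention from Option A and a flat folder layout; the function name `check_package` is hypothetical.

```python
import re
from pathlib import Path

# Matches placeholders such as "[TABLE: table_1.csv]" and captures the file name.
PLACEHOLDER = re.compile(r"\[TABLE:\s*([^\]\s]+)\s*\]")


def check_package(package_dir):
    """Compare placeholders in content.md with the CSV files actually present.

    Returns (missing, orphaned): tables that are referenced but absent,
    and CSV files that no placeholder refers to.
    """
    package_dir = Path(package_dir)
    text = (package_dir / "content.md").read_text(encoding="utf-8")
    referenced = set(PLACEHOLDER.findall(text))
    present = {p.name for p in package_dir.glob("*.csv")}
    return sorted(referenced - present), sorted(present - referenced)
```

A check like this detects separated files after the fact, but it does not restore the link between a table and the prose that explains it.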

Option B: Standardized JSON File

This approach prioritizes data integrity and machine-readability for fully automated systems like the Amazon Bedrock RAG knowledge base. The Normalizer would produce a single .json file for each source document, structured with a list of "content blocks."

Example JSON Structure:

{
  "document_id": "grow-guide-tomato-v1",
  "source_file": "Tomatoes.gdoc",
  "metadata": {
    "author": "Justin Garcia",
    "status": "Final"
  },

  "content_blocks": [
    {
      "block_id": "001",
      "type": "prose",
      "content": "The optimal germination temperature for tomatoes is 25°C. The following table details the weekly feeding schedule."
    },
    {
      "block_id": "002",
      "type": "table",
      "caption": "Table 1: Weekly Feeding Schedule",
      "data": [
        ["Week", "Nitrogen", "Phosphorus", "Potassium"],
        ["1", "10", "5", "5"],
        ["2", "10", "7", "7"],
        ["3", "12", "10", "10"]
      ]
    },
    {
      "block_id": "003",
      "type": "prose",
      "content": "Ensure you monitor for signs of nutrient burn after week 3."
    }
  ]
}
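
A structure like the one above can be checked before it is handed to downstream systems. The following is a minimal validation sketch in plain Python. The set of required keys and the two block types are assumptions inferred from the single example shown here, not a finalized schema, and `validate_document` is a hypothetical name.

```python
# Block types seen in the example; a real schema may define more (assumption).
ALLOWED_TYPES = {"prose", "table"}


def validate_document(doc):
    """Return a list of problems found in `doc`; an empty list means it passed."""
    errors = []
    for key in ("document_id", "source_file", "content_blocks"):
        if key not in doc:
            errors.append(f"missing top-level key: {key}")

    seen_ids = set()
    for i, block in enumerate(doc.get("content_blocks", [])):
        block_id = block.get("block_id")
        if block_id in seen_ids:
            errors.append(f"block {i}: duplicate block_id {block_id!r}")
        seen_ids.add(block_id)

        kind = block.get("type")
        if kind not in ALLOWED_TYPES:
            errors.append(f"block {i}: unknown type {kind!r}")
        elif kind == "prose" and not isinstance(block.get("content"), str):
            errors.append(f"block {i}: prose block needs a string 'content'")
        elif kind == "table":
            rows = block.get("data") or []
            # Every row, including the header, should have the same width.
            if not rows or len({len(row) for row in rows}) != 1:
                errors.append(f"block {i}: table rows must be non-empty and equal width")
    return errors
```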

Implications of the Structured JSON Approach

<aside>

Pro - Superior Data Integrity: The JSON object preserves the exact sequence and relationship of all content. Text, tables, and figures that belong together stay together within a single data structure, so a RAG pipeline built on an Amazon Bedrock knowledge base can retrieve a table together with the prose that explains it.
Pro - Atomic & Robust: The entire document is a single, self-contained file. This makes file management simpler and more reliable, and greatly reduces the risk that part of a document is lost or processed in isolation. </aside>

<aside>

Con - Poor Human Readability: A raw JSON file is difficult for non-developers to read and understand. Manual validation of the final processed output becomes challenging and may require a developer to build a special viewing tool.
Con - Higher Implementation Complexity: The Normalizer engine requires more sophisticated logic to correctly build and validate a nested JSON structure compared to simply writing out text and table files. </aside>
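
The readability concern may be smaller than it first appears, because a reviewer-friendly view can be generated from the JSON with little code. The sketch below renders a document as Markdown text. It assumes the block layout from the example JSON structure (first row of `data` is the header), and `render_markdown` is a hypothetical helper, not an existing tool.

```python
def _row(cells):
    """Format one table row in Markdown pipe syntax."""
    return "| " + " | ".join(str(c) for c in cells) + " |"


def render_markdown(doc):
    """Render a normalized JSON document as Markdown for manual review."""
    parts = [f"# {doc['document_id']}"]
    for block in doc["content_blocks"]:
        if block["type"] == "prose":
            parts.append(block["content"])
        elif block["type"] == "table":
            header, *rows = block["data"]
            lines = []
            if block.get("caption"):
                lines += [f"**{block['caption']}**", ""]
            lines.append(_row(header))
            lines.append(_row(["---"] * len(header)))
            lines.extend(_row(row) for row in rows)
            parts.append("\n".join(lines))
    return "\n\n".join(parts) + "\n"
```

Because the view is derived from the JSON, the JSON remains the single source of truth and the Markdown can be regenerated at any time.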

Final Recommendation: JSON

While a Data Package (MD + CSV) offers better human readability for manual validation, a structured JSON output is the architecturally stronger choice for powering the Amazon Bedrock knowledge base. This format provides granular control over data chunking and metadata, preserving the critical context between text and tables.

This should result in more accurate and relevant answers from ADAM, providing a better experience for our users. Therefore, it is recommended that the Unified Ingestion Engine be designed with a standardized JSON schema as its primary output.
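
To make the point about chunking control concrete, here is one possible preprocessing sketch that turns `content_blocks` into retrieval chunks, attaching the preceding prose to each table so the two are kept together. The chunk shape, the pairing rule, and the name `build_chunks` are all assumptions for illustration. The code does not call any Amazon Bedrock API, and how chunks would be loaded into the knowledge base is outside its scope.

```python
def build_chunks(doc):
    """Build one chunk per block; table chunks carry the preceding prose as context."""
    chunks = []
    previous_prose = None
    for block in doc["content_blocks"]:
        if block["type"] == "prose":
            text = block["content"]
            previous_prose = text
        else:
            # Flatten the table to text and prepend the prose that introduces it.
            rows = [", ".join(str(cell) for cell in row) for row in block["data"]]
            pieces = [previous_prose, block.get("caption"), *rows]
            text = "\n".join(p for p in pieces if p)
        chunks.append({
            "text": text,
            "metadata": {
                "document_id": doc["document_id"],
                "block_id": block["block_id"],
                "type": block["type"],
                # Document-level metadata (author, status, ...) is copied to every chunk.
                **doc.get("metadata", {}),
            },
        })
    return chunks
```

With the tomato example, the chunk for block "002" would begin with the sentence that introduces the feeding schedule, which is the context a separate .csv file would lose.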