The Data Package approach (MD + CSV) prioritizes human readability and the use of universal file formats. The Normalizer would produce a folder or .zip file for each source document, containing:
- content.md: contains all the prose, headings, lists, and other text content.
- table_1.csv, table_2.csv, etc.: each file holds one table extracted from the source document. The Markdown file references each table with a placeholder (e.g., [TABLE: table_1.csv]).

<aside>
The .md file can be opened in any text editor, and the .csv files can be reviewed in standard spreadsheet software such as Google Sheets or Excel.
</aside>

<aside>
The link between the Markdown text and each .csv file is weak. This "context gap" can reduce the ability of the RAG system (ADAM) to pull all relevant information to answer a complex question.
</aside>

The structured JSON approach prioritizes data integrity and machine-readability for fully automated systems like the Amazon Bedrock RAG knowledge base. The Normalizer would produce a single .json file for each source document, structured as a list of "content blocks."
Example JSON Structure:
```json
{
  "document_id": "grow-guide-tomato-v1",
  "source_file": "Tomatoes.gdoc",
  "metadata": {
    "author": "Justin Garcia",
    "status": "Final"
  },
  "content_blocks": [
    {
      "block_id": "001",
      "type": "prose",
      "content": "The optimal germination temperature for tomatoes is 25°C. The following table details the weekly feeding schedule."
    },
    {
      "block_id": "002",
      "type": "table",
      "caption": "Table 1: Weekly Feeding Schedule",
      "data": [
        ["Week", "Nitrogen", "Phosphorus", "Potassium"],
        ["1", "10", "5", "5"],
        ["2", "10", "7", "7"],
        ["3", "12", "10", "10"]
      ]
    },
    {
      "block_id": "003",
      "type": "prose",
      "content": "Ensure you monitor for signs of nutrient burn after week 3."
    }
  ]
}
```
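
To make the shape of this structure concrete, the following is a minimal sketch of how a document in this format might be checked before it is handed on. The function name `validate_document`, the `ALLOWED_TYPES` set, and the specific rules (unique block IDs, non-empty prose, rectangular tables) are assumptions made for illustration; they are not part of any agreed schema.

```python
# Hypothetical validator for the content-block structure shown above.
# Assumption: only the two block types from the example ("prose", "table") exist.
ALLOWED_TYPES = {"prose", "table"}


def validate_document(doc: dict) -> list[str]:
    """Return a list of problems found; an empty list means the document passed."""
    errors = []
    for key in ("document_id", "source_file", "content_blocks"):
        if key not in doc:
            errors.append(f"missing top-level key: {key}")

    seen_ids = set()
    for i, block in enumerate(doc.get("content_blocks", [])):
        block_id = block.get("block_id")
        label = block_id or f"index {i}"
        if not block_id:
            errors.append(f"block at index {i} has no block_id")
        elif block_id in seen_ids:
            errors.append(f"duplicate block_id: {block_id}")
        seen_ids.add(block_id)

        block_type = block.get("type")
        if block_type not in ALLOWED_TYPES:
            errors.append(f"block {label}: unknown type {block_type!r}")
        elif block_type == "prose":
            content = block.get("content")
            if not isinstance(content, str) or not content.strip():
                errors.append(f"block {label}: prose needs non-empty content")
        else:  # table
            rows = block.get("data")
            if not isinstance(rows, list) or len(rows) < 2:
                errors.append(f"block {label}: table needs a header row and at least one data row")
            elif any(not isinstance(r, list) or len(r) != len(rows[0]) for r in rows):
                errors.append(f"block {label}: rows differ in length from the header")
    return errors


# Usage with a shortened version of the example document.
doc = {
    "document_id": "grow-guide-tomato-v1",
    "source_file": "Tomatoes.gdoc",
    "content_blocks": [
        {"block_id": "001", "type": "prose", "content": "Intro text."},
        {"block_id": "002", "type": "table", "caption": "Table 1",
         "data": [["Week", "Nitrogen"], ["1", "10"]]},
    ],
}
print(validate_document(doc))  # → []
```

Returning a list of messages instead of raising on the first failure suits manual review: one pass reports every problem in a source document.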
<aside>
The Normalizer engine requires more sophisticated logic to correctly build and validate a nested JSON structure compared to simply writing out text and table files.
</aside>

While a Data Package (MD + CSV) offers the best human readability for manual validation, a structured JSON output is the architecturally superior choice for powering the Amazon Bedrock knowledge base. This format provides granular control over data chunking and metadata, preserving the critical context between text and tables.
This will result in more accurate and relevant answers from ADAM, providing a better experience for our users. Therefore, it is recommended that the Unified Ingestion Engine be designed with a standardized JSON schema as its primary output.
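
As an illustration of what "granular control over data chunking" could look like, here is a minimal sketch that turns content blocks into retrieval chunks. The function name `blocks_to_chunks` and the chunk layout (a `text` field plus a `metadata` dictionary) are assumptions for this example only. How chunks would actually be loaded into the Amazon Bedrock knowledge base is not shown and would need to follow the Bedrock documentation.

```python
def blocks_to_chunks(doc: dict) -> list[dict]:
    """Turn content blocks into retrieval chunks, one chunk per block.

    Each table chunk is prefixed with the most recent prose block and the
    table caption, so a table is never indexed without its context.
    """
    chunks = []
    previous_prose = ""
    for block in doc["content_blocks"]:
        if block["type"] == "prose":
            text = block["content"]
            previous_prose = text
        elif block["type"] == "table":
            header, *rows = block["data"]
            # Render each row as "Column: value" pairs so it reads as a sentence.
            lines = [
                ", ".join(f"{name}: {value}" for name, value in zip(header, row))
                for row in rows
            ]
            parts = [previous_prose, block.get("caption", ""), *lines]
            text = "\n".join(p for p in parts if p)
        else:
            continue  # unknown block types are skipped in this sketch

        chunks.append({
            "text": text,
            "metadata": {
                **doc.get("metadata", {}),  # document-level metadata, e.g. author
                "document_id": doc["document_id"],
                "block_id": block["block_id"],
                "type": block["type"],
            },
        })
    return chunks
```

With the example document, the chunk for block 002 would begin with the germination and feeding-schedule sentence from block 001, followed by the caption and one line per table row. This is the context that a bare .csv file would not carry.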