A technical roadmap.
<aside>
💡 Summary of Proposal:
This would be a single, automated module responsible for collecting, processing, and normalizing all incoming content. It would abstract away the complexity of handling different file types and deliver clean, consistent data to ADAM’s knowledge base.
Status: This is a proposal for a future project (v2.0) and is not part of the current workflow.
</aside>
This diagram illustrates the improved workflow after implementing the proposed Unified Ingestion Engine. The key change is that all data sources are funneled into a single engine that produces a standardized JSON output.
flowchart TD;
subgraph 1 ["Source Documents"]
direction TB
B[External PDF];
C["Internal Documents
(Exported from GDrive)"];
end
subgraph 2["Proposed: Unified Ingestion Engine"]
direction TB
D[Collector] --> E[Extractor]
E --> F[Cleaner]
F --> F1[Normalizer]
F1 --> F2[Validator / Scorer]
end
F2 --> G(("Standardized JSON<br/>(for review)"))
G --> 3
subgraph 3[Downstream Process]
direction TB
H["🧪 Test and Validate Content"] --> I["✅ Ready for Import"]
I --> J["🎉 Imported to Knowledge Base"]
end
B --> D
C --> D
This proposed engine would have these primary responsibilities:
Collector: Gathers raw content. This component acts as the single entry point for all documents entering the pipeline.
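As a rough sketch of what this could look like (the function name `collect_documents`, the folder layout, and the set of supported file types below are assumptions for illustration, not decisions), the Collector might simply walk a drop folder and yield every supported file:

```python
from pathlib import Path
from typing import Iterator

# Illustrative only: the supported suffixes are an assumption, not a decision.
SUPPORTED_SUFFIXES = {".pdf", ".docx", ".html", ".txt"}

def collect_documents(inbox: Path) -> Iterator[Path]:
    """Yield every supported document found under the inbox directory."""
    for path in sorted(inbox.rglob("*")):
        if path.is_file() and path.suffix.lower() in SUPPORTED_SUFFIXES:
            yield path
```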
Extractor: Selects the correct tool to extract content based on file type. It would be designed to differentiate between prose (text), tables, figures, and their associated metadata.
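One simple way to express this selection is a dispatch table keyed on file suffix. The sketch below is illustrative only: the extractor functions are stubs, and the mapping of file types to extractors is an assumption rather than a committed design.

```python
from pathlib import Path

def extract_pdf(path: Path) -> dict:
    # Placeholder: a real implementation would call a PDF parsing library
    # and return prose, tables, figures, and metadata.
    raise NotImplementedError

def extract_docx(path: Path) -> dict:
    # Placeholder: a real implementation would parse exported internal documents.
    raise NotImplementedError

# Hypothetical dispatch table: file suffix -> extractor callable.
EXTRACTORS = {
    ".pdf": extract_pdf,
    ".docx": extract_docx,
}

def extract(path: Path) -> dict:
    """Pick the extractor for this file type, or fail loudly if unsupported."""
    try:
        extractor = EXTRACTORS[path.suffix.lower()]
    except KeyError:
        raise ValueError(f"No extractor registered for {path.suffix}")
    return extractor(path)
```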
Cleaner: Corrects errors and removes artifacts from the raw content.
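The kinds of fixes involved might look like the rules below. These specific rules (stripping soft hyphens, re-joining words hyphenated across line breaks, collapsing whitespace) are examples of typical extraction artifacts, not an agreed cleaning policy.

```python
import re
import unicodedata

def clean_text(raw: str) -> str:
    """Remove common extraction artifacts from raw text (illustrative rules only)."""
    text = unicodedata.normalize("NFKC", raw)   # normalize odd Unicode forms
    text = text.replace("\u00ad", "")           # drop soft hyphens left by PDF export
    text = re.sub(r"-\n(?=[a-z])", "", text)    # re-join words hyphenated across lines
    text = re.sub(r"[ \t]+", " ", text)         # collapse runs of spaces and tabs
    text = re.sub(r"\n{3,}", "\n\n", text)      # collapse excessive blank lines
    return text.strip()
```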
Normalizer: This is the most critical component. It takes the output of the Cleaner and transforms it into a single, standardized schema. The choice of schema has significant implications for the entire data pipeline, from manual validation to AI performance.
<aside>
Here are the two primary options for the standardized data format and their trade-offs:
</aside>
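To make the discussion concrete, here is a minimal sketch of what one standardized record could look like after normalization, independent of which schema option is chosen. Every field name (`source_file`, `content_type`, `text`, `metadata`) is a placeholder for discussion, not a proposed final schema.

```python
from dataclasses import dataclass, field, asdict
import json

@dataclass
class NormalizedChunk:
    """One unit of normalized content. Field names are placeholders for discussion."""
    source_file: str                              # original document the chunk came from
    content_type: str                             # e.g. "prose", "table", "figure"
    text: str                                     # cleaned, normalized body text
    metadata: dict = field(default_factory=dict)  # page numbers, headings, etc.

chunk = NormalizedChunk(
    source_file="example.pdf",
    content_type="prose",
    text="Cleaned paragraph text goes here.",
    metadata={"page": 3},
)
print(json.dumps(asdict(chunk), indent=2))  # the standardized JSON handed downstream
```

Whichever schema option is chosen, the key property is the same: every source type ends up in one consistent shape before it reaches validation.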
Validator / Scorer: Inspects the final JSON, scores its quality, and flags it for manual review if necessary.
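The scoring logic could start with a few cheap heuristic checks and a review threshold; records falling below it would be flagged for manual review. The specific checks and the 0.75 threshold in this sketch are placeholders, not agreed rules.

```python
def score_record(record: dict) -> float:
    """Return a quality score between 0 and 1 using simple heuristic checks.

    The checks and their equal weighting are illustrative placeholders.
    """
    checks = [
        bool(record.get("text", "").strip()),                        # has non-empty text
        bool(record.get("source_file")),                             # provenance is recorded
        record.get("content_type") in {"prose", "table", "figure"},  # known content type
        len(record.get("text", "")) > 40,                            # not a trivially short fragment
    ]
    return sum(checks) / len(checks)

def needs_manual_review(record: dict, threshold: float = 0.75) -> bool:
    """Flag records whose score falls below the (assumed) review threshold."""
    return score_record(record) < threshold
```

In the diagram above, the output of this step is the "Standardized JSON (for review)" node that feeds the downstream testing, validation, and import process.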