A technical roadmap.

<aside>

💡 Summary of Proposal:

This would be a single, automated module responsible for collecting, processing, and normalizing all incoming content. It would abstract away the complexity of handling different file types and deliver clean, consistent data to ADAM’s knowledge base.

Status: This is a proposal for a future project (v2.0) and is not part of the current workflow.

</aside>

Proposed Future State Workflow (v2.0)

This diagram illustrates the improved workflow after implementing the proposed Unified Ingestion Engine. The key change is that all data sources are funneled into a single engine that produces standardized JSON output.

```mermaid
flowchart TD
  subgraph S1["Source Documents"]
    direction TB
    B[External PDF]
    C["Internal Documents<br/>(Exported from GDrive)"]
  end

  subgraph S2["Proposed: Unified Ingestion Engine"]
    direction TB
    D[Collector] --> E[Extractor]
    E --> F[Cleaner]
    F --> F1[Normalizer]
    F1 --> F2[Validator / Scorer]
  end

  F2 --> G(("Standardized JSON<br/>(for review)"))

  G --> S3

  subgraph S3[Downstream Process]
    direction TB
    H["🧪 Test and<br/>Validate Content"] --> I[✅ Ready for Import]
    I --> J[🎉 Imported to Knowledge Base]
  end

  B --> D
  C --> D
```


Core Responsibilities

This proposed engine would have these primary responsibilities:

1. File Collection (Collector):

Gathers raw content. This component acts as the single entry point for all documents entering the pipeline.
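
As a minimal sketch of this idea, a collector might simply walk a drop-off directory and register every supported file it finds. The directory layout, function name, and supported file types below are assumptions for illustration, not a decided design.

```python
from pathlib import Path

# Hypothetical sketch: the collector is the single entry point that gathers
# every supported source file from a drop-off ("inbox") directory.
SUPPORTED_SUFFIXES = {".pdf", ".docx", ".txt"}  # assumed file types

def collect(inbox: Path) -> list[Path]:
    """Return all supported documents found under the inbox directory."""
    return sorted(
        p for p in inbox.rglob("*")
        if p.is_file() and p.suffix.lower() in SUPPORTED_SUFFIXES
    )
```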

2. Intelligent Extraction (Extractor):

Selects the correct tool to extract content based on file type. It would be designed to differentiate between prose (text), tables, figures, and their associated metadata.
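
One way to realize "selects the correct tool based on file type" is a dispatch table keyed on file extension. The extractor functions below are hypothetical placeholders (they do not call any real parsing library); the point is the routing pattern and the prose/tables/figures/metadata split.

```python
from pathlib import Path
from typing import Callable

def extract_pdf(path: Path) -> dict:
    # Placeholder: a real implementation would call a PDF parsing library
    # and separate prose, tables, figures, and metadata.
    return {"source": str(path), "text": "", "tables": [], "figures": [], "metadata": {}}

def extract_gdrive_export(path: Path) -> dict:
    # Placeholder for internal documents exported from GDrive.
    return {"source": str(path), "text": "", "tables": [], "figures": [], "metadata": {}}

# Hypothetical mapping from file type to extraction tool.
EXTRACTORS: dict[str, Callable[[Path], dict]] = {
    ".pdf": extract_pdf,
    ".docx": extract_gdrive_export,
}

def extract(path: Path) -> dict:
    """Pick the extractor that matches the file type."""
    try:
        return EXTRACTORS[path.suffix.lower()](path)
    except KeyError:
        raise ValueError(f"No extractor registered for {path.suffix}")
```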

3. Data Cleaning (Cleaner):

Corrects errors and removes artifacts from the raw content.
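
A cleaner could start as a chain of simple text-level fixes. The specific rules below (soft-hyphen removal, hyphenation repair, whitespace collapsing) are illustrative assumptions about common extraction artifacts, not a confirmed rule set.

```python
import re

def clean(text: str) -> str:
    """Remove common extraction artifacts from raw text (illustrative rules only)."""
    text = text.replace("\u00ad", "")          # strip soft hyphens
    text = re.sub(r"-\n(\w)", r"\1", text)     # rejoin words broken across line ends
    text = re.sub(r"[ \t]+", " ", text)        # collapse runs of spaces/tabs
    text = re.sub(r"\n{3,}", "\n\n", text)     # trim excessive blank lines
    return text.strip()
```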

4. Data Normalization (Normalizer):

This is the most critical component. It transforms the raw, extracted data from the Cleaner into a single, standardized schema. The choice of schema has significant implications for the entire data pipeline, from manual validation to AI performance.

<aside>

Here are the two primary options for the standardized data format, along with their trade-offs:

Data Normalization Formats

</aside>
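
Whichever format is selected, its shape could be pinned down in code so the Cleaner-to-Normalizer hand-off is explicit. The schema below is only a guess at what a standardized record might contain; every field name is an assumption, not the decided format.

```python
from dataclasses import dataclass, field, asdict

@dataclass
class NormalizedDocument:
    # Hypothetical standardized schema; the real field set would follow
    # whichever normalization format is chosen above.
    source: str
    title: str
    sections: list[dict] = field(default_factory=list)  # prose, grouped by heading
    tables: list[dict] = field(default_factory=list)
    figures: list[dict] = field(default_factory=list)
    metadata: dict = field(default_factory=dict)

def normalize(cleaned: dict) -> dict:
    """Map cleaned extractor output onto the standardized schema."""
    doc = NormalizedDocument(
        source=cleaned.get("source", ""),
        title=cleaned.get("metadata", {}).get("title", ""),
        sections=[{"heading": "", "text": cleaned.get("text", "")}],
        tables=cleaned.get("tables", []),
        figures=cleaned.get("figures", []),
        metadata=cleaned.get("metadata", {}),
    )
    return asdict(doc)
```

An explicit schema like this would also give the Validator / Scorer something concrete to check against.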

5. Validation & Quality Scoring (Validator / Scorer)

Inspects the final JSON, scores its quality, and flags it for manual review if necessary.
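
A validator/scorer might compute a simple completeness score and flag low-scoring records for manual review. The checks and the 0.8 threshold below are placeholder assumptions against the hypothetical schema sketched above.

```python
def validate(record: dict, threshold: float = 0.8) -> dict:
    """Score a normalized record and flag it for manual review if the score is low."""
    checks = {
        "has_source": bool(record.get("source")),
        "has_title": bool(record.get("title")),
        "has_text": any(s.get("text") for s in record.get("sections", [])),
    }
    score = sum(checks.values()) / len(checks)
    return {
        "score": score,
        "checks": checks,
        "needs_review": score < threshold,  # route to manual review before import
    }
```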