This page explains the end-to-end process at a high level.


The Data Lifecycle

This workflow ensures that all data, whether authored from scratch or processed from existing documents, is clean, structured, and validated before being integrated into the knowledge base.

Current Workflow (v1.0)

This diagram shows the current process for each supported data type: newly authored content and legacy or existing PDFs.

graph TD;
    A[Start] --> B{Data Type?};
    B -- New Content --> C[📄 Author New Doc in GDocs];
    B -- Legacy/Existing PDF --> D[📄 Process with OCR Package];
    C --> E[🧪 Test & Validate Content];
    D --> E;
    E --> F[✅ Ready for Import];
    F --> G[🎉 Imported to Knowledge Base];
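
The branching in the diagram above can be sketched in a few lines of Python. This is only an illustration of the routing logic, not an existing tool: the names `DataType`, `COMMON_STEPS`, and `plan_steps`, and the step identifiers, are all hypothetical labels derived from the boxes in the diagram.

```python
from enum import Enum


class DataType(Enum):
    # Hypothetical names for the two branches of the "Data Type?" decision.
    NEW_CONTENT = "new_content"
    LEGACY_PDF = "legacy_pdf"


# Steps shared by both branches once the content exists (names are assumptions).
COMMON_STEPS = ["test_and_validate", "ready_for_import", "import_to_knowledge_base"]


def plan_steps(data_type: DataType) -> list[str]:
    """Return the ordered steps a document goes through in the v1.0 workflow."""
    if data_type is DataType.NEW_CONTENT:
        first = "author_in_gdocs"
    elif data_type is DataType.LEGACY_PDF:
        first = "process_with_ocr_package"
    else:
        raise ValueError(f"unsupported data type: {data_type!r}")
    return [first, *COMMON_STEPS]
```

Both branches differ only in their first step, which is why the diagram converges on the test and validation stage.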

Proposed Future State Workflow (v2.0)

This diagram illustrates the improved workflow that would follow implementation of the proposed Unified Ingestion Engine. The key change is that all data sources are funneled into a single engine that produces standardized JSON for review.

flowchart TD;
  subgraph 1 ["Source Documents"]
    direction TB
    B[External PDF];
    C["Internal Documents
    (Exported from GDrive)"];
  end

  subgraph 2["Proposed: Unified Ingestion Engine"]
    direction TB
    D[Collector] --> E[Extractor]
    E --> F[Cleaner]
    F --> F1[Normalizer]
    F1 --> F2[Validator / Scorer]
  end

  F2 --> G(("Standardized JSON
  (for review)"))

  G --> 3

  subgraph 3[Downstream Process]
    direction TB
    H["🧪 Test and
    Validate Content"] --> I[✅ Ready for Import]
    I --> J[🎉 Imported to Knowledge Base]
  end

  B --> D
  C --> D
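
The engine stages in the diagram above (Collector, Extractor, Cleaner, Normalizer, Validator / Scorer) can be sketched as a simple chain of functions. This is a minimal sketch under stated assumptions, not the proposed engine's actual design: the `Record` fields, the source labels, the cleaning and scoring rules, and the shape of the output JSON are all hypothetical placeholders, since this page does not define them.

```python
import json
import re
from dataclasses import asdict, dataclass, field


@dataclass
class Record:
    """Hypothetical record passed between stages; every field name is an assumption."""
    source: str                      # e.g. "external_pdf" or "internal_gdrive_export"
    raw_text: str = ""
    text: str = ""
    score: float = 0.0
    issues: list = field(default_factory=list)


def collect(source: str, payload: str) -> Record:
    # Collector: accept a document from either source and tag its origin.
    return Record(source=source, raw_text=payload)


def extract(rec: Record) -> Record:
    # Extractor: stand-in for OCR or text extraction; the payload is already text here.
    rec.text = rec.raw_text
    return rec


def clean(rec: Record) -> Record:
    # Cleaner: drop non-printable characters, keeping newlines and tabs.
    rec.text = "".join(ch for ch in rec.text if ch.isprintable() or ch in "\n\t")
    return rec


def normalize(rec: Record) -> Record:
    # Normalizer: collapse runs of whitespace and trim both ends.
    rec.text = re.sub(r"\s+", " ", rec.text).strip()
    return rec


def validate(rec: Record) -> Record:
    # Validator / Scorer: toy rule, 1.0 if any text survived, otherwise 0.0.
    if not rec.text:
        rec.issues.append("empty document")
    rec.score = 0.0 if rec.issues else 1.0
    return rec


STAGES = [extract, clean, normalize, validate]


def ingest(source: str, payload: str) -> str:
    """Run one document through every stage and return the JSON sent for review."""
    rec = collect(source, payload)
    for stage in STAGES:
        rec = stage(rec)
    out = asdict(rec)
    del out["raw_text"]              # keep only the processed fields in the output
    return json.dumps(out, sort_keys=True)
```

Because both source types enter through the same `collect` call, the downstream test and validation step only ever has to handle one output format, which is the point of the proposed change.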

For further details, see the Unified Ingestion Engine documentation.