1. Extract Data: Write scripts (e.g., Python with Boto3 for S3, or SQL queries for databases) to extract raw data from its source.

  2. Standardize Document Formats: Bedrock supports various formats, but it is often worth converting diverse documents to a single format such as PDF or Markdown when that simplifies downstream processing or consistency checks.

  3. Organize Data: Ensure data is logically organized. This means using clear folder structures, consistent naming conventions, and potentially partitioning data for easier management and ingestion.

    1. Define a logical structure based on topic or context.
    2. Choose whichever approach is more efficient for the collection at hand:
      1. Manually sort the files, or
      2. Write automated scripts to scan through and sort the documents.
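
As a rough illustration of the automated option, the sketch below copies files into per-topic folders based on keywords in their file names. It is only a starting point, not an established procedure: the `TOPIC_KEYWORDS` mapping, the `unsorted` fallback folder, and the function names `classify` and `sort_documents` are all hypothetical, and matching on file names alone is an assumption that may not suit every collection.

```python
from pathlib import Path
import shutil

# Hypothetical topic -> keyword mapping; replace with your own taxonomy.
TOPIC_KEYWORDS = {
    "billing": ["invoice", "payment"],
    "hr": ["onboarding", "leave"],
}
DEFAULT_TOPIC = "unsorted"  # assumed fallback folder for unmatched files


def classify(filename: str) -> str:
    """Return the first topic whose keyword appears in the file name."""
    name = filename.lower()
    for topic, keywords in TOPIC_KEYWORDS.items():
        if any(keyword in name for keyword in keywords):
            return topic
    return DEFAULT_TOPIC


def sort_documents(source_dir: str, target_dir: str) -> dict[str, str]:
    """Copy each file in source_dir into target_dir/<topic>/.

    Returns a mapping of file name -> topic so the result can be reviewed.
    Files are copied, not moved, so the raw data stays untouched.
    """
    source, target = Path(source_dir), Path(target_dir)
    placed = {}
    for path in sorted(source.iterdir()):
        if not path.is_file():
            continue  # skip subdirectories in this simple sketch
        topic = classify(path.name)
        destination = target / topic
        destination.mkdir(parents=True, exist_ok=True)
        shutil.copy2(path, destination / path.name)
        placed[path.name] = topic
    return placed
```

Calling `sort_documents("raw", "sorted")` would place a file such as `Invoice_2023.pdf` under `sorted/billing/`, while anything that matches no keyword lands in `sorted/unsorted/` for manual review.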

<aside>

Other SOPs to be determined. Need to learn from supervisors how the data ingestion process works.

(e.g., chunking long documents, tagging/categorization, etc.)

</aside>