Data Collection and Organization SOPs:

Extract Data: Write scripts (e.g., Python with Boto3 for S3, SQL queries for databases) to extract raw data from its source.
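As a rough illustration of the S3 case, the sketch below downloads every object under a prefix into a local folder. The function name `download_prefix`, the bucket and prefix names, and the local folder layout are assumptions made for the example, not part of the existing ADAM tooling. The client is passed in as an argument so the function can be exercised without AWS access.

```python
from pathlib import Path


def download_prefix(s3_client, bucket, prefix, dest_dir):
    """Download every object under `prefix` in `bucket` into `dest_dir`.

    `s3_client` is expected to behave like boto3.client("s3").
    Returns the list of local paths that were written.
    """
    dest = Path(dest_dir)
    downloaded = []
    # list_objects_v2 returns at most 1000 keys per call, so paginate.
    paginator = s3_client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            key = obj["Key"]
            if key.endswith("/"):
                # Skip zero-byte "folder" placeholder objects.
                continue
            target = dest / key
            target.parent.mkdir(parents=True, exist_ok=True)
            s3_client.download_file(bucket, key, str(target))
            downloaded.append(target)
    return downloaded


# Example usage (requires boto3 and AWS credentials; names are placeholders):
#   import boto3
#   download_prefix(boto3.client("s3"), "my-raw-data-bucket", "reports/", "raw_data")
```

Mirroring the S3 key structure locally keeps the source layout visible, which may help when deciding how to organize the files later.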

Standardize Document Formats: Although Bedrock supports several formats, standardizing them is often beneficial. For example, convert diverse documents to PDF or Markdown if that simplifies downstream processing or consistency checks.

ADAM uses Amazon Bedrock Knowledge Bases, which supports source files in the following formats:

| Format | Extension |
| --- | --- |
| Plain text (ASCII only) | .txt |
| Markdown | .md |
| HyperText Markup Language | .html |
| Microsoft Word document | .doc/.docx |
| Comma-separated values | .csv |
| Microsoft Excel spreadsheet | .xls/.xlsx |
| Portable Document Format | .pdf |
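
One way to act on this list is to scan a local folder and flag files whose extension is not among the supported ones, so they can be converted before ingestion. The sketch below is a minimal example under that assumption; the helper name `find_unsupported` and the folder name in the usage comment are hypothetical, and the extension set is copied from the list above.

```python
from pathlib import Path

# Extensions taken from the supported-formats list above.
SUPPORTED_EXTENSIONS = {
    ".txt", ".md", ".html", ".doc", ".docx",
    ".csv", ".xls", ".xlsx", ".pdf",
}


def find_unsupported(root):
    """Return a sorted list of files under `root` with an unsupported extension."""
    return sorted(
        path
        for path in Path(root).rglob("*")
        if path.is_file() and path.suffix.lower() not in SUPPORTED_EXTENSIONS
    )


# Example usage (folder name is a placeholder):
#   for path in find_unsupported("raw_data"):
#       print("needs conversion:", path)
```

The check only looks at the file extension. It cannot tell, for example, whether a PDF contains selectable text or only scanned images.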

<aside>

Image PDF to Text PDF Processing: ‣

</aside>

Organize Data: Ensure data is logically organized. Use clear folder structures and consistent naming conventions, and consider partitioning data for easier management and ingestion.
1. Define a logical structure according to topic/context.
2. Depending on which is more efficient for the collection at hand, either:
   1. Manually sort the files, or
   2. Write automated scripts to scan through and sort the documents.
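
For the automated option, a script could sort documents into topic folders based on keywords in their file names. The following is only a sketch: the topic names, the keywords in `TOPIC_KEYWORDS`, and the function names are invented for illustration and would need to be replaced with the structure defined in step 1. It copies rather than moves files, and defaults to a dry run so the plan can be reviewed first.

```python
import shutil
from pathlib import Path

# Hypothetical topic rules; replace with the structure defined in step 1.
TOPIC_KEYWORDS = {
    "finance": ["invoice", "budget"],
    "hr": ["policy", "leave"],
}


def pick_topic(filename, rules=TOPIC_KEYWORDS, default="unsorted"):
    """Return the first topic whose keyword appears in the file name."""
    name = filename.lower()
    for topic, keywords in rules.items():
        if any(keyword in name for keyword in keywords):
            return topic
    return default


def sort_documents(src_dir, dest_dir, dry_run=True):
    """Plan (and optionally perform) copying files into topic folders.

    Returns a list of (source, target) path pairs. Nothing is copied
    while `dry_run` is True.
    """
    plan = []
    for path in sorted(Path(src_dir).iterdir()):
        if not path.is_file():
            continue
        target = Path(dest_dir) / pick_topic(path.name) / path.name
        plan.append((path, target))
        if not dry_run:
            target.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(path, target)
    return plan


# Example usage (folder names are placeholders):
#   for source, target in sort_documents("raw_data", "organized"):
#       print(source, "->", target)
```

File-name matching is a deliberately simple rule. If names are not descriptive enough, sorting by content or by manual review may be the better choice.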

<aside>

Other SOPs (e.g., chunking long documents, tagging/categorization) are still to be determined. Need to learn from supervisors how the data ingestion process works.

</aside>