Extract Data: Write scripts (e.g., Python with Boto3 for S3, SQL queries for databases) to extract raw data from its source.
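As an illustration of the S3 case, the sketch below downloads every object under a prefix to a local folder. It is a minimal example, not ADAM's actual extraction script: the function name `download_prefix` and the bucket, prefix, and folder names in the usage comment are hypothetical. The client is passed in as an argument so the function can be tried without AWS credentials.

```python
from pathlib import Path


def download_prefix(s3_client, bucket: str, prefix: str, dest_dir: str) -> list:
    """Download every object under `prefix` in `bucket` into `dest_dir`.

    `s3_client` is expected to be a Boto3 S3 client, e.g. boto3.client("s3").
    Returns the list of local paths that were written.
    """
    dest = Path(dest_dir)
    downloaded = []
    # list_objects_v2 returns at most 1,000 keys per call, so paginate.
    paginator = s3_client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            key = obj["Key"]
            if key.endswith("/"):
                # Skip zero-byte "folder" placeholder objects.
                continue
            target = dest / key
            target.parent.mkdir(parents=True, exist_ok=True)
            s3_client.download_file(bucket, key, str(target))
            downloaded.append(target)
    return downloaded


# Hypothetical usage (bucket and prefix names are placeholders):
#   import boto3
#   download_prefix(boto3.client("s3"), "my-raw-bucket", "reports/2024/", "./raw")
```

Passing the client in, rather than creating it inside the function, also makes it easy to point the same script at a different AWS profile or region.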
Standardize Document Formats: Although Amazon Bedrock supports various formats, it is often beneficial to standardize them, for example by converting diverse documents to PDF or Markdown when that simplifies downstream processing or consistency checks.
ADAM uses Amazon Bedrock Knowledge Bases, which supports source files in the following formats:
Format | Extension |
---|---|
Plain text (ASCII only) | .txt |
Markdown | .md |
HyperText Markup Language | .html |
Microsoft Word document | .doc/.docx |
Comma-separated values | .csv |
Microsoft Excel spreadsheet | .xls/.xlsx |
Portable Document Format | .pdf |
<aside>
Image PDF to Text PDF Processing: ‣
</aside>
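The format list above can be turned into a simple pre-upload check. The sketch below flags files whose extension is not in the table and `.txt` files that contain non-ASCII bytes. The function name `validate_file` is hypothetical, and the check looks only at extensions and raw bytes; it does not verify that a file's contents actually match its extension.

```python
from pathlib import Path

# Extensions from the supported-formats table for Bedrock Knowledge Bases.
SUPPORTED_EXTENSIONS = {
    ".txt", ".md", ".html", ".doc", ".docx", ".csv", ".xls", ".xlsx", ".pdf",
}


def validate_file(path):
    """Return a reason string if the file looks unsuitable, else None."""
    p = Path(path)
    ext = p.suffix.lower()
    if ext not in SUPPORTED_EXTENSIONS:
        return f"unsupported extension: {ext or '(none)'}"
    # The table lists plain text as ASCII only, so check .txt content.
    if ext == ".txt" and not p.read_bytes().isascii():
        return "plain-text file contains non-ASCII characters"
    return None
```

Running this over the extracted files before upload gives a short list of documents that need conversion or cleanup first.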
Organize Data: Ensure data is logically organized. This means using clear folder structures, consistent naming conventions, and potentially partitioning data for easier management and ingestion.
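One way to enforce a naming convention and a partitioned folder structure is to generate every storage path from a single helper. The layout used below (`<source>/<year>/<month>/<file>`) and the names `slugify` and `build_object_key` are assumptions made for illustration; they are not an established ADAM convention.

```python
import re
from datetime import date
from pathlib import PurePosixPath


def slugify(name: str) -> str:
    """Lowercase `name` and collapse runs of other characters into hyphens."""
    slug = re.sub(r"[^a-z0-9]+", "-", name.lower()).strip("-")
    return slug or "untitled"


def build_object_key(source: str, doc_date: date, filename: str) -> str:
    """Build a key of the form <source>/<yyyy>/<mm>/<clean-name><ext>.

    The layout is a hypothetical convention: partitioned by source system,
    then by year and month of the document date.
    """
    path = PurePosixPath(filename)
    stem = slugify(path.stem)
    ext = path.suffix.lower()
    return f"{slugify(source)}/{doc_date:%Y}/{doc_date:%m}/{stem}{ext}"


key = build_object_key("HR Policies", date(2024, 3, 5), "Employee Handbook (v2).PDF")
print(key)  # hr-policies/2024/03/employee-handbook-v2.pdf
```

Generating keys in one place keeps names consistent across extraction scripts and makes it straightforward to ingest or re-sync a single partition.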
<aside>
Other SOPs are still to be determined. We need to learn from supervisors how the data ingestion process works
(e.g., chunking long documents, tagging/categorization).
</aside>