ETL Pipeline
The Extract, Transform, and Load (ETL) framework serves as the backbone of data processing within the Retrieval-Augmented Generation (RAG) use case.
The ETL pipeline orchestrates the flow from raw data sources to a structured vector store, ensuring data is in the optimal format for retrieval by the AI model.
The RAG use case is designed to augment the capabilities of generative models by retrieving relevant information from a body of data to enhance the quality and relevance of the generated output.
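At a high level, an ETL run chains three contracts: a reader supplies Document instances, one or more transformers reshape them, and a writer persists them. The snippet below is a minimal sketch of that flow; the single-argument PagePdfDocumentReader constructor, the file:sample.pdf resource, the no-argument TokenTextSplitter constructor, and an already configured vectorStore are assumptions made for illustration.

// Extract: the reader supplies the raw documents.
PagePdfDocumentReader pdfReader = new PagePdfDocumentReader("file:sample.pdf");
List<Document> documents = pdfReader.get();

// Transform: split the documents so each chunk fits the model's context window.
List<Document> chunks = new TokenTextSplitter().apply(documents);

// Load: the writer (here a VectorStore) consumes the final batch.
vectorStore.accept(chunks);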
API Overview
DocumentTransformer
public interface DocumentTransformer extends Function<List<Document>, List<Document>> {
}
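Because the contract is a plain java.util.function.Function over document lists, transformers can also be composed with the standard andThen(..) method before being applied to a batch.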
DocumentTransformer Interface
- Function<List<Document>, List<Document>>: Transforms a batch of documents as part of the processing workflow.
- TextSplitter: Divides documents to fit the AI model’s context window.
- TokenTextSplitter: Splits documents while preserving token-level integrity.
- ContentFormatTransformer: Ensures uniform content formats across all documents.
- KeywordMetadataEnricher: Augments documents with essential keyword metadata.
- SummaryMetadataEnricher: Enriches documents with summarization metadata for enhanced retrieval.
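Each transformer is invoked through the inherited apply method, and custom transformations slot in the same way. The following is a minimal sketch; the no-argument TokenTextSplitter constructor, a documents list produced by a reader, and a mutable metadata map on Document are assumed.

// Split the extracted documents into chunks that fit the model's context window.
TokenTextSplitter splitter = new TokenTextSplitter();
List<Document> chunks = splitter.apply(documents);

// A custom transformer is just another function over the document list,
// here tagging every chunk with a hypothetical source label.
DocumentTransformer sourceTagger = docs -> {
    docs.forEach(doc -> doc.getMetadata().put("source", "sample.pdf"));
    return docs;
};
List<Document> taggedChunks = sourceTagger.apply(chunks);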
DocumentWriter Interface
- Consumer<List<Document>>: Manages the final stage of the ETL process, preparing documents for storage.
- VectorStore: The abstracted interface for vector database interactions.
- MilvusVectorStore: An implementation for the Milvus vector database.
- PgVectorStore: Provides vector storage capabilities using PostgreSQL.
- SimplePersistentVectorStore: A straightforward approach to persistent vector storage.
- InMemoryVectorStore: Enables rapid access with in-memory storage solutions.
- Neo4jVectorStore: Leverages the Neo4j graph database for vector storage.
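Because a DocumentWriter is a Consumer of document batches, the load step is a single call. The sketch below assumes that a VectorStore implementation (for example PgVectorStore) has already been configured and that the transformed documents from the previous step are available as chunks.

// Any VectorStore from the list above can serve as the DocumentWriter for the load step.
DocumentWriter writer = vectorStore;

// Hand the transformed documents to the store, which typically embeds and persists them.
writer.accept(chunks);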
Using PagePdfDocumentReader
PagePdfDocumentReader pdfReader = new PagePdfDocumentReader(
        "file:document-readers/pdf-reader/src/test/resources/sample.pdf",
        PdfDocumentReaderConfig.builder()
                // No extra top or bottom page margin.
                .withPageTopMargin(0)
                .withPageBottomMargin(0)
                // Drop the last 3 lines of extracted text on every page (e.g. footers),
                // starting from the first page.
                .withPageExtractedTextFormatter(PageExtractedTextFormatter.builder()
                        .withNumberOfTopTextLinesToDelete(0)
                        .withNumberOfBottomTextLinesToDelete(3)
                        .withNumberOfTopPagesToSkipBeforeDelete(0)
                        .build())
                // Emit one Document per PDF page.
                .withPagesPerDocument(1)
                .build());

var documents = pdfReader.get();

PdfTestUtils.writeToFile("document-readers/pdf-reader/target/sample.txt", documents, false);
Using ParagraphPdfDocumentReader

public static void main(String[] args) throws IOException {

    // Groups the extracted text into one Document per paragraph.
    ParagraphPdfDocumentReader pdfReader = new ParagraphPdfDocumentReader(
            "file:document-readers/pdf-reader/src/test/resources/sample2.pdf",
            PdfDocumentReaderConfig.builder()
                    // Optional tuning:
                    // .withPageBottomMargin(15)
                    // .withReversedParagraphPosition(true)
                    // .withTextLeftAlignment(true)
                    .build());

    // Alternative: a larger document with margin and paragraph-position tuning.
    // ParagraphPdfDocumentReader pdfReader = new ParagraphPdfDocumentReader(
    //         "file:document-readers/pdf-reader/src/test/resources/spring-framework.pdf",
    //         PdfDocumentReaderConfig.builder()
    //                 .withPageBottomMargin(15)
    //                 .withReversedParagraphPosition(true)
    //                 // .withTextLeftAlignment(true)
    //                 .build());

    // Alternative: page-based reading with larger page margins.
    // PdfDocumentReader pdfReader = new PdfDocumentReader(
    //         "file:document-readers/pdf-reader/src/test/resources/uber-k-10.pdf",
    //         PdfDocumentReaderConfig.builder().withPageTopMargin(80).withPageBottomMargin(60).build());

    var documents = pdfReader.get();

    writeToFile("document-readers/pdf-reader/target/sample2.txt", documents, true);
    System.out.println(documents.size());
}
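Note that ParagraphPdfDocumentReader relies on the PDF catalog (for example, a table of contents) to locate paragraph boundaries, so it is only suitable for PDFs that carry that structure; PagePdfDocumentReader is the more general choice for arbitrary PDFs.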