ETL Pipeline

The Extract, Transform, and Load (ETL) framework serves as the backbone of data processing within the Retrieval-Augmented Generation (RAG) use case.

The ETL pipeline orchestrates the flow from raw data sources to a structured vector store, ensuring data is in the optimal format for retrieval by the AI model.

The RAG use case is designed to augment the capabilities of generative models by retrieving relevant information from a body of data to enhance the quality and relevance of the generated output.

API Overview

DocumentReader

public interface DocumentReader extends Supplier<List<Document>> {

}

DocumentTransformer

public interface DocumentTransformer extends Function<List<Document>, List<Document>> {

}

DocumentWriter

public interface DocumentWriter extends Consumer<List<Document>> {

}
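
Because the three interfaces extend standard java.util.function types, a pipeline can be composed by feeding the reader's output through a transformer and into a writer. The sketch below illustrates that idea under some assumptions: the Document class is a hypothetical stand-in that holds only text, the interfaces are re-declared locally so the snippet compiles on its own, and the trimming transformer and list-backed writer are illustrative rather than framework implementations.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Function;
import java.util.function.Supplier;
import java.util.stream.Collectors;

// Hypothetical stand-in for the framework's Document type; only text content is modelled.
final class Document {
    final String content;

    Document(String content) {
        this.content = content;
    }
}

// Local re-declarations so the sketch is self-contained.
interface DocumentReader extends Supplier<List<Document>> {}

interface DocumentTransformer extends Function<List<Document>, List<Document>> {}

interface DocumentWriter extends Consumer<List<Document>> {}

class EtlCompositionSketch {

    // Runs extract -> transform -> load once and returns what the writer received.
    static List<Document> run() {
        List<Document> sink = new ArrayList<>();

        // Extract: an illustrative reader that supplies two hard-coded documents.
        DocumentReader reader = () -> List.of(new Document("  first  "), new Document("second"));

        // Transform: an illustrative transformer that trims surrounding whitespace.
        DocumentTransformer trimmer = docs -> docs.stream()
                .map(d -> new Document(d.content.trim()))
                .collect(Collectors.toList());

        // Load: an illustrative writer that appends every batch to an in-memory list.
        DocumentWriter writer = sink::addAll;

        writer.accept(trimmer.apply(reader.get()));
        return sink;
    }
}
```

Each stage only depends on the functional type of its neighbour, so any reader, transformer, or writer with a matching signature could be swapped in without changing the composition line.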

Available Implementations

DocumentReader Interface

Supplier<List<Document>>

+ Provides a source of documents from diverse origins.

JsonReader

+ Parses documents in JSON format.

TextReader

+ Processes plain text documents.

Document

+ Represents the core data structure manipulated throughout the pipeline.

DocumentTransformer Interface

Function<List<Document>, List<Document>>

+ Transforms a batch of documents as part of the processing workflow.

TextSplitter

+ Divides documents to fit the AI model’s context window.

TokenTextSplitter

+ Splits documents while preserving token-level integrity.

ContentFormatTransformer

+ Ensures uniform content formats across all documents.

KeywordMetadataEnricher

+ Augments documents with essential keyword metadata.

SummaryMetadataEnricher

+ Enriches documents with summarization metadata for enhanced retrieval.
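
To make the transformer role concrete, here is a minimal sketch of a splitting step. It is not the TextSplitter or TokenTextSplitter algorithm: it cuts on a fixed character count rather than on tokens, and it uses plain strings in place of Document objects. The class name FixedSizeSplitterSketch and the maxChars parameter are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Illustrative splitter: breaks each input text into chunks of at most maxChars characters.
// Assumption: character counts approximate the size limit; a real splitter would count tokens.
class FixedSizeSplitterSketch implements Function<List<String>, List<String>> {

    private final int maxChars;

    FixedSizeSplitterSketch(int maxChars) {
        if (maxChars <= 0) {
            throw new IllegalArgumentException("maxChars must be positive");
        }
        this.maxChars = maxChars;
    }

    @Override
    public List<String> apply(List<String> texts) {
        List<String> chunks = new ArrayList<>();
        for (String text : texts) {
            // Walk the text in steps of maxChars; the last chunk may be shorter.
            for (int start = 0; start < text.length(); start += maxChars) {
                int end = Math.min(text.length(), start + maxChars);
                chunks.add(text.substring(start, end));
            }
        }
        return chunks;
    }
}
```

A call such as new FixedSizeSplitterSketch(4).apply(List.of("abcdefghij")) would yield the chunks "abcd", "efgh", and "ij". Cutting at fixed offsets can split words in half, which is one reason a token-aware splitter is preferable in practice.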

DocumentWriter Interface

Consumer<List<Document>>

+ Manages the final stage of the ETL process, preparing documents for storage.

VectorStore

+ The abstracted interface for vector database interactions.

MilvusVectorStore

+ An implementation for the Milvus vector database.

PgVectorStore

+ Provides vector storage capabilities using PostgreSQL.

SimplePersistentVectorStore

+ A straightforward approach to persistent vector storage.

InMemoryVectorStore

+ Enables rapid access with in-memory storage solutions.

Neo4jVectorStore

+ Leverages the Neo4j graph database for vector storage.
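
The writer role can be pictured with a toy in-memory store. The sketch below is an assumption-laden illustration, not the InMemoryVectorStore implementation: embeddings are supplied by the caller instead of being computed by an embedding model, all vectors are assumed to be non-zero and of equal length, and the names EmbeddedText, InMemoryStoreSketch, and mostSimilar are hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical pairing of a text with a precomputed embedding vector.
final class EmbeddedText {
    final String text;
    final double[] vector;

    EmbeddedText(String text, double[] vector) {
        this.text = text;
        this.vector = vector;
    }
}

// Illustrative writer: accepts batches and keeps them in memory for similarity lookups.
class InMemoryStoreSketch implements Consumer<List<EmbeddedText>> {

    private final List<EmbeddedText> entries = new ArrayList<>();

    @Override
    public void accept(List<EmbeddedText> batch) {
        entries.addAll(batch);
    }

    // Returns the texts of the topK stored entries ranked by cosine similarity to the query.
    List<String> mostSimilar(double[] query, int topK) {
        List<EmbeddedText> sorted = new ArrayList<>(entries);
        sorted.sort(Comparator.comparingDouble((EmbeddedText e) -> cosine(e.vector, query)).reversed());
        List<String> result = new ArrayList<>();
        for (EmbeddedText e : sorted.subList(0, Math.min(topK, sorted.size()))) {
            result.add(e.text);
        }
        return result;
    }

    // Cosine similarity; assumes both vectors are non-zero and have the same length.
    static double cosine(double[] a, double[] b) {
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}
```

The linear scan keeps the sketch short; the database-backed stores listed above exist precisely because this approach does not scale to large collections or survive a restart.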

Using PDF Reader

Using PagePdfDocumentReader

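// Read the PDF one page at a time.
PagePdfDocumentReader pdfReader = new PagePdfDocumentReader(
	"file:document-readers/pdf-reader/src/test/resources/sample.pdf",
	PdfDocumentReaderConfig.builder()
			.withPageTopMargin(0)
			.withPageBottomMargin(0)
			.withPageExtractedTextFormatter(PageExtractedTextFormatter.builder()
					.withNumberOfTopTextLinesToDelete(0)
					// Drop the last three text lines of each page (for example, a repeated footer).
					.withNumberOfBottomTextLinesToDelete(3)
					.withNumberOfTopPagesToSkipBeforeDelete(0)
					.build())
			// Emit one Document per PDF page.
			.withPagesPerDocument(1)
			.build());

// DocumentReader is a Supplier, so get() returns the extracted List<Document>.
var documents = pdfReader.get();

// PdfTestUtils is a test helper; here it writes the extracted documents to a text file.
PdfTestUtils.writeToFile("document-readers/pdf-reader/target/sample.txt", documents, false);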

Using ParagraphPdfDocumentReader

public static void main(String[] args) throws IOException {

	// The commented-out builder calls are optional settings that can be enabled as needed.
	ParagraphPdfDocumentReader pdfReader = new ParagraphPdfDocumentReader(
			"file:document-readers/pdf-reader/src/test/resources/sample2.pdf",
			PdfDocumentReaderConfig.builder()
					// .withPageBottomMargin(15)
					// .withReversedParagraphPosition(true)
					// .withTextLeftAlignment(true)
					.build());

	var documents = pdfReader.get();

	// writeToFile is a helper defined outside this snippet.
	writeToFile("document-readers/pdf-reader/target/sample2.txt", documents, true);
	System.out.println(documents.size());

}