Dimensionality Reduction in Embedding Spaces: A Comparative Analysis of DCT and SVD for RAG Systems
Abstract
We present a methodological framework for evaluating dimensionality reduction techniques in dense embedding spaces for Retrieval-Augmented Generation (RAG) systems. Specifically, we examine the Discrete Cosine Transform (DCT) and Singular Value Decomposition (SVD) applied to the column space of embedding matrices, followed by quantization. Our evaluation protocol centers on retrieval consistency: measuring whether reduced representations retrieve the same contextual chunks as full-dimensional embeddings. This work was conducted in collaboration with Ethan Davis at SAS Institute Inc.
Introduction
Dense embeddings from transformer-based models typically operate in high-dimensional spaces (d ∈ {384, 768, 1024, 1536}). While these representations capture rich semantic information, their dimensionality poses computational and storage challenges. This work examines column-space reduction techniques that aim to preserve retrieval quality in RAG pipelines.
Theoretical Framework
Column Space Reduction
Given an embedding matrix \(\mathbf{E} \in \mathbb{R}^{n \times d}\) where \(n\) represents the number of documents and \(d\) the embedding dimension, we seek a transformation \(\mathbf{T}: \mathbb{R}^d \rightarrow \mathbb{R}^k\) where \(k \ll d\).
Discrete Cosine Transform (DCT)
The DCT provides an orthogonal transformation that concentrates signal energy in low-frequency components:
\(\mathbf{E}' = \mathbf{E} \mathbf{D}^T\)
where \(\mathbf{D} \in \mathbb{R}^{d \times d}\) is the DCT matrix. We retain the first \(k\) columns of \(\mathbf{E}'\), exploiting the energy compaction property commonly observed in natural signals.
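To make the reduction concrete, the following sketch applies an orthonormal DCT-II along the embedding axis with scipy.fft.dct and keeps the first \(k\) coefficients. The matrix shape, the value of \(k\), and the variable names are illustrative assumptions, not part of the evaluation protocol.

import numpy as np
from scipy.fft import dct

# Illustrative stand-in for a real embedding matrix (n documents, d dimensions)
rng = np.random.default_rng(0)
embeddings = rng.standard_normal((1000, 768)).astype(np.float32)
k = 128  # target reduced dimension (assumed for demonstration)

# Orthonormal DCT-II along the embedding axis, i.e. E' = E D^T
dct_coeffs = dct(embeddings, type=2, norm="ortho", axis=1)

# Energy compaction: keep only the first k (low-frequency) coefficients
reduced_embeddings = dct_coeffs[:, :k]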
Singular Value Decomposition (SVD)
We apply SVD directly to the embedding matrix for dimensionality reduction:
- Compute SVD: \(\mathbf{E} = \mathbf{U} \boldsymbol{\Sigma} \mathbf{V}^T\)
- Truncate to \(k\) components: \(\mathbf{E}_k = \mathbf{U}_k \boldsymbol{\Sigma}_k \mathbf{V}_k^T\)
- Project embeddings: \(\mathbf{E}' = \mathbf{E} \mathbf{V}_k = \mathbf{U}_k \boldsymbol{\Sigma}_k\)
where \(\mathbf{V}_k \in \mathbb{R}^{d \times k}\) contains the first \(k\) right singular vectors, effectively projecting from \(\mathbb{R}^d\) to \(\mathbb{R}^k\).
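A minimal NumPy sketch of this projection, assuming the embedding matrix fits in memory; the practical point is that queries must be projected with the same \(\mathbf{V}_k\) fitted on the corpus so that full and reduced retrieval remain comparable.

import numpy as np

def fit_svd_projection(embeddings: np.ndarray, k: int) -> np.ndarray:
    """Return V_k (shape d x k): the top-k right singular vectors of the embedding matrix."""
    _, _, vt = np.linalg.svd(embeddings, full_matrices=False)
    return vt[:k].T

# Hypothetical usage: fit once on the corpus, reuse the same projection for queries
# V_k = fit_svd_projection(corpus_embeddings, k=128)
# reduced_corpus = corpus_embeddings @ V_k
# reduced_queries = query_embeddings @ V_k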
Quantization Strategies
Post-reduction quantization further compresses representations:
Scalar Quantization
- Map continuous values to discrete levels
- \(B\)-bit quantization yields \(2^B\) levels
- Consider the Lloyd-Max quantizer for optimal level placement based on the value distribution (a minimal uniform-quantization sketch follows below)
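As a minimal illustration, the sketch below implements plain uniform \(B\)-bit scalar quantization (not Lloyd-Max); the function names and the uint8 code type are our own assumptions.

import numpy as np

def scalar_quantize(x: np.ndarray, bits: int = 8):
    """Uniformly map values to 2**bits levels over the observed range."""
    levels = 2 ** bits
    lo, hi = float(x.min()), float(x.max())
    scale = (hi - lo) / (levels - 1) if hi > lo else 1.0
    codes = np.round((x - lo) / scale).astype(np.uint8)  # assumes bits <= 8
    return codes, lo, scale

def scalar_dequantize(codes: np.ndarray, lo: float, scale: float) -> np.ndarray:
    """Approximate reconstruction of the original values."""
    return codes.astype(np.float32) * scale + lo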
Vector Quantization
- k-means clustering in reduced space
- Codebook size determines compression ratio
- Product quantization for large-scale applications (see the sketch after this list)
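A compact product-quantization sketch using scikit-learn's KMeans, assuming the reduced dimension is divisible by the number of subspaces m; the parameter values are illustrative.

import numpy as np
from sklearn.cluster import KMeans

def train_product_quantizer(x: np.ndarray, m: int = 8, codebook_size: int = 256):
    """Split each vector into m subvectors and learn one k-means codebook per subspace."""
    n, d = x.shape
    assert d % m == 0, "reduced dimension must be divisible by the number of subspaces"
    sub_dim = d // m
    codebooks, codes = [], np.empty((n, m), dtype=np.uint8)
    for i in range(m):
        sub = x[:, i * sub_dim:(i + 1) * sub_dim]
        km = KMeans(n_clusters=codebook_size, n_init=4, random_state=0).fit(sub)
        codebooks.append(km.cluster_centers_)   # (codebook_size, sub_dim)
        codes[:, i] = km.labels_                # one byte per subspace
    return codebooks, codes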
Proposed Evaluation Methodology
Retrieval Consistency Metric
For query \(q\), let:
- \(\mathcal{R}_{\text{full}}(q, k)\) = top-\(k\) retrieved chunks using full embeddings
- \(\mathcal{R}_{\text{reduced}}(q, k)\) = top-\(k\) retrieved chunks using reduced embeddings
We define retrieval consistency as:
\(\text{RC@}k = \frac{|\mathcal{R}_{\text{full}}(q, k) \cap \mathcal{R}_{\text{reduced}}(q, k)|}{k}\)
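Computing RC@\(k\) for a single query reduces to a set intersection; a minimal sketch (function and argument names are ours, not a fixed API):

def retrieval_consistency(full_ids, reduced_ids, k: int) -> float:
    """RC@k: fraction of top-k chunk ids shared by full and reduced retrieval."""
    return len(set(full_ids[:k]) & set(reduced_ids[:k])) / k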
Experimental Protocol
- Corpus Preparation: Segment documents into semantically coherent chunks
- Embedding Generation: Apply pre-trained encoder (e.g., BERT, Sentence-BERT, E5)
- Dimensionality Reduction:
  - For SVD: compute the projection \(\mathbf{V}_k\) on a training subset, then apply it to the full corpus
  - For DCT: direct application of the fixed transformation matrix
- Quantization: Apply scalar or vector quantization to the reduced embeddings
- Evaluation (a minimal evaluation loop is sketched after this list):
  - Sample a diverse query set
  - Compare retrieval sets between the full and reduced representations
  - Measure RC@\(k\) for \(k \in \{1, 5, 10, 20, 50\}\)
  - Analyze the distribution of retrieval rank changes
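The sketch below illustrates this evaluation loop under the assumption of brute-force cosine retrieval over L2-normalized embeddings; the four input arrays are hypothetical outputs of the preceding steps, and an approximate-nearest-neighbor index could replace the exact search.

import numpy as np

def topk_ids(query_vecs: np.ndarray, corpus_vecs: np.ndarray, k: int) -> np.ndarray:
    """Indices of the k most cosine-similar corpus rows per query (inputs pre-normalized)."""
    scores = query_vecs @ corpus_vecs.T
    return np.argsort(-scores, axis=1)[:, :k]

def mean_rc_at_k(queries_full, queries_reduced, corpus_full, corpus_reduced,
                 ks=(1, 5, 10, 20, 50)):
    """Average RC@k over the query set for each cutoff k."""
    results = {}
    for k in ks:
        full = topk_ids(queries_full, corpus_full, k)
        reduced = topk_ids(queries_reduced, corpus_reduced, k)
        rc = [len(set(f) & set(r)) / k for f, r in zip(full, reduced)]
        results[k] = float(np.mean(rc))
    return results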
Additional Metrics
Beyond retrieval consistency, consider:
- Semantic Similarity Preservation: correlation between cosine similarities in the original vs. the reduced space (sketched below)
- Computational Efficiency: indexing time, query latency, memory footprint
- Reconstruction Error: \(\|\mathbf{E} - \mathbf{E}'\mathbf{T}^{\dagger}\|_F\), where \(\mathbf{T}^{\dagger}\) is the pseudo-inverse of the reduction transform
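For the first of these, one simple check is the rank correlation between pairwise cosine similarities before and after reduction on a sampled subset; the sketch below uses Spearman correlation, which is one reasonable choice among several.

import numpy as np
from scipy.stats import spearmanr

def similarity_preservation(full: np.ndarray, reduced: np.ndarray,
                            sample: int = 500, seed: int = 0) -> float:
    """Spearman correlation of pairwise cosine similarities in the two spaces."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(full.shape[0], size=min(sample, full.shape[0]), replace=False)

    def pairwise_cosine(x: np.ndarray) -> np.ndarray:
        x = x / np.linalg.norm(x, axis=1, keepdims=True)
        sims = x @ x.T
        return sims[np.triu_indices_from(sims, k=1)]

    return spearmanr(pairwise_cosine(full[idx]), pairwise_cosine(reduced[idx])).correlation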
Theoretical Considerations
Information-Theoretic Perspective
The fundamental question: what is the intrinsic dimensionality of semantic embeddings? Rate-distortion theory suggests embeddings contain redundancy that can be exploited.
DCT vs SVD: Key Differences
SVD:
- Data-dependent transformation
- Optimal linear dimensionality reduction (in the Frobenius norm)
- Requires computing a singular value decomposition of the corpus embeddings
- Captures global variance structure
DCT:
- Data-independent transformation
- Assumes local smoothness in the embedding space
- \(O(d \log d)\) computation per embedding via the FFT
- Natural frequency interpretation
Implementation Considerations
Computational Complexity
- SVD: \(O(\min(n^2d, nd^2))\) for full decomposition
- DCT: \(O(nd \log d)\) via FFT implementation
- Quantization: \(O(nd)\) for scalar, \(O(n d k_c)\) per iteration for vector quantization with a \(k_c\)-entry codebook
Numerical Stability
When implementing SVD-based reduction on high-dimensional embeddings:

import numpy as np

# Direct (thin) SVD on the embedding matrix
U, S, Vt = np.linalg.svd(embeddings, full_matrices=False)

# Project onto the top-k right singular vectors
projection_matrix = Vt[:k].T
reduced_embeddings = embeddings @ projection_matrix
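For large corpora where a dense SVD is costly or numerically delicate, a randomized truncated SVD is a practical alternative; the sketch below uses scikit-learn's TruncatedSVD, which, like the direct formulation above, does not mean-center the data. The library choice and variable names are our own assumptions.

from sklearn.decomposition import TruncatedSVD

# Fit the projection on a training subset, then apply it to the corpus and queries
svd = TruncatedSVD(n_components=k, algorithm="randomized", random_state=0)
svd.fit(train_embeddings)                     # train_embeddings: assumed training subset
reduced_corpus = svd.transform(embeddings)    # equivalent to embeddings @ V_k
reduced_queries = svd.transform(query_embeddings)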
Open Research Questions
- Optimal Dimensionality Selection: How to determine k without extensive empirical search?
- Query-Dependent Reduction: Can we adaptively select dimensions based on query characteristics?
- Non-linear Alternatives: Would manifold learning techniques preserve retrieval quality better?
- End-to-End Learning: Can we train embeddings aware of downstream compression?
Conclusion
This work outlines a rigorous framework for evaluating dimensionality reduction techniques in RAG systems. The trade-off between computational efficiency and retrieval quality remains an active area of research. Future work should focus on establishing theoretical guarantees and developing adaptive compression strategies.
Acknowledgments
This research was conducted in collaboration with Ethan Davis at SAS Institute Inc. We thank the SAS Institute for computational resources and support.
References
- Johnson, W. B., & Lindenstrauss, J. (1984). Extensions of Lipschitz mappings into a Hilbert space.
- Jegou, H., Douze, M., & Schmid, C. (2011). Product quantization for nearest neighbor search.
- Reimers, N., & Gurevych, I. (2019). Sentence-BERT: Sentence embeddings using Siamese BERT-networks.
This methodology paper outlines ongoing research. Implementations and empirical results will be shared upon completion of experiments.