# UMAP Recalculation Strategy

## Problem Statement

When creating the 3D thought galaxy visualization, we need to convert high-dimensional AI embeddings (3072 dimensions from `gemini-embedding-001`) into 3D coordinates that can be displayed in the browser.

### The Challenge

**Question:** Should we calculate coordinates incrementally (one node at a time) or recalculate ALL nodes together every time?

**Initial broken approach:**

```sql
-- Only calculate for nodes without coordinates
SELECT id, embedding FROM node
WHERE user_did = $userDid
  AND embedding != NONE
  AND coords_3d = NONE
```

This caused a bug where:

1. Nodes 1-3: Calculate together → ✓ Get coords
2. Nodes 4-5: Try to calculate separately → ✗ FAILS (only 2 points, UMAP needs 3+)

## Why UMAP Requires Recalculation

### What is UMAP?

UMAP (Uniform Manifold Approximation and Projection) is a **non-linear manifold learning** algorithm. Unlike linear methods (PCA), UMAP:

1. **Learns the "shape" (manifold) of your data** - It finds clusters, relationships, and patterns
2. **Creates relative, not absolute coordinates** - There's no fixed origin or coordinate system
3. **Requires seeing all data together** - The manifold structure changes as you add more data

### Why Incremental Doesn't Work

**Problem with fixed origin approach:**

UMAP's initialization and optimization are stochastic, so unless the random seed is fixed, two runs over the same input produce different layouts.

```python
# Illustrative pseudocode: each run produces DIFFERENT coordinates!
# Run 1: UMAP([node1, node2, node3]) -> coords_A
# Run 2: UMAP([node1, node2, node3]) -> coords_B  (different from coords_A)
#
# There's no absolute coordinate system:
# Run 1: node1 at [0.5, 0.2, 0.8]
# Run 2: node1 at [2.1, -1.3, 0.4]  (completely different)
```

The positions are only meaningful **relative to each other**. You can't have a "fixed origin" because UMAP learns a relative manifold structure.

**Why you need 3+ points:**

- UMAP is a manifold learning algorithm
- A manifold requires multiple points to define a shape
- With only 1-2 points, there's no "manifold" to learn

### What About UMAP.transform()?
UMAP does support an incremental `transform()` method:

```python
from umap import UMAP  # from the umap-learn package

# Fit once, save the model (initial_embeddings: shape (n_nodes, 3072))
umap_model = UMAP(n_components=3)
umap_model.fit(initial_embeddings)

# Transform new points into existing space (new_embedding must be a 2D array)
new_coords = umap_model.transform(new_embedding)
```

**Why we're NOT using this:**

1. **Model storage complexity** - Must store entire UMAP model (includes all training data) in database
2. **Model drift** - New nodes get approximate positions based on old manifold structure
3. **Loss of quality** - The manifold changes as you add data; transform() doesn't update it
4. **Performance** - For <100 nodes, full recalculation is fast (roughly 1.5 seconds or less, per the approximate timings under Performance Characteristics)

## Our Solution: Full Recalculation

### Implementation

```sql
-- Recalculate ALL nodes every time
SELECT id, embedding FROM node
WHERE user_did = $userDid
  AND embedding != NONE
-- No "coords_3d = NONE" filter!
```

### Behavior

When you add a new node:

1. Fetch ALL nodes with embeddings (including those with existing coords)
2. Run UMAP on the complete dataset
3. Update ALL nodes with their recalculated positions

**Result:** The galaxy "reorganizes" when you add new thoughts - existing nodes WILL move, because every run produces a fresh relative layout.

### Trade-offs

**Pros:**

- ✅ Always mathematically correct
- ✅ Simple implementation
- ✅ No model storage complexity
- ✅ Best clustering quality (manifold adapts to new data)
- ✅ Fast enough for <100 nodes

**Cons:**

- ❌ Galaxy shifts when adding nodes (existing nodes move)
- ❌ Roughly quadratic growth in calculation time, judging by the approximate timings below (slower with many nodes)
- ❌ More database writes

### Performance Characteristics

```
Nodes | Calculation Time | Acceptable?
------|------------------|--------------------------------
3     | ~50ms            | ✅ Excellent
10    | ~200ms           | ✅ Great
50    | ~800ms           | ✅ Good
100   | ~1.5s            | ✅ Acceptable
500   | ~15s             | ⚠️ Slow (consider optimization)
1000+ | ~60s+            | ❌ Too slow (need incremental)
```

For the Ponderants MVP, we expect users to have <100 nodes, making full recalculation perfectly acceptable.
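
The fetch, project, and update steps of full recalculation can be sketched as a small pure function. This is a minimal sketch, not the code in the actual route: `EmbeddedNode`, `Projector`, `recalculateAllCoords`, and `MIN_POINTS` are hypothetical names, and the projection step is injected as a function so the example runs without any dependency. In the real route the projector would presumably wrap the UMAP call; the stand-in used here is not UMAP.

```typescript
// Hypothetical names throughout: nothing here is taken from the Ponderants codebase.

type Vec3 = [number, number, number];

interface EmbeddedNode {
  id: string;
  embedding: number[];
}

// Maps N embeddings to N 3D points. Assumed to wrap the UMAP call in practice;
// injected here so the sketch stays dependency-free.
type Projector = (embeddings: number[][]) => Vec3[];

// Minimum node count, taken from the "3+ points" constraint described earlier.
const MIN_POINTS = 3;

// Recomputes coordinates for ALL nodes in one pass and returns id -> coords.
// Returns an empty map when there are too few nodes to project.
function recalculateAllCoords(
  nodes: EmbeddedNode[],
  project: Projector
): Map<string, Vec3> {
  const result = new Map<string, Vec3>();
  if (nodes.length < MIN_POINTS) {
    return result;
  }

  const coords = project(nodes.map((node) => node.embedding));
  if (coords.length !== nodes.length) {
    throw new Error("Projector returned a different number of points than nodes");
  }

  // Every node gets a fresh position, including nodes that already had one.
  nodes.forEach((node, i) => result.set(node.id, coords[i]));
  return result;
}

// Stand-in projector for demonstration only (NOT UMAP): keeps the first
// three components of each embedding.
const firstThree: Projector = (rows) =>
  rows.map((row): Vec3 => [row[0], row[1], row[2]]);

const sampleNodes: EmbeddedNode[] = [
  { id: "n1", embedding: [0.1, 0.2, 0.3, 0.4] },
  { id: "n2", embedding: [0.5, 0.6, 0.7, 0.8] },
  { id: "n3", embedding: [0.9, 1.0, 1.1, 1.2] },
  { id: "n4", embedding: [1.3, 1.4, 1.5, 1.6] },
];

const updated = recalculateAllCoords(sampleNodes, firstThree);
const tooFew = recalculateAllCoords(sampleNodes.slice(0, 2), firstThree);
```

Returning an empty map for fewer than three nodes avoids the failure seen with nodes 4-5 in the original bug: a small batch is never projected on its own, because the caller always passes the complete node set.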

## Future Optimizations

If we reach a scale where recalculation becomes too slow:

### Option 1: UMAP.transform() with Periodic Refitting

```typescript
// Store UMAP model in database
// Transform new nodes incrementally
// Every 10 nodes: Refit the entire model
if (newNodeCount % 10 === 0) {
  recalculateAllNodes();
}
```

### Option 2: Switch to PCA

- PCA is linear and supports incremental updates
- Loses UMAP's superior clustering quality
- Use for very large datasets (1000+ nodes)

### Option 3: Hierarchical UMAP

- Cluster nodes into groups
- Run UMAP on each cluster separately
- Use a higher-level UMAP to arrange clusters
- Complex but scales to millions of nodes

## User Experience

The galaxy "reorganizing" when you add nodes is actually a **feature, not a bug**:

- It shows your thought network evolving
- New connections emerge as you add ideas
- Clusters naturally form around related concepts
- Creates a sense of a living, breathing knowledge graph

Users will see their constellation of thoughts naturally reorganize as their ideas grow, which aligns with the "Ponderants" brand of exploring and structuring ideas.

## References

- [UMAP Documentation](https://umap-learn.readthedocs.io/)
- [umap-js Library](https://github.com/PAIR-code/umap-js)
- [Understanding UMAP](https://pair-code.github.io/understanding-umap/)
- [How Exactly UMAP Works](https://towardsdatascience.com/how-exactly-umap-works-13e3040e1668)

## Decision Log

- **2025-01-10**: Discovered bug where nodes 4-5 failed to get coordinates
- **2025-01-10**: Analyzed UMAP manifold learning constraints
- **2025-01-10**: Decided to implement full recalculation strategy
- **2025-01-10**: Updated `/app/api/calculate-graph/route.ts` to remove `coords_3d = NONE` filter