# UMAP Recalculation Strategy

## Problem Statement

When creating the 3D thought galaxy visualization, we need to convert high-dimensional AI embeddings (3072 dimensions from `gemini-embedding-001`) into 3D coordinates that can be displayed in the browser.

### The Challenge

**Question:** Should we calculate coordinates incrementally (one node at a time) or recalculate ALL nodes together every time?

**Initial broken approach:**
```sql
-- Only calculate for nodes without coordinates
SELECT id, embedding FROM node
WHERE user_did = $userDid
  AND embedding != NONE
  AND coords_3d = NONE
```

This caused a bug:
1. Nodes 1-3: Calculated together → ✓ Get coords
2. Nodes 4-5: The only rows still matching `coords_3d = NONE`, so they are calculated on their own → ✗ FAILS (only 2 points, UMAP needs 3+)
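
The failure can be reproduced without UMAP at all, just by counting how many rows each query variant hands to the projection step. The sketch below is illustrative only: `NodeRow`, `selectForProjection`, and `MIN_POINTS` are hypothetical names, and the threshold of 3 is taken from the failure described above rather than from any library constant.

```typescript
interface NodeRow {
  id: string;
  hasCoords: boolean; // true once coords_3d has been written
}

// Assumption: smallest batch that worked in the failure described above.
const MIN_POINTS = 3;

// Mirrors the two query variants: with or without the "coords_3d = NONE" filter.
function selectForProjection(rows: NodeRow[], onlyMissingCoords: boolean): NodeRow[] {
  return onlyMissingCoords ? rows.filter((row) => !row.hasCoords) : rows;
}

// Nodes 1-3 already have coordinates; nodes 4-5 were just added.
const exampleRows: NodeRow[] = [
  { id: "node1", hasCoords: true },
  { id: "node2", hasCoords: true },
  { id: "node3", hasCoords: true },
  { id: "node4", hasCoords: false },
  { id: "node5", hasCoords: false },
];

const filteredBatch = selectForProjection(exampleRows, true);
const fullBatch = selectForProjection(exampleRows, false);

console.log(filteredBatch.length >= MIN_POINTS); // false: only 2 points reach UMAP
console.log(fullBatch.length >= MIN_POINTS); // true: all 5 points are projected together
```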

## Why UMAP Requires Recalculation

### What is UMAP?

UMAP (Uniform Manifold Approximation and Projection) is a **non-linear manifold learning** algorithm. Unlike linear methods (PCA), UMAP:

1. **Learns the "shape" (manifold) of your data** - It finds clusters, relationships, and patterns
2. **Creates relative, not absolute coordinates** - There's no fixed origin or coordinate system
3. **Requires seeing all data together** - The manifold structure changes as you add more data

### Why Incremental Doesn't Work

**Problem with fixed origin approach:**
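```python
import numpy as np
from umap import UMAP  # umap-learn

embeddings = np.random.default_rng(0).normal(size=(30, 3072))  # stand-in data

# Without a fixed random_state, each run produces DIFFERENT coordinates!
coords_a = UMAP(n_components=3).fit_transform(embeddings)  # Run 1
coords_b = UMAP(n_components=3).fit_transform(embeddings)  # Run 2: DIFFERENT!

# There's no absolute coordinate system (illustrative values):
# Run 1: node1 at [0.5, 0.2, 0.8]
# Run 2: node1 at [2.1, -1.3, 0.4]  # Completely different!
```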

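The positions are only meaningful **relative to each other**. You can't have a "fixed origin" because UMAP learns a relative manifold structure. Fixing the random seed makes a run reproducible for identical input, but adding even one node changes the input and therefore the whole layout.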
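
To make "relative, not absolute" concrete, the sketch below takes one layout, rotates and shifts it, and checks that every pairwise distance is unchanged even though every coordinate is different. The coordinates and helper names (`rotateAndShift`, `pairwiseDistances`) are made up for illustration. Two real UMAP runs generally differ by more than a rigid motion, so this only shows why raw coordinate values carry no meaning on their own.

```typescript
type Vec3 = [number, number, number];

function distance(a: Vec3, b: Vec3): number {
  return Math.hypot(a[0] - b[0], a[1] - b[1], a[2] - b[2]);
}

// Rotate 90 degrees about the z axis, then translate by (5, -2, 1).
function rotateAndShift(p: Vec3): Vec3 {
  const [x, y, z] = p;
  return [-y + 5, x - 2, z + 1];
}

// Distances for every unordered pair, in a stable order.
function pairwiseDistances(layout: Vec3[]): number[] {
  const out: number[] = [];
  layout.forEach((a, i) => {
    layout.slice(i + 1).forEach((b) => out.push(distance(a, b)));
  });
  return out;
}

const layoutA: Vec3[] = [
  [0.5, 0.2, 0.8],
  [1.0, 0.4, 0.1],
  [0.3, 0.9, 0.6],
];
const layoutB: Vec3[] = layoutA.map((p) => rotateAndShift(p));

const distancesA = pairwiseDistances(layoutA);
const distancesB = pairwiseDistances(layoutB);

// Same relative structure, entirely different coordinates.
const sameShape = distancesA.every((d, i) => Math.abs(d - (distancesB[i] ?? NaN)) < 1e-9);
console.log(sameShape);
```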

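**Why you need 3+ points:**
- UMAP starts by building a nearest-neighbor graph, so every point needs other points to be compared against
- A manifold requires multiple points to define a shape
- With only 1-2 points, there's no neighborhood structure, and so no "manifold", to learn
- The exact minimum depends on the neighbor-count setting (`n_neighbors` in umap-learn, `nNeighbors` in umap-js); in our case, batches of fewer than 3 points failed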

### What About UMAP.transform()?

UMAP does support an incremental `transform()` method:
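```python
from umap import UMAP  # umap-learn

# Fit once, save the model
umap_model = UMAP(n_components=3)
umap_model.fit(initial_embeddings)  # array of shape (n_samples, 3072)

# Transform new points into existing space
# transform() expects a 2D array, even for a single new embedding
new_coords = umap_model.transform(new_embeddings)  # (n_new, 3072) -> (n_new, 3)
```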

**Why we're NOT using this:**

1. **Model storage complexity** - Must store entire UMAP model (includes all training data) in database
2. **Model drift** - New nodes get approximate positions based on old manifold structure
3. **Loss of quality** - The manifold changes as you add data; transform() doesn't update it
4. **Performance** - For <100 nodes, full recalculation is fast (<1 second)

## Our Solution: Full Recalculation

### Implementation

```sql
-- Recalculate ALL nodes every time
SELECT id, embedding FROM node
WHERE user_did = $userDid
  AND embedding != NONE
-- No "coords_3d = NONE" filter!
```

### Behavior

When you add a new node:
1. Fetch ALL nodes with embeddings (including those with existing coords)
2. Run UMAP on the complete dataset
3. Update ALL nodes with their recalculated positions

**Result:** The galaxy "reorganizes" when you add new thoughts - existing nodes WILL move, and because the layout is recomputed from scratch, the shift is not guaranteed to be small.
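
The steps above can be sketched as a pure function that takes the fetched rows and a projection function and returns one update per node. This is a sketch under assumptions, not the code in the route: `EmbeddedNode`, `CoordUpdate`, `Projector`, `recalculateAll`, and `takeFirstThree` are hypothetical names, the minimum of 3 comes from the bug described earlier, and the stand-in projector simply keeps the first three dimensions so the example runs without any library.

```typescript
type Vec3 = [number, number, number];

interface EmbeddedNode {
  id: string;
  embedding: number[];
}

interface CoordUpdate {
  id: string;
  coords_3d: Vec3;
}

// Any function that maps n embeddings to n low-dimensional rows.
type Projector = (embeddings: number[][]) => number[][];

const MIN_POINTS = 3; // assumption, taken from the failure described earlier

function recalculateAll(nodes: EmbeddedNode[], project: Projector): CoordUpdate[] {
  // Too few points: skip projection instead of failing.
  if (nodes.length < MIN_POINTS) return [];

  // Step 2: project the complete dataset in one call.
  const projected = project(nodes.map((node) => node.embedding));
  if (projected.length !== nodes.length) {
    throw new Error("projector returned a different number of rows than it was given");
  }

  // Step 3: one update per node, including nodes that already had coordinates.
  return nodes.map((node, i): CoordUpdate => {
    const row = projected[i] ?? [];
    return { id: node.id, coords_3d: [row[0] ?? 0, row[1] ?? 0, row[2] ?? 0] };
  });
}

// Stand-in projector for illustration only. In the real route this would
// presumably be umap-js, e.g. new UMAP({ nComponents: 3 }).fit(embeddings).
const takeFirstThree: Projector = (embeddings) => embeddings.map((e) => e.slice(0, 3));

const demoUpdates = recalculateAll(
  [
    { id: "a", embedding: [1, 2, 3, 4] },
    { id: "b", embedding: [5, 6, 7, 8] },
    { id: "c", embedding: [9, 10, 11, 12] },
  ],
  takeFirstThree
);
console.log(demoUpdates.length); // 3: every node gets a fresh position
```

Keeping the projector as a parameter means the database fetch and write stay outside this function, so the recalculation logic can be exercised without a database or UMAP.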

### Trade-offs

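**Pros:**
- ✅ Every position is computed from the full, current dataset (no stale approximations)
- ✅ Simple implementation
- ✅ No model storage complexity
- ✅ Best clustering quality (manifold adapts to new data)
- ✅ Fast enough for <100 nodes

**Cons:**
- ❌ Galaxy shifts when adding nodes (existing nodes move)
- ❌ Every insert reprocesses all n nodes, and a single run gets disproportionately slower as n grows (roughly O(n²) in the estimates below)
- ❌ More database writes (every node is updated on each insert)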

### Performance Characteristics

| Nodes | Calculation Time | Acceptable? |
|-------|------------------|-------------|
| 3     | ~50ms            | ✅ Excellent |
| 10    | ~200ms           | ✅ Great |
| 50    | ~800ms           | ✅ Good |
| 100   | ~1.5s            | ✅ Acceptable |
| 500   | ~15s             | ⚠️ Slow (consider optimization) |
| 1000+ | ~60s+            | ❌ Too slow (need incremental) |

For the Ponderants MVP, we expect users to have <100 nodes, making full recalculation perfectly acceptable.
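
Since the timings above are approximate, a small harness like the following could be used to check them on real hardware. It is only a sketch: `timeProjection`, `fakeEmbeddings`, and `placeholderProjector` are hypothetical names, the synthetic data is not representative of real embeddings, and the placeholder projector would need to be swapped for the actual UMAP call to produce meaningful numbers.

```typescript
type Projector = (embeddings: number[][]) => number[][];

// Runs the projector once and reports wall-clock time in milliseconds.
function timeProjection(project: Projector, embeddings: number[][]): { ms: number; rows: number } {
  const start = Date.now();
  const coords = project(embeddings);
  return { ms: Date.now() - start, rows: coords.length };
}

// Deterministic synthetic embeddings, for benchmarking only.
function fakeEmbeddings(count: number, dims: number): number[][] {
  return Array.from({ length: count }, (_row, i) =>
    Array.from({ length: dims }, (_col, d) => Math.sin(i * 31 + d))
  );
}

// Placeholder: replace with the real UMAP projection before trusting any timing.
const placeholderProjector: Projector = (embeddings) => embeddings.map((e) => e.slice(0, 3));

for (const count of [3, 10, 50, 100]) {
  const result = timeProjection(placeholderProjector, fakeEmbeddings(count, 3072));
  console.log(`${count} nodes: ${result.ms}ms`);
}
```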

## Future Optimizations

If we reach scale where recalculation becomes too slow:

### Option 1: UMAP.transform() with Periodic Refitting
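```typescript
// Sketch only: both helpers are hypothetical and not yet implemented.
declare function recalculateAllNodes(): void; // full UMAP refit over every node
declare function transformNewNode(): void; // place one node using the stored UMAP model

// Store UMAP model in database
// Transform new nodes incrementally
// Every 10 nodes: Refit the entire model
function onNodeAdded(newNodeCount: number): void {
  if (newNodeCount % 10 === 0) {
    recalculateAllNodes();
  } else {
    transformNewNode();
  }
}
```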

### Option 2: Switch to PCA
- PCA is linear and supports incremental updates
- Loses UMAP's superior clustering quality
- Use for very large datasets (1000+ nodes)
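
The reason PCA supports incremental updates is that a fitted PCA is a fixed linear map: subtract the mean, then take dot products with the component vectors. A new node can be projected without touching existing ones. The sketch below shows only that projection step with a toy 4-dimensional model. `LinearProjection`, `projectPoint`, and the model values are made up for illustration, and fitting the components from data is not shown.

```typescript
interface LinearProjection {
  mean: number[]; // per-dimension mean of the embeddings the model was fitted on
  components: number[][]; // one row per output axis, each as long as the input
}

// Center the embedding, then take its dot product with each component.
function projectPoint(model: LinearProjection, embedding: number[]): number[] {
  return model.components.map((axis) =>
    axis.reduce(
      (sum, weight, d) => sum + weight * ((embedding[d] ?? 0) - (model.mean[d] ?? 0)),
      0
    )
  );
}

// Toy model: 4 input dimensions, 3 output axes.
const toyModel: LinearProjection = {
  mean: [1, 1, 1, 1],
  components: [
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 1, 0],
  ],
};

const existingNode = projectPoint(toyModel, [1, 2, 3, 4]);
const newNode = projectPoint(toyModel, [5, 1, 0, 9]);
const existingNodeAgain = projectPoint(toyModel, [1, 2, 3, 4]);

console.log(existingNode); // [0, 1, 2]
console.log(newNode); // [4, 0, -1]
console.log(existingNodeAgain); // [0, 1, 2], unchanged by the new node
```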

### Option 3: Hierarchical UMAP
- Cluster nodes into groups
- Run UMAP on each cluster separately
- Use a higher-level UMAP to arrange clusters
- Complex but scales to millions of nodes

## User Experience

The galaxy "reorganizing" when you add nodes is actually a **feature, not a bug**:

- It shows your thought network evolving
- New connections emerge as you add ideas
- Clusters naturally form around related concepts
- Creates a sense of a living, breathing knowledge graph

Users will see their constellation of thoughts naturally reorganize as their ideas grow - which aligns perfectly with the "Ponderants" brand of exploring and structuring ideas.

## References

- [UMAP Documentation](https://umap-learn.readthedocs.io/)
- [umap-js Library](https://github.com/PAIR-code/umap-js)
- [Understanding UMAP](https://pair-code.github.io/understanding-umap/)
- [How Exactly UMAP Works](https://towardsdatascience.com/how-exactly-umap-works-13e3040e1668)

## Decision Log

- **2025-01-10**: Discovered bug where nodes 4-5 failed to get coordinates
- **2025-01-10**: Analyzed UMAP manifold learning constraints
- **2025-01-10**: Decided to implement full recalculation strategy
- **2025-01-10**: Updated `/app/api/calculate-graph/route.ts` to remove `coords_3d = NONE` filter