Files

Albert 0c4934cf70 fix: Recalculate ALL nodes for UMAP instead of incremental

Fixed critical bug where nodes 4+ wouldn't get 3D coordinates because
UMAP manifold learning requires seeing the complete dataset together.

Root Cause:
- Previous code only calculated coords for nodes WHERE coords_3d = NONE
- When creating nodes 4-5, only those 2 nodes were passed to UMAP
- UMAP requires minimum 3 points to define a manifold
- Result: "Not enough nodes to map (2/3)" error

Why Full Recalculation is Necessary:
- UMAP is a non-linear manifold learning algorithm
- It creates relative coordinates, not absolute positions
- Each UMAP run produces different coordinate systems
- No "fixed origin" exists - positions are only meaningful relative to each other
- Adding new data changes the manifold structure

Changes:
- Updated /app/api/calculate-graph/route.ts:
  * Removed "AND coords_3d = NONE" filter from query
  * Now fetches ALL nodes with embeddings every time
  * Recalculates entire graph when triggered
  * Updated comments and logging to reflect full recalculation

- Created docs/umap-recalculation-strategy.md:
  * Comprehensive explanation of UMAP manifold learning
  * Why incremental calculation doesn't work
  * Trade-offs of full recalculation approach
  * Performance characteristics (<100 nodes: <1.5s)
  * Future optimization strategies for scale

- Added scripts/recalculate-all-coords.ts:
  * Emergency script to manually fix production database
  * Successfully recalculated all 5 nodes in production

UX Impact:
The thought galaxy now "reorganizes" when adding new nodes - existing
nodes will shift slightly. This is actually a feature, showing the
evolving structure of your knowledge graph as it grows.

Performance:
Full recalculation is O(n²) but acceptable for <100 nodes:
- 3 nodes: ~50ms
- 10 nodes: ~200ms
- 50 nodes: ~800ms
- 100 nodes: ~1.5s

For Ponderants MVP, this is perfectly acceptable. Future optimizations
documented if we reach 1000+ nodes per user.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

2025-11-10 01:15:27 +00:00

5.6 KiB

Raw Blame History

UMAP Recalculation Strategy

Problem Statement

When creating the 3D thought galaxy visualization, we need to convert high-dimensional AI embeddings (3072 dimensions from gemini-embedding-001) into 3D coordinates that can be displayed in the browser.

The Challenge

Question: Should we calculate coordinates incrementally (one node at a time) or recalculate ALL nodes together every time?

Initial broken approach:

-- Only calculate for nodes without coordinates
SELECT id, embedding FROM node
WHERE user_did = $userDid
  AND embedding != NONE
  AND coords_3d = NONE

This caused a bug where:

Nodes 1-3: Calculate together → ✓ Get coords
Nodes 4-5: Try to calculate separately → ✗ FAILS (only 2 points, UMAP needs 3+)

Why UMAP Requires Recalculation

What is UMAP?

UMAP (Uniform Manifold Approximation and Projection) is a non-linear manifold learning algorithm. Unlike linear methods (PCA), UMAP:

Learns the "shape" (manifold) of your data - It finds clusters, relationships, and patterns
Creates relative, not absolute coordinates - There's no fixed origin or coordinate system
Requires seeing all data together - The manifold structure changes as you add more data

Why Incremental Doesn't Work

Problem with fixed origin approach:

# Each run produces DIFFERENT coordinates!
Run 1: UMAP([node1, node2, node3]) → coords_A
Run 2: UMAP([node1, node2, node3]) → coords_B  # DIFFERENT!

# There's no absolute coordinate system
Run 1: node1 at [0.5, 0.2, 0.8]
Run 2: node1 at [2.1, -1.3, 0.4]  # Completely different!

The positions are only meaningful relative to each other. You can't have a "fixed origin" because UMAP learns a relative manifold structure.

Why you need 3+ points:

UMAP is a manifold learning algorithm
A manifold requires multiple points to define a shape
With only 1-2 points, there's no "manifold" to learn

What About UMAP.transform()?

UMAP does support an incremental transform() method:

# Fit once, save the model
umap_model = UMAP(n_components=3)
umap_model.fit(initial_embeddings)

# Transform new points into existing space
new_coords = umap_model.transform(new_embedding)

Why we're NOT using this:

Model storage complexity - Must store entire UMAP model (includes all training data) in database
Model drift - New nodes get approximate positions based on old manifold structure
Loss of quality - The manifold changes as you add data; transform() doesn't update it
Performance - For <100 nodes, full recalculation is fast (<1 second)

Our Solution: Full Recalculation

Implementation

-- Recalculate ALL nodes every time
SELECT id, embedding FROM node
WHERE user_did = $userDid
  AND embedding != NONE
-- No "coords_3d = NONE" filter!

Behavior

When you add a new node:

Fetch ALL nodes with embeddings (including those with existing coords)
Run UMAP on the complete dataset
Update ALL nodes with their recalculated positions

Result: The galaxy "reorganizes" when you add new thoughts - existing nodes WILL move slightly.

Trade-offs

Pros: ✅ Always mathematically correct ✅ Simple implementation ✅ No model storage complexity ✅ Best clustering quality (manifold adapts to new data) ✅ Fast enough for <100 nodes

Cons: ❌ Galaxy shifts when adding nodes (existing nodes move) ❌ O(n²) complexity (slower with many nodes) ❌ More database writes

Performance Characteristics

Nodes | Calculation Time | Acceptable?
------|-----------------|------------
3     | ~50ms           | ✅ Excellent
10    | ~200ms          | ✅ Great
50    | ~800ms          | ✅ Good
100   | ~1.5s           | ✅ Acceptable
500   | ~15s            | ⚠️ Slow (consider optimization)
1000+ | ~60s+           | ❌ Too slow (need incremental)

For the Ponderants MVP, we expect users to have <100 nodes, making full recalculation perfectly acceptable.

Future Optimizations

If we reach scale where recalculation becomes too slow:

Option 1: UMAP.transform() with Periodic Refitting

// Store UMAP model in database
// Transform new nodes incrementally
// Every 10 nodes: Refit the entire model
if (newNodeCount % 10 === 0) {
  recalculateAllNodes();
}

Option 2: Switch to PCA

PCA is linear and supports incremental updates
Loses UMAP's superior clustering quality
Use for very large datasets (1000+ nodes)

Option 3: Hierarchical UMAP

Cluster nodes into groups
Run UMAP on each cluster separately
Use a higher-level UMAP to arrange clusters
Complex but scales to millions of nodes

User Experience

The galaxy "reorganizing" when you add nodes is actually a feature, not a bug:

It shows your thought network evolving
New connections emerge as you add ideas
Clusters naturally form around related concepts
Creates a sense of a living, breathing knowledge graph

Users will see their constellation of thoughts naturally reorganize as their ideas grow - which aligns perfectly with the "Ponderants" brand of exploring and structuring ideas.

References

Decision Log

2025-01-10: Discovered bug where nodes 4-5 failed to get coordinates
2025-01-10: Analyzed UMAP manifold learning constraints
2025-01-10: Decided to implement full recalculation strategy
2025-01-10: Updated /app/api/calculate-graph/route.ts to remove coords_3d = NONE filter

5.6 KiB Raw Blame History