feat: Fix grapheme splitting and add automatic UMAP calculation

Critical fixes for core functionality:

1. Fixed grapheme-aware text splitting (app/api/nodes/route.ts)
   - Changed character-based substring to grapheme-ratio calculation
   - Now properly handles emojis and multi-byte characters
   - Prevents posts from exceeding the 300-grapheme Bluesky limit
   - Added comprehensive logging for debugging

2. Automatic UMAP coordinate calculation (app/api/nodes/route.ts)
   - Triggers /api/calculate-graph automatically after node creation
   - Only when user has 3+ nodes with embeddings (UMAP minimum)
   - Non-blocking background process
   - Eliminates need for manual "Calculate Graph" button
   - Galaxy visualization ready on first visit

3. Simplified galaxy route (app/api/galaxy/route.ts)
   - Removed auto-trigger logic (now handled on insertion)
   - Simply returns existing coordinates
   - More efficient, no redundant calculations

4. Added idempotency (app/api/calculate-graph/route.ts)
   - Safe to call multiple times
   - Returns early if all nodes already have coordinates
   - Better logging for debugging

Implementation plans documented in /plans directory.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-09 20:19:20 +00:00
parent 6bd0fe65e2
commit d8a975122f
6 changed files with 346 additions and 39 deletions


@@ -0,0 +1,189 @@
# Plan: Fix Grapheme Computation (Text Splitting)
**Priority:** HIGH - Blocking production node creation
## Current Implementation (Broken)
### Problems Identified
1. **Line 113**: Uses character length instead of grapheme length:
```typescript
testText = testText.substring(0, Math.floor(testText.length * 0.9));
```
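With emojis or other multi-code-unit characters, `length` counts UTF-16 code units rather than graphemes, so shrinking by character count cannot reliably converge on the grapheme limit.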
2. **Variable URL lengths**: The generated link can be 72-112 chars depending on the environment's base URL (the counts below are for the full link, not the base URL alone):
- `http://localhost:3000`: 72 chars
- `https://ponderants.app`: 73 chars
- `https://www.ponderants.com`: 77 chars
- `https://ponderants-dev-preview-abc123.vercel.app`: 99 chars
3. **Pre-calculates limit**: Computes `linkGraphemes` once with the current URL, but does not budget for the worst case (the longest URL)
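To illustrate problem 1, here is a minimal sketch of how code-unit length and grapheme count diverge. It assumes the built-in `Intl.Segmenter` as a stand-in for `RichText.graphemeLength`; `countGraphemes` is a hypothetical helper, not part of the codebase.

```typescript
// Assumption: Intl.Segmenter approximates the grapheme counting that
// RichText.graphemeLength performs. countGraphemes is a hypothetical helper.
const segmenter = new Intl.Segmenter("en", { granularity: "grapheme" });

function countGraphemes(text: string): number {
  return Array.from(segmenter.segment(text)).length;
}

// Two emojis, each stored as a surrogate pair (2 UTF-16 code units).
const sample = "Hello \u{1F44B} World \u{1F30D}";
console.log(sample.length, countGraphemes(sample)); // 17 code units vs 15 graphemes

// A family emoji: three emojis joined by zero-width joiners, one grapheme.
const family = "\u{1F468}\u200D\u{1F469}\u200D\u{1F467}";
console.log(family.length, countGraphemes(family)); // 8 code units vs 1 grapheme

// The broken shrink step cuts by code units and can leave half an emoji behind.
const truncated = family.substring(0, Math.floor(family.length * 0.9));
const lastUnit = truncated.charCodeAt(truncated.length - 1);
const endsWithLoneSurrogate = lastUnit >= 0xd800 && lastUnit <= 0xdbff;
console.log(endsWithLoneSurrogate);
```

Because the cut position is chosen in code units, the truncated string can end in an unpaired surrogate, which is why the shrink step needs to reason in graphemes.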
## Correct Algorithm
### Step 1: Calculate overhead for each post type
```typescript
const detailUrl = `${baseUrl}/galaxy/${encodeURIComponent(nodeId)}`;
const linkSuffix = `\n\nRead more: ${detailUrl}`;
const linkGraphemes = getGraphemeLength(linkSuffix);
// Thread indicator: "(N/Total) " where both N and Total can be 1-99
// Worst case: "(99/99) " = 8 graphemes; 9 leaves one grapheme of slack
const threadIndicatorGraphemes = 9;
// Safety buffer to account for RichText facet detection potentially adding chars
const safetyBuffer = 5;
```
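As a rough check of these overheads, the sketch below measures the link suffix for two base URLs and the worst-case thread indicator. The `getGraphemeLength` defined here is a hypothetical stand-in built on `Intl.Segmenter` (the project's real helper is not shown in this plan), and the node ID is made up for illustration.

```typescript
// Assumption: this getGraphemeLength is a stand-in built on Intl.Segmenter;
// the project's own helper presumably wraps RichText.graphemeLength.
const seg = new Intl.Segmenter("en", { granularity: "grapheme" });

function getGraphemeLength(text: string): number {
  return Array.from(seg.segment(text)).length;
}

// Builds the same suffix as Step 1 and returns its grapheme length.
function linkOverhead(baseUrl: string, nodeId: string): number {
  const detailUrl = `${baseUrl}/galaxy/${encodeURIComponent(nodeId)}`;
  const linkSuffix = `\n\nRead more: ${detailUrl}`;
  return getGraphemeLength(linkSuffix);
}

const nodeId = "node:abc123"; // hypothetical ID, for illustration only

const localOverhead = linkOverhead("http://localhost:3000", nodeId);
const previewOverhead = linkOverhead(
  "https://ponderants-dev-preview-abc123.vercel.app",
  nodeId,
);
const indicatorOverhead = getGraphemeLength("(99/99) ");

// The overhead grows one-for-one with the base URL length.
console.log({ localOverhead, previewOverhead, indicatorOverhead });
```

Measuring the suffix at request time, rather than hard-coding a length, keeps the budget correct on preview deployments with long hostnames.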
### Step 2: Calculate max graphemes for each post type
```typescript
const firstPostMaxGraphemes = 300 - linkGraphemes - safetyBuffer;
const threadPostMaxGraphemes = 300 - threadIndicatorGraphemes - safetyBuffer;
```
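For a sense of scale, the following sketch applies the two formulas to link lengths mentioned earlier in this plan (72, 99, and 112 graphemes). `postBudgets` is a hypothetical helper introduced only for this worked example.

```typescript
const POST_LIMIT = 300;
const THREAD_INDICATOR_GRAPHEMES = 9; // value used in Step 1
const SAFETY_BUFFER = 5; // value used in Step 1

interface PostBudgets {
  firstPostMax: number;
  threadPostMax: number;
}

// Hypothetical helper: applies the Step 2 formulas to a given link length.
function postBudgets(linkGraphemes: number): PostBudgets {
  return {
    firstPostMax: POST_LIMIT - linkGraphemes - SAFETY_BUFFER,
    threadPostMax: POST_LIMIT - THREAD_INDICATOR_GRAPHEMES - SAFETY_BUFFER,
  };
}

for (const linkGraphemes of [72, 99, 112]) {
  const { firstPostMax, threadPostMax } = postBudgets(linkGraphemes);
  console.log(`link=${linkGraphemes} first=${firstPostMax} thread=${threadPostMax}`);
}
```

With these inputs the first post has between 183 and 223 graphemes of body text, while every follow-up post has 286 regardless of the URL.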
### Step 3: Split fullText by GRAPHEME count
```typescript
function splitByGraphemes(text: string, firstMax: number, otherMax: number): string[] {
  // Guard: a non-positive budget (e.g. a very long link) would loop forever
  if (firstMax <= 0 || otherMax <= 0) {
    throw new Error(`Invalid grapheme budgets: first=${firstMax}, other=${otherMax}`);
  }
  const chunks: string[] = [];
  let remainingText = text;
  let isFirst = true;
  while (remainingText.length > 0) {
    const maxGraphemes = isFirst ? firstMax : otherMax;
    const rt = new RichText({ text: remainingText });
    if (rt.graphemeLength <= maxGraphemes) {
      // Rest of text fits in one chunk
      chunks.push(remainingText);
      break;
    }
    // Need to split - find the split point
    let testText = remainingText;
    // Shrink the candidate iteratively until it fits the grapheme budget
    while (getGraphemeLength(testText) > maxGraphemes) {
      // Find last word boundary before current position
      const lastSpace = testText.lastIndexOf(' ');
      if (lastSpace > testText.length * 0.5) {
        // Good word boundary found
        testText = testText.substring(0, lastSpace);
      } else {
        // No good word boundary - shrink by grapheme-aware amount
        // Take (maxGraphemes / currentGraphemes) * currentLength
        // Note: substring() cuts on UTF-16 code units, so this cut can still
        // land inside an emoji or other multi-unit grapheme
        const currentGraphemes = getGraphemeLength(testText);
        const ratio = maxGraphemes / currentGraphemes;
        const newLength = Math.floor(testText.length * ratio * 0.95); // 0.95 for safety
        testText = testText.substring(0, newLength);
      }
    }
    // Guard: stop if shrinking consumed the whole candidate (no progress possible)
    if (testText.length === 0) {
      throw new Error('Unable to find a split point within the grapheme budget');
    }
    chunks.push(testText.trim());
    remainingText = remainingText.substring(testText.length).trim();
    isFirst = false;
  }
  return chunks;
}
```
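The ratio-based shrink in Step 3 still slices with `substring`, which operates on UTF-16 code units and can cut through an emoji. One possible alternative, sketched below, is to segment the text into graphemes first and slice the resulting array. This is not the plan's algorithm: `toGraphemes` and `splitOnGraphemeBoundaries` are hypothetical names, and `Intl.Segmenter` is assumed as a substitute for `RichText`.

```typescript
// Assumption: Intl.Segmenter stands in for RichText-based grapheme counting.
const graphemeSegmenter = new Intl.Segmenter("en", { granularity: "grapheme" });

// Split text into grapheme clusters so slicing never cuts inside an emoji.
function toGraphemes(text: string): string[] {
  return Array.from(graphemeSegmenter.segment(text), (s) => s.segment);
}

// Hypothetical variant of the Step 3 splitter that works on grapheme arrays.
function splitOnGraphemeBoundaries(
  text: string,
  firstMax: number,
  otherMax: number,
): string[] {
  if (firstMax <= 0 || otherMax <= 0) {
    throw new Error("Chunk budgets must be positive");
  }
  const chunks: string[] = [];
  let rest = toGraphemes(text.trim());
  let isFirst = true;

  while (rest.length > 0) {
    const max = isFirst ? firstMax : otherMax;
    if (rest.length <= max) {
      chunks.push(rest.join(""));
      break;
    }
    // Prefer the last space in the second half of the window, else hard-cut.
    const lastSpace = rest.lastIndexOf(" ", max);
    const cut = lastSpace > max * 0.5 ? lastSpace : max;
    chunks.push(rest.slice(0, cut).join("").trim());

    // Skip whitespace at the start of the next chunk.
    let start = cut;
    while (start < rest.length && rest[start].trim() === "") start++;
    rest = rest.slice(start);
    isFirst = false;
  }
  return chunks;
}

const family = "\u{1F468}\u200D\u{1F469}\u200D\u{1F467}";
console.log(splitOnGraphemeBoundaries(family.repeat(10), 4, 4).length);
console.log(splitOnGraphemeBoundaries("aaaa bbbb cccc", 10, 10));
```

Each chunk is at most `max` graphemes by construction, so no shrink loop or safety factor is needed. The trade-off is that the whole text is segmented up front.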
### Step 4: Build posts with proper grapheme validation
```typescript
const chunks = splitByGraphemes(fullText, firstPostMaxGraphemes, threadPostMaxGraphemes);
for (let i = 0; i < chunks.length; i++) {
  const isFirstPost = i === 0;
  let postText = chunks[i];
  // Add thread indicator if needed
  if (chunks.length > 1 && !isFirstPost) {
    postText = `(${i + 1}/${chunks.length}) ${postText}`;
  }
  // Add link to first post
  if (isFirstPost) {
    postText += linkSuffix;
  }
  // Final validation
  const finalGraphemes = getGraphemeLength(postText);
  if (finalGraphemes > 300) {
    console.error(`[POST /api/nodes] Post ${i + 1} exceeds limit: ${finalGraphemes} graphemes`);
    console.error(`[POST /api/nodes] Content: ${postText.substring(0, 100)}...`);
    throw new Error(`Post exceeds 300 grapheme limit: ${finalGraphemes}`);
  }
  // Continue with post creation...
}
```
## Implementation Steps
1. **Extract constants at the top**
- Calculate `linkGraphemes` from actual URL
- Define `threadIndicatorGraphemes = 9` (covers the worst case `(99/99) `, which is 8 graphemes)
- Define `safetyBuffer = 5`
2. **Fix splitIntoChunks function**
- Replace character-based substring with grapheme-aware splitting
- Use RichText.graphemeLength for all length checks
- When shrinking text, calculate ratio based on graphemes, not chars
3. **Add comprehensive logging**
- Log chunk grapheme counts before adding overhead
- Log final post grapheme counts
- Log URL used and its grapheme length
4. **Test edge cases**
- Long Vercel preview URLs (100+ chars)
- Text with emojis and multi-byte characters
- Text that needs 10+ chunks (thread indicators "(10/15)")
- Text exactly at boundaries
## Files to Modify
- `app/api/nodes/route.ts` - Replace `splitIntoChunks()` function
## Test Cases
### Test Case 1: Short text (fits in one post)
**Input:**
- Title: "Test"
- Body: "Short content"
- Expected: 1 post with link
### Test Case 2: Long text (needs splitting)
**Input:**
- Title: "Long Article"
- Body: 500 graphemes of text
- Expected: 2-3 posts, first with link, others with thread indicators
### Test Case 3: Text with emojis
**Input:**
- Title: "🎉 Celebration"
- Body: "Hello 👋 World 🌍" repeated to 400 graphemes
- Expected: Correct grapheme counting (emojis = 1 grapheme each)
### Test Case 4: Vercel preview URL
**Input:**
- NEXT_PUBLIC_APP_URL: `https://ponderants-git-development-abc123.vercel.app`
- Expected: Limit calculation accounts for the ~100-char link
### Test Case 5: Exactly at boundary
**Input:**
- Text that's exactly 300 graphemes including link
- Expected: 1 post, no error
## Validation
After implementation, verify:
1. No posts exceed 300 graphemes
2. Splitting happens at word boundaries when possible
3. All chunks account for thread indicators
4. First post always includes detail URL
5. Works with emoji and multi-byte characters
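Some of these checks could be automated. The sketch below is one possible checker covering points 1, 3, and 4 (grapheme limit, thread indicators, detail URL in the first post); word-boundary quality is not checked. `validatePosts` and `graphemeCount` are hypothetical names, and `Intl.Segmenter` is assumed in place of `RichText`.

```typescript
// Assumption: Intl.Segmenter stands in for RichText.graphemeLength.
const seg = new Intl.Segmenter("en", { granularity: "grapheme" });
const graphemeCount = (text: string): number =>
  Array.from(seg.segment(text)).length;

interface ValidationResult {
  ok: boolean;
  problems: string[];
}

// Hypothetical checker for the final post texts produced in Step 4.
function validatePosts(
  posts: string[],
  detailUrl: string,
  limit = 300,
): ValidationResult {
  const problems: string[] = [];

  posts.forEach((post, i) => {
    // Point 1: no post may exceed the grapheme limit.
    const n = graphemeCount(post);
    if (n > limit) {
      problems.push(`Post ${i + 1} has ${n} graphemes (limit ${limit})`);
    }
    // Point 3: every follow-up post carries its "(N/Total) " indicator.
    if (i > 0 && !post.startsWith(`(${i + 1}/${posts.length}) `)) {
      problems.push(`Post ${i + 1} is missing its thread indicator`);
    }
  });

  // Point 4: the first post always includes the detail URL.
  const first = posts[0] ?? "";
  if (!first.includes(detailUrl)) {
    problems.push("First post does not include the detail URL");
  }

  return { ok: problems.length === 0, problems };
}

const url = "https://example.com/galaxy/n1"; // placeholder URL
console.log(validatePosts([`Intro\n\nRead more: ${url}`, "(2/2) rest"], url));
```

A post made of 300 emojis passes this check even though its `length` is 600, which is the behavior Test Case 3 is meant to confirm.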