feat: Fix grapheme splitting and add automatic UMAP calculation

Critical fixes for core functionality:

1. Fixed grapheme-aware text splitting (app/api/nodes/route.ts)
   - Changed character-based substring to grapheme-ratio calculation
   - Now properly handles emojis and multi-byte characters
   - Prevents posts from exceeding 300 grapheme Bluesky limit
   - Added comprehensive logging for debugging

2. Automatic UMAP coordinate calculation (app/api/nodes/route.ts)
   - Triggers /api/calculate-graph automatically after node creation
   - Only when user has 3+ nodes with embeddings (UMAP minimum)
   - Non-blocking background process
   - Eliminates need for manual "Calculate Graph" button
   - Galaxy visualization ready on first visit

3. Simplified galaxy route (app/api/galaxy/route.ts)
   - Removed auto-trigger logic (now handled on insertion)
   - Simply returns existing coordinates
   - More efficient, no redundant calculations

4. Added idempotency (app/api/calculate-graph/route.ts)
   - Safe to call multiple times
   - Returns early if all nodes already have coordinates
   - Better logging for debugging

Implementation plans documented in /plans directory.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
plans/fix-grapheme-splitting.md
Normal file
@@ -0,0 +1,189 @@
# Plan: Fix Grapheme Computation (Text Splitting)

**Priority:** HIGH - Blocking production node creation

## Current Implementation (Broken)

### Problems Identified

1. **Line 113**: Uses character length instead of grapheme length:
   ```typescript
   testText = testText.substring(0, Math.floor(testText.length * 0.9));
   ```
   With emojis or multi-byte characters, `length` counts UTF-16 code units rather than graphemes, so shrinking by 10% of the character length does not converge reliably on the grapheme limit.

2. **Variable URL lengths**: The detail URL length depends on the environment's base URL, from 72 chars up to roughly 112 for long preview domains:
   - `http://localhost:3000`: 72 chars
   - `https://ponderants.app`: 73 chars
   - `https://www.ponderants.com`: 77 chars
   - `https://ponderants-dev-preview-abc123.vercel.app`: 99 chars

3. **Pre-calculates limit**: Computes `linkGraphemes` once with the current URL, but doesn't account for the worst-case URL length.
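
To make problem 1 concrete, the sketch below compares UTF-16 length with grapheme count for a short emoji string. It uses the standard `Intl.Segmenter` as a stand-in for the route's grapheme helper; `countGraphemes` is a hypothetical name for illustration, not code from the repository.

```typescript
// Hypothetical helper: counts user-perceived characters with the built-in Intl.Segmenter.
// The route itself is assumed to rely on RichText.graphemeLength instead.
function countGraphemes(text: string): number {
  const segmenter = new Intl.Segmenter('en', { granularity: 'grapheme' });
  return Array.from(segmenter.segment(text)).length;
}

const sample = 'Hello 👋 World 🌍';
console.log(sample.length);          // UTF-16 code units: each emoji here takes two
console.log(countGraphemes(sample)); // graphemes: each emoji counts once

// Cutting by code units can land inside a surrogate pair and leave half an emoji behind.
const broken = '👋👋'.substring(0, 3);
console.log(broken.length, broken.charCodeAt(2).toString(16));
```

The two counts differ by one for every emoji in the sample, which is why a shrink step based on `length` drifts away from the 300 grapheme limit.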

## Correct Algorithm

### Step 1: Calculate overhead for each post type

```typescript
const detailUrl = `${baseUrl}/galaxy/${encodeURIComponent(nodeId)}`;
const linkSuffix = `\n\nRead more: ${detailUrl}`;
const linkGraphemes = getGraphemeLength(linkSuffix);

// Thread indicator: "(N/Total) " where both N and Total can be 1-99
// Worst case: "(99/99) " = 8 characters; reserve 9 to leave one spare
const threadIndicatorGraphemes = 9;

// Safety buffer to account for RichText facet detection potentially adding chars
const safetyBuffer = 5;
```

### Step 2: Calculate max graphemes for each post type

```typescript
const firstPostMaxGraphemes = 300 - linkGraphemes - safetyBuffer;
const threadPostMaxGraphemes = 300 - threadIndicatorGraphemes - safetyBuffer;
```
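
As a worked example of the budget arithmetic in Steps 1 and 2, the sketch below plugs the URL lengths listed under "Problems Identified" into the same formulas. `computeBudgets` and the constant names are hypothetical, and the URL is assumed to be plain ASCII so that one character equals one grapheme.

```typescript
// Overheads taken from Steps 1 and 2 of this plan.
const POST_LIMIT = 300;
const SAFETY_BUFFER = 5;
const THREAD_INDICATOR_GRAPHEMES = 9;
const LINK_PREFIX = '\n\nRead more: '; // 13 ASCII characters

// Hypothetical helper: urlLength is the detail URL length (ASCII assumed).
function computeBudgets(urlLength: number): { firstPostMax: number; threadPostMax: number } {
  const linkGraphemes = LINK_PREFIX.length + urlLength;
  return {
    firstPostMax: POST_LIMIT - linkGraphemes - SAFETY_BUFFER,
    threadPostMax: POST_LIMIT - THREAD_INDICATOR_GRAPHEMES - SAFETY_BUFFER,
  };
}

for (const urlLength of [72, 73, 77, 99]) {
  console.log(urlLength, computeBudgets(urlLength));
}
```

Under these assumptions the first post keeps 210 graphemes of body text with the 72-char URL and 183 with the 99-char URL, while follow-up posts always get 286.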

### Step 3: Split fullText by GRAPHEME count

```typescript
import { RichText } from '@atproto/api';

// getGraphemeLength() is the route's existing grapheme-count helper.
function splitByGraphemes(text: string, firstMax: number, otherMax: number): string[] {
  const chunks: string[] = [];
  let remainingText = text;
  let isFirst = true;

  while (remainingText.length > 0) {
    const maxGraphemes = isFirst ? firstMax : otherMax;
    const rt = new RichText({ text: remainingText });

    if (rt.graphemeLength <= maxGraphemes) {
      // Rest of text fits in one chunk
      chunks.push(remainingText);
      break;
    }

    // Need to split - find the split point
    let testText = remainingText;

    // Shrink the candidate step by step until it fits the grapheme limit
    while (getGraphemeLength(testText) > maxGraphemes) {
      // Find last word boundary before current position
      const lastSpace = testText.lastIndexOf(' ');
      if (lastSpace > testText.length * 0.5) {
        // Good word boundary found
        testText = testText.substring(0, lastSpace);
      } else {
        // No good word boundary - shrink by grapheme-aware amount
        // Take (maxGraphemes / currentGraphemes) * currentLength
        const currentGraphemes = getGraphemeLength(testText);
        const ratio = maxGraphemes / currentGraphemes;
        const newLength = Math.floor(testText.length * ratio * 0.95); // 0.95 for safety
        testText = testText.substring(0, newLength);
      }
    }

    if (testText.length === 0) {
      // Guard: an empty chunk would leave remainingText unchanged and loop forever
      throw new Error(`Cannot fit any text within ${maxGraphemes} graphemes`);
    }

    chunks.push(testText.trim());
    remainingText = remainingText.substring(testText.length).trim();
    isFirst = false;
  }

  return chunks;
}
```
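
One alternative to the ratio-based shrink, shown only for comparison and not what this plan implements, is to walk grapheme boundaries directly so a cut can never land inside an emoji. The sketch uses the standard `Intl.Segmenter`; `takeGraphemes` is a hypothetical name, and its counts are assumed (not verified) to agree with `RichText.graphemeLength`, so the final validation in Step 4 would still apply.

```typescript
// Hypothetical alternative: return the longest prefix of `text` holding at most
// `maxGraphemes` graphemes, backing up to the last space when that space sits
// in the second half of the prefix (the same 0.5 heuristic as Step 3).
function takeGraphemes(text: string, maxGraphemes: number): string {
  const segmenter = new Intl.Segmenter('en', { granularity: 'grapheme' });
  let count = 0;
  let end = 0; // UTF-16 index just past the last grapheme that fits

  for (const { segment, index } of segmenter.segment(text)) {
    if (count === maxGraphemes) break;
    end = index + segment.length;
    count++;
  }

  if (end === text.length) return text; // everything fits

  const prefix = text.substring(0, end);
  const lastSpace = prefix.lastIndexOf(' ');
  return lastSpace > prefix.length * 0.5 ? prefix.substring(0, lastSpace) : prefix;
}

console.log(takeGraphemes('👋'.repeat(10), 4)); // cut falls between emojis
console.log(takeGraphemes('aaa bbb ccc', 9));   // backs up to a word boundary
```

The trade-off is one pass over the text per chunk instead of repeated length checks, at the cost of relying on a second grapheme counter alongside `RichText`.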

### Step 4: Build posts with proper grapheme validation

```typescript
const chunks = splitByGraphemes(fullText, firstPostMaxGraphemes, threadPostMaxGraphemes);

for (let i = 0; i < chunks.length; i++) {
  const isFirstPost = i === 0;
  let postText = chunks[i];

  // Add thread indicator if needed
  if (chunks.length > 1 && !isFirstPost) {
    postText = `(${i + 1}/${chunks.length}) ${postText}`;
  }

  // Add link to first post
  if (isFirstPost) {
    postText += linkSuffix;
  }

  // Final validation
  const finalGraphemes = getGraphemeLength(postText);
  if (finalGraphemes > 300) {
    console.error(`[POST /api/nodes] Post ${i + 1} exceeds limit: ${finalGraphemes} graphemes`);
    console.error(`[POST /api/nodes] Content: ${postText.substring(0, 100)}...`);
    throw new Error(`Post exceeds 300 grapheme limit: ${finalGraphemes}`);
  }

  // Continue with post creation...
}
```

## Implementation Steps

1. **Extract constants at the top**
   - Calculate `linkGraphemes` from actual URL
   - Define `threadIndicatorGraphemes = 9` (worst case `(99/99) ` is 8 characters, plus one spare)
   - Define `safetyBuffer = 5`

2. **Fix splitIntoChunks function**
   - Replace character-based substring with grapheme-aware splitting
   - Use RichText.graphemeLength for all length checks
   - When shrinking text, calculate ratio based on graphemes, not chars

3. **Add comprehensive logging**
   - Log chunk grapheme counts before adding overhead
   - Log final post grapheme counts
   - Log URL used and its grapheme length

4. **Test edge cases**
   - Long Vercel preview URLs (100+ chars)
   - Text with emojis and multi-byte characters
   - Text that needs 10+ chunks (thread indicators "(10/15)")
   - Text exactly at boundaries
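
For the "10+ chunks" edge case, a quick check of the thread indicator length against the reserved 9 graphemes can be sketched as follows. `threadIndicator` and `fitsIndicatorBudget` are hypothetical names; the format string matches the one used in Step 4.

```typescript
const THREAD_INDICATOR_BUDGET = 9; // value of threadIndicatorGraphemes in this plan

// Builds the "(N/Total) " prefix that Step 4 puts on follow-up posts.
function threadIndicator(position: number, total: number): string {
  return `(${position}/${total}) `;
}

// True when the indicator fits in the reserved budget (ASCII, so length = graphemes).
function fitsIndicatorBudget(position: number, total: number): boolean {
  return threadIndicator(position, total).length <= THREAD_INDICATOR_BUDGET;
}

console.log(threadIndicator(10, 15).length, fitsIndicatorBudget(10, 15));
console.log(threadIndicator(99, 99).length, fitsIndicatorBudget(99, 99));
console.log(threadIndicator(100, 100).length, fitsIndicatorBudget(100, 100));
```

Indicators up to `(99/99) ` fit in the budget, while a thread of 100 or more posts would need 10 graphemes and start eating into the safety buffer.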

## Files to Modify

- `app/api/nodes/route.ts` - Replace `splitIntoChunks()` function

## Test Cases

### Test Case 1: Short text (fits in one post)
**Input:**
- Title: "Test"
- Body: "Short content"
- Expected: 1 post with link

### Test Case 2: Long text (needs splitting)
**Input:**
- Title: "Long Article"
- Body: 500 graphemes of text
- Expected: 2-3 posts, first with link, others with thread indicators

### Test Case 3: Text with emojis
**Input:**
- Title: "🎉 Celebration"
- Body: "Hello 👋 World 🌍" repeated to 400 graphemes
- Expected: Correct grapheme counting (emojis = 1 grapheme each)

### Test Case 4: Vercel preview URL
**Input:**
- NEXT_PUBLIC_APP_URL: `https://ponderants-git-development-abc123.vercel.app`
- Expected: URL accounts for ~100 char length

### Test Case 5: Exactly at boundary
**Input:**
- Text that's exactly 300 graphemes including link
- Expected: 1 post, no error

## Validation

After implementation, verify:

1. No posts exceed 300 graphemes
2. Splitting happens at word boundaries when possible
3. All chunks account for thread indicators
4. First post always includes detail URL
5. Works with emoji and multi-byte characters
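
Several of these checks can be automated. The sketch below is one possible shape for such a check, not the project's test suite: `validatePosts` and `countGraphemes` are hypothetical names, `Intl.Segmenter` stands in for `RichText.graphemeLength`, and the example.com URL is a placeholder.

```typescript
// Hypothetical grapheme counter based on the built-in Intl.Segmenter.
function countGraphemes(text: string): number {
  const segmenter = new Intl.Segmenter('en', { granularity: 'grapheme' });
  return Array.from(segmenter.segment(text)).length;
}

// Returns a list of human-readable problems; an empty list means the thread passes.
function validatePosts(posts: string[], detailUrl: string): string[] {
  const problems: string[] = [];

  posts.forEach((post, i) => {
    const graphemes = countGraphemes(post);
    if (graphemes > 300) {
      problems.push(`Post ${i + 1} has ${graphemes} graphemes`);
    }
    // Follow-up posts should start with a "(N/Total) " indicator.
    if (i > 0 && !/^\(\d+\/\d+\) /.test(post)) {
      problems.push(`Post ${i + 1} is missing its thread indicator`);
    }
  });

  if (posts.length > 0 && !posts[0].includes(detailUrl)) {
    problems.push('First post is missing the detail URL');
  }
  return problems;
}

const url = 'https://example.com/galaxy/1'; // placeholder URL
console.log(validatePosts([`Hello 👋\n\nRead more: ${url}`, '(2/2) World 🌍'], url));
console.log(validatePosts(['🌍'.repeat(301)], url));
```

Word-boundary splitting (item 2) is not covered here, since checking it needs the original text as well as the posts.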