← Back to Blog

Building a Real-Time Collaborative Editor with Operational Transform in Node.js

Try the Collaborative Editor →
Share

The Convergence Problem

Two users are editing the same document simultaneously. User A types "Hello" at position 0. User B types "World" at position 0. Both users started from the same empty document. If we apply both operations naively in the order they arrive, the result depends on network timing, and neither user sees what they intended.

Why Naive Application Fails
Initial document: ""  (empty)

User A: Insert("Hello", pos=0)    User B: Insert("World", pos=0)
         |                                  |
         v                                  v
A's local doc: "Hello"            B's local doc: "World"

Now A receives B's op: Insert("World", pos=0)
A applies it:  "World" + "Hello" = "WorldHello"

Now B receives A's op: Insert("Hello", pos=0)
B applies it:  "Hello" + "World" = "HelloWorld"

  A sees: "WorldHello"
  B sees: "HelloWorld"

Documents have DIVERGED. This is permanent.
Every subsequent edit makes divergence worse.

This is the convergence problem, and it is the fundamental challenge of collaborative editing. The solution that Google Docs, Figma (for text), and most production collaborative editors use is Operational Transform (OT). I implemented it from scratch to understand exactly how it works.

Operational Transform: The Core Idea

OT resolves concurrent edits through a mathematical transformation. Given two operations A and B that were created concurrently (both based on the same document state), the transform function produces modified versions A' and B' such that applying A then B' produces the same document as applying B then A'. This is called the transformation property, or TP1.

The Transformation Property (TP1)
         Document State S
           /         \
          /           \
    Apply A           Apply B
        /               \
       v                 v
   State S*A         State S*B
       \                 /
        \               /
    Apply B'       Apply A'
          \           /
           v         v
      State S*A*B' = State S*B*A'

  transform(A, B) = (A', B')

  Guarantee: S*A*B' === S*B*A'
  (Same final document regardless of application order)

The operation model for a plain-text editor needs only two operation types: Insert(text, position) and Delete(position, count). Every keystroke, paste, cut, and backspace maps to one of these two primitives. The transform function must handle all four combinations: Insert vs Insert, Insert vs Delete, Delete vs Insert, and Delete vs Delete.

The Transform Function: All Four Cases

Each case requires careful positional arithmetic. The core principle: when one operation shifts characters in the document, the other operation's position must be adjusted to account for that shift. Let us walk through each case.

Case 1: Insert vs Insert

When both operations insert text, their positions may need adjustment. If A inserts before B's position, B must shift right by A's text length (since A's text now occupies space before B's target). If B inserts before A, the reverse happens. When positions are equal, a deterministic tiebreaker is needed. This implementation uses server priority: the operation from the server wins the tie and keeps its position.

Insert vs Insert: Position Adjustment
Document: "The cat"
           0123456

A: Insert("big ", pos=4)     B: Insert("fat ", pos=4)

Without transform (naive):
  Apply A: "The big cat"    Apply B: "The fat cat"
  Apply B at pos=4:          Apply A at pos=4:
  "The fat big cat"          "The big fat cat"   <-- DIVERGED

With transform(A, B):
  A inserts at 4, B inserts at 4 (tie!)
  A wins tie (server priority).
  A' = Insert("big ", pos=4)  (unchanged)
  B' = Insert("fat ", pos=8)  (shifted right by len("big ")=4)

  Apply A then B':  "The big cat" -> "The big fat cat"
  Apply B then A':  "The fat cat" -> "The big fat cat"   CONVERGED

Case 2: Insert vs Delete

When one operation inserts and the other deletes, three sub-cases arise based on the relative positions of the insert point and the delete range.

Insert vs Delete: Three Sub-cases
Document: "ABCDEFGH"
           01234567

Sub-case 1: Insert BEFORE delete range
  A: Insert("X", pos=1)    B: Delete(pos=4, count=3) [DEF]
  A' = unchanged            B' = Delete(pos=5, count=3)
  (B shifts right by 1 because A added a character before it)

Sub-case 2: Insert AFTER delete range
  A: Insert("X", pos=7)    B: Delete(pos=2, count=3) [CDE]
  A' = Insert("X", pos=4)  B' = unchanged
  (A shifts left by 3 because B removed 3 characters before it)

Sub-case 3: Insert WITHIN delete range
  A: Insert("X", pos=5)    B: Delete(pos=3, count=4) [DEFG]
  A' = Insert("X", pos=3)  B' = Delete(pos=3, count=4+1)
  (Insert moves to delete start; delete expands to cover
   the new character too, or the insert escapes the range)

Sub-case 3 is the trickiest and has multiple valid resolutions. This implementation moves the insert to the delete's start position and adjusts the delete count. The rationale: the user intended to insert at a specific location that no longer exists, so placing the text at the nearest surviving position preserves intent best.

Case 3: Delete vs Delete

When both operations delete text, their ranges may overlap. The key challenge is computing the intersection of two ranges and adjusting both operations to avoid double-deleting the overlapping region.

Delete vs Delete: Overlapping Ranges
Document: "ABCDEFGHIJ"
           0123456789

A: Delete(pos=2, count=4) [CDEF]     B: Delete(pos=4, count=4) [EFGH]

  Ranges:  A = [2, 6)     B = [4, 8)
  Overlap: [4, 6)  (2 chars: E, F)

  After transform:
  A' = Delete(pos=2, count=2)   [CD only, EF handled by B]
  B' = Delete(pos=2, count=2)   [GH only, shifted left by 2]

  Apply A then B':  "ABCDEFGHIJ" -> "ABGHIJ" -> "ABIJ"
  Apply B then A':  "ABCDEFGHIJ" -> "ABCDIJ" -> "ABIJ"  CONVERGED

  No overlap case:
  A: Delete(pos=1, count=2) [BC]    B: Delete(pos=6, count=2) [GH]
  A' = unchanged                     B' = Delete(pos=4, count=2)
  (B shifts left by 2 because A removed chars before it)

When deletes completely overlap (one range is a subset of the other), the smaller delete becomes a no-op with count=0. This is correct because the characters it targeted have already been removed by the larger delete.

Client-Side State Machine

The client cannot simply send every keystroke to the server and wait for acknowledgment before allowing the next edit. That would make typing feel laggy. Instead, the client uses a two-state machine that allows optimistic local edits while maintaining synchronization with the server.

Client OT State Machine
+------------------+         User edits         +------------------+
|                  | --------------------------> |                  |
|   SYNCHRONIZED   |   Send op immediately       |   AWAITING_ACK   |
|                  |   inflightOp = op           |                  |
|  inflightOp=null | <-------------------------- |  inflightOp=op   |
|  buffer=[]       |   Receive own ack           |  buffer=[...]    |
|                  |   Flush buffer              |                  |
+------------------+                             +------------------+
        |                                               |     |
        |    Remote op arrives                           |     |
        |    Transform against nothing                   |     |
        |    Apply to doc                                |     |
        v                                               |     v
  [Apply remote]                              User edits while
                                              waiting for ack:
                                              Buffer the new op
                                              Apply locally (optimistic)
                                                      |
                                              Remote op arrives:
                                              1. Transform inflight vs remote
                                              2. Transform each buffer op vs remote
                                              3. Apply transformed remote to doc
                                              4. Adjust cursor positions

In the SYNCHRONIZED state, the client has no pending operations. When the user types, the operation is sent immediately to the server, and the client transitions to AWAITING_ACK. In this state, any new user edits are buffered locally (applied optimistically to the textarea) but not sent yet.

When the server broadcasts back the client's own operation (identified by matching userId), the client treats it as an acknowledgment. It flushes the buffer: if buffer is non-empty, send the first buffered op and keep the rest buffered. If buffer is empty, transition back to SYNCHRONIZED.

The critical scenario is when a remote operation arrives while the client is in AWAITING_ACK. The remote op was concurrent with the inflight op, so it must be transformed. First, transform the inflight op against the remote op. Then transform every buffered op against the (progressively transformed) remote op. Finally, apply the fully transformed remote op to the local document. This chain of transforms ensures the client's local state converges with the server.

Server Authority Model

The server maintains the canonical document state and assigns monotonically increasing revision numbers. Every client operation includes the revision it was based on. The server transforms the incoming operation against all operations that have occurred since that revision, applies the transformed result, increments the revision, and broadcasts to all connected clients (including the sender).

Server-Side Operation Processing
Server State:
  document = "Hello World"
  revision = 5
  history = [
    { rev: 1, op: Insert("H",0),      userId: A },
    { rev: 2, op: Insert("ello",1),    userId: A },
    { rev: 3, op: Insert(" ",5),       userId: B },
    { rev: 4, op: Insert("Wor",6),     userId: B },
    { rev: 5, op: Insert("ld",9),      userId: B },
  ]

Client A sends: { op: Delete(pos=5, count=1), revision: 3 }
  (A thinks document is "Hello" at rev 3, deleting the space)

Server processing:
  1. Find ops since rev 3: [rev:4, rev:5]
  2. Transform A's delete against rev:4 (Insert("Wor",6)):
     Delete at 5, Insert at 6 --> no shift needed
     Result: Delete(pos=5, count=1)
  3. Transform against rev:5 (Insert("ld",9)):
     Delete at 5, Insert at 9 --> no shift needed
     Result: Delete(pos=5, count=1)
  4. Apply to document: "Hello World" -> "HelloWorld"
  5. revision = 6
  6. Broadcast: { op: Delete(5,1), rev: 6, userId: A }

The history buffer has a configurable maximum length (1,000 operations). If a client's revision is older than the oldest entry in the history, the server cannot transform its operation and responds with an error, requiring the client to rejoin and receive a fresh document snapshot. This prevents unbounded memory growth from the history log.

WebSocket Protocol Design

The implementation uses the raw ws library instead of Socket.IO. This is intentional: ws gives direct control over the WebSocket protocol, message framing, and connection lifecycle without abstraction overhead. The protocol is a simple JSON message format with a type discriminator.

WebSocket Message Protocol
CLIENT -> SERVER                    SERVER -> CLIENT
+-------------------------------+   +-------------------------------+
| join                          |   | joined                        |
|   roomId: string              |   |   roomId: string              |
|   userName: string            |   |   userId: string (assigned)   |
+-------------------------------+   |   document: string (full)     |
                                    |   revision: number            |
+-------------------------------+   |   users: UserPresence[]       |
| operation                     |   +-------------------------------+
|   roomId: string              |   | operation                     |
|   op: Insert | Delete         |   |   op: Insert | Delete         |
|   revision: number            |   |   revision: number            |
+-------------------------------+   |   userId: string              |
                                    +-------------------------------+
+-------------------------------+   | cursor                        |
| cursor                        |   |   userId: string              |
|   roomId: string              |   |   position: {line,col,offset} |
|   position: {line,col,offset} |   |   color: string               |
+-------------------------------+   +-------------------------------+
                                    | user_joined                   |
+-------------------------------+   |   user: UserPresence          |
| leave                         |   +-------------------------------+
|   roomId: string              |   | user_left                     |
+-------------------------------+   |   userId: string              |
                                    +-------------------------------+
                                    | error                         |
                                    |   message: string             |
                                    +-------------------------------+

The join message triggers room creation or joining. The server responds with a joined message containing the full document content, the current revision number, and the list of all connected users with their cursor positions and colors. This gives the new client everything it needs to initialize its local state.

Operation messages from the client include the revision number the operation was based on. This is the critical piece that allows the server to determine which history entries to transform against. The server broadcasts operations to all clients including the sender, which uses the userId field to distinguish between acknowledgments (own userId) and remote operations.

Complete Message Flow: Two Users Editing

End-to-End Editing Sequence
User A                     Server (rev=0)                 User B
  |                           |                              |
  |--- join(room, "Alice") -->|                              |
  |<-- joined(doc="",rev=0) --|                              |
  |                           |<-- join(room, "Bob") --------|  
  |                           |--- joined(doc="",rev=0) ---->|
  |<-- user_joined(Bob) -----|                              |
  |                           |                              |
  | Types "Hi"                |                              |
  | Local: "Hi"               |                              |
  |--- op(Insert("Hi",0),r=0)->|                              |
  |                           | Transform against [] (empty) |
  |                           | doc = "Hi", rev = 1          |
  |                           |--- op(Insert,r=1,uA) ------->|
  |<-- op(Insert,r=1,uA) ----|                              |
  | userId=A = self -> ACK    |              B applies remote|
  | rev = 1                   |              doc = "Hi"      |
  |                           |              rev = 1         |
  |                           |                              |
  |                           |          B types " there" at 2|
  |                           |          Local: "Hi there"   |
  |                           |<-- op(Insert(" there",2),r=1)|
  |                           | Transform against [] (empty) |
  |                           | doc = "Hi there", rev = 2    |
  |<-- op(Insert,r=2,uB) ----|--- op(Insert,r=2,uB) ------->|
  | Remote from B             |              userId=B = ACK  |
  | Transform against nothing |              rev = 2         |
  | doc = "Hi there"          |                              |
  | rev = 2                   |                              |

Conflict Resolution: Concurrent Operations

The interesting case happens when both users edit simultaneously, meaning both send operations based on the same revision before receiving the other's update.

Concurrent Edit Resolution
User A                     Server (rev=2, doc="Hi there")   User B
  |                           |                              |
  | Types "!"                 |                Types "?"     |
  | Insert("!",8), rev=2      |     Insert("?",8), rev=2     |
  | Local: "Hi there!"        |     Local: "Hi there?"       |
  |                           |                              |
  |--- op(Insert("!",8),r=2)->|                              |
  |                           |<-- op(Insert("?",8),r=2) ---|  
  |                           |                              |
  | Server processes A first: |                              |
  | Transform against []: no change                          |
  | doc = "Hi there!", rev=3  |                              |
  | Broadcast: Insert("!",8), rev=3, userId=A                |
  |                           |                              |
  | Server processes B:       |                              |
  | B says rev=2, but server is at 3                         |
  | History since rev 2: [Insert("!",8) from A]              |
  | Transform B against A's op:                              |
  |   Insert("?",8) vs Insert("!",8)                         |
  |   Same position! A wins (server priority)                |
  |   B' = Insert("?", 8+1) = Insert("?",9)                 |
  | doc = "Hi there!?", rev=4 |                              |
  | Broadcast: Insert("?",9), rev=4, userId=B                |
  |                           |                              |
  |<-- op(Insert(!),r=3,uA) -|--- op(Insert(!),r=3,uA) --->|
  | Own ACK, rev=3            |   B receives remote from A   |
  |                           |   B has inflight: Insert(?,8)|
  |                           |   Transform inflight vs remote|
  |                           |   Insert(?,8) vs Insert(!,8) |
  |                           |   A wins -> B' = Insert(?,9) |
  |                           |   Apply remote: "Hi there?"  |
  |                           |   -> "Hi there!?" (insert ! at 8)
  |                           |                              |
  |<-- op(Insert(?),r=4,uB) -|--- op(Insert(?),r=4,uB) --->|
  | Remote from B             |   Own ACK, rev=4             |
  | Apply Insert(?,9)         |   Flush buffer (empty)       |
  | doc = "Hi there!?"        |   doc = "Hi there!?"         |
  |                           |                              |
  | BOTH SEE: "Hi there!?"   |   CONVERGED                  |

Notice how the server and both clients arrive at the same document "Hi there!?" despite the operations being concurrent. The transform function and revision-based history make this deterministic. The tiebreaker (server priority when positions are equal) ensures there is exactly one valid outcome.

Cursor Presence System

Real-time cursor presence gives users awareness of where others are editing. Each user is assigned a color from a predefined palette when they join a room. Cursor positions are sent as WebSocket messages containing the line, column, and character offset within the document.

Cursor Position Tracking and Rendering
Step 1: Capture cursor from textarea
  textarea.selectionStart = 15  (character offset)
  Split text before offset into lines:
  "Hello World\nFoo" -> line=1, column=3

Step 2: Throttle at 100ms
  if (now - lastSend < 100ms) return;
  Send: { type: 'cursor', roomId, position: {line:1, col:3, offset:15} }

Step 3: Server broadcasts to other clients
  { type: 'cursor', userId: 'abc', position: {...}, color: '#ff6b6b' }

Step 4: Adjust cursors on operations
  Remote op: Insert("XY", pos=10)
  Cursor at offset 15:
    10 < 15, so shift: 15 + 2 = 17
  Cursor at offset 8:
    10 > 8, no shift: stays at 8

Step 5: Render overlay using mirror div
  +---------------------------------+
  | Hello World                     |
  | Fo|o bar baz                    |  <-- Alice's cursor (teal)
  |   ^                             |
  |   Alice                         |  <-- Label with color
  | Some more text here|            |  <-- Bob's cursor (pink)
  |                    ^            |
  |                    Bob          |
  +---------------------------------+

  Mirror div technique:
  1. Create invisible div with same font/size as textarea
  2. Copy text up to cursor position
  3. Add <span> marker at cursor
  4. Get marker's bounding rect -> pixel coordinates
  5. Position colored cursor overlay at those coordinates

Cursor positions are ephemeral and not persisted. They exist only in the PresenceService's in-memory Map. When a user disconnects, their cursor is removed from all other clients via a user_left message. The 100ms throttle on cursor updates is important: without it, cursor sync generates more WebSocket traffic than actual document edits, especially during rapid typing or mouse selection.

Room Lifecycle and Persistence

The RoomManager handles room creation, joining, leaving, and cleanup. Rooms are created on demand when the first user joins. Each room has its own DocumentState instance that tracks the document content, revision counter, and operation history.

Room Lifecycle State Machine
                   User creates room
                        |
                        v
              +-------------------+
              |     ACTIVE        |
              |                   |
              |  document: ""     |
              |  revision: 0     |
              |  users: [Alice]  |
              +-------------------+
                |             |
     User joins |             | All users leave
     (any time) |             |
                v             v
              +-------------------+
              |  ACTIVE (multi)   |            +-------------------+
              |  users: [A, B, C] | ---------> |  CLEANUP_PENDING  |
              +-------------------+  last user |                   |
                |                    leaves    |  30-min timer     |
                |                              |  starts           |
                |  Every 60 seconds:           +-------------------+
                |  Snapshot to JSON file              |         |
                |  (collab-snapshots/<roomId>.json)    |         |
                |                               User   |  Timer  |
                |                               rejoins|  expires|
                |                                  |         |
                |                                  v         v
              +-------------------+        Back to    +----------+
              |  SNAPSHOT FILE    |        ACTIVE     | DELETED  |
              |                   | <------+          |          |
              |  Restored on      |                   | Snapshot |
              |  server restart   |                   | removed  |
              |  or rejoin        |                   | Room GC'd|
              +-------------------+                   +----------+

The snapshot system serializes the document state every 60 seconds to a JSON file in the collab-snapshots directory. Each snapshot includes the room ID, name, current document content, revision number, the last 100 operations from history, and a timestamp. When the server restarts or a user rejoins an empty room, the system checks for a snapshot and restores from it.

Room cleanup uses a delayed garbage collection strategy. When the last user leaves, a 30-minute timer starts. If no one rejoins within that window, the room is deleted from memory and its snapshot is removed. If someone does rejoin, the timer is cancelled and the room resumes from its last state. This prevents rooms from accumulating indefinitely while giving users a grace period to reconnect after network issues.

Server-Side Architecture

Backend Component Relationships
WebSocket Connection (ws://localhost:3001/ws/collab)
         |
         v
+---------------------+
| WebSocketServer     |  Manages connections and message routing
|                     |
|  clients:           |  Map<WebSocket, { userId, roomId }>
|  roomClients:       |  Map<roomId, Set<WebSocket>>
|                     |
|  handleJoin()    ---+----> RoomManager.joinRoom()
|  handleOperation()--+----> RoomManager.handleOperation()
|  handleCursor()  ---+----> broadcasts directly
|  handleLeave()   ---+----> RoomManager.leaveRoom()
+---------------------+
         |
         v
+---------------------+
| RoomManager         |  Manages room lifecycle
|                     |
|  documents:         |  Map<roomId, DocumentState>
|  presence:          |  PresenceService (user tracking)
|  cleanupTimers:     |  Map<roomId, timeout>
|                     |
|  joinRoom()      ---+----> DocumentState (create or load)
|  handleOperation()--+----> DocumentState.applyOperation()
|  leaveRoom()     ---+----> scheduleCleanup()
+---------------------+
         |
         +----------+-------------------+
         v          v                   v
+-------------+ +------------------+ +------------------+
| Document    | | OTEngine         | | PresenceService  |
| State       | | (static methods) | |                  |
|             | |                  | |  rooms:          |
| content     | | transform(a, b)  | |  Map<roomId,     |
| revision    | | apply(doc, op)   | |    Map<userId,   |
| history[]   | |                  | |      Presence>>  |
| snapshot()  | | Handles all 4    | |                  |
| restore()   | | transform cases  | |  assignColor()   |
+-------------+ +------------------+ +------------------+

The architecture separates concerns cleanly. WebSocketServer handles transport (connection management, message parsing, broadcasting). RoomManager handles business logic (room lifecycle, operation routing). DocumentState handles document-level concerns (content, history, transforms). OTEngine is a pure function module with no state. PresenceService tracks ephemeral user data separately from document state.

Detecting Local Changes: Diff Algorithm

When the user types in the textarea, the browser fires an input event with the new value. The client needs to figure out what operation (insert or delete) maps to the change. This is done by comparing the previous value (stored in a ref) with the new value.

Simple Diff: Old Value vs New Value
Previous: "Hello World"
             01234567890
New:      "Hello Brave World"
             012345678901234567

Cursor position (selectionStart): 11

Algorithm:
  1. Find common prefix: "Hello " (length 6)
  2. Inserted text = new.slice(6, 6 + (newLen - oldLen))
                   = new.slice(6, 6 + 6)
                   = "Brave "
  3. Generate: Insert("Brave ", position=6)

---

Previous: "Hello World"
New:      "Hello ld"

Algorithm:
  1. Compare lengths: old=11, new=8, diff=3
  2. Cursor position: 6
  3. Deleted from position 6, count 3
  4. Generate: Delete(position=6, count=3)

This simple diff works because textarea edits are always
contiguous: one insert or one delete per input event.
No need for a full diff algorithm like Myers.

This approach works because HTML textarea input events always represent a single contiguous change: one insertion or one deletion. There is no case where a single input event produces multiple disjoint changes. This means a full diff algorithm (like Myers) is unnecessary. Simple length comparison and cursor position give us the exact operation.

Implementation Guide: Build Order

If you want to implement this yourself, here is the build order that lets you test each piece in isolation before integrating.

Phase 1: OT Engine (Pure Functions)

Start with the transform function as a pure module. Define two operation types: Insert(text, position) and Delete(position, count). Implement all four transform cases (Insert-Insert, Insert-Delete, Delete-Insert, Delete-Delete). Write an apply(document, operation) function that executes an operation on a string. Write extensive unit tests: test every sub-case, test commutativity (transforming A against B, then B against A should give convergent results), test with empty strings, test with overlapping deletes.

Phase 2: DocumentState

Build the server-side document model. It holds a string (document content), a revision counter, and an operation history array. The core method is applyOperation(op, userId, clientRevision): filter history for entries after clientRevision, transform the incoming op against each one sequentially, apply the result, increment revision, append to history. Test with simulated concurrent operations.

Phase 3: WebSocket Server

Set up a ws WebSocket server attached to your HTTP server at a specific path (/ws/collab). Track connections in a Map<WebSocket, ClientConnection>. Implement four message handlers: join (create/join room, send full state), operation (route to DocumentState, broadcast result), cursor (broadcast to room), leave (cleanup). Test with two browser tabs.

Phase 4: Client State Machine

Build the React textarea component with the two-state machine. Track inflightOp and buffer as refs (not state, to avoid re-render cycles). On user input, diff old vs new value, generate an operation, and either send immediately (synchronized) or buffer (awaiting ack). On receiving a message, check userId: if it is yours, treat as ack and flush buffer. If it is someone else's, transform inflight and buffer against the remote op, then apply the transformed remote to the textarea value.

Phase 5: Presence and Polish

Add cursor tracking with 100ms throttle. Implement the mirror div technique for rendering remote cursors at pixel positions. Add the room lobby UI for creating and joining rooms. Add snapshot persistence with a 60-second interval. Add room cleanup timers. These are all additive features that layer on top of the core OT system.

Edge Cases That Will Break Your Implementation

These are the failure modes I discovered during implementation. They are not obvious from reading about OT theory.

  • Stale client revision: If a client's revision is older than the oldest history entry, the server cannot transform its operation. Detect this and send an error forcing the client to rejoin with a fresh document. Without this check, you get silent data corruption.
  • Rapid typing with slow network: The buffer can grow large if the server is slow to acknowledge. Each buffered operation must be transformed against every incoming remote operation. With N buffered ops and M remote ops, you get N*M transforms. Cap the buffer or batch rapid keystrokes into single operations.
  • Delete past end of document: A delete operation with position + count exceeding the document length will corrupt the string. Clamp the count to Math.min(count, document.length - position) in the apply function.
  • Empty operations: Transforms can produce Delete operations with count=0. These are no-ops and should be filtered out before broadcasting. Broadcasting no-ops wastes bandwidth and can cause spurious re-renders.
  • WebSocket reconnection: If a client reconnects, it gets a fresh document via join. But if it had buffered operations, those are lost. The user sees their recent edits disappear. Solution: on reconnect, re-apply any buffered ops against the fresh document. This implementation does not solve this, it is acknowledged as a limitation.

OT vs CRDT: Why I Chose OT

CRDTs (Conflict-free Replicated Data Types) are the newer approach to collaborative editing, used by Yjs, Automerge, and Apple's collaboration features. CRDTs encode conflict resolution into the data structure itself, eliminating the need for a central server to sequence operations.

OT vs CRDT Architecture
Operational Transform (OT)          CRDT
+---------------------------+       +---------------------------+
|  Client <---> Server      |       |  Client <---> Client     |
|                           |       |                           |
|  Server is authority      |       |  No central authority    |
|  Sequential revision #s   |       |  Unique IDs per character|
|  Transform concurrent ops |       |  Merge without transform|
|  History log on server    |       |  State embedded in data  |
+---------------------------+       +---------------------------+

  OT Pros:                          CRDT Pros:
  + Simpler mental model            + True peer-to-peer
  + Smaller wire format             + Works offline natively
  + Less metadata per character     + No server bottleneck
  + Proven at scale (Google Docs)   + Mathematically convergent

  OT Cons:                          CRDT Cons:
  - Requires central server         - Larger document size
  - Server is bottleneck            - Tombstone garbage collection
  - Transform complexity grows      - Complex data structures
  - Offline support is hard         - Hard to implement undo

I chose OT because the goal was understanding the algorithm, and OT's transform function is more approachable than CRDT's character-level unique IDs and tree structures. For a production system with offline support or peer-to-peer requirements, CRDTs (via Yjs or Automerge) would be the better choice. For a server-centric architecture where all clients are online, OT remains simpler and more efficient.

Production Gaps

  • Undo/Redo: Requires maintaining a per-user operation stack and transforming undo operations against concurrent edits from other users. This is one of the hardest problems in collaborative editing.
  • Rich text: This implementation handles plain text only. Rich text requires a delta-based operation format (like Quill's Delta) where operations carry formatting attributes alongside positional information.
  • Operation compression: Rapid typing generates one operation per keystroke. Batching consecutive inserts at sequential positions into a single operation reduces bandwidth and history size.
  • Persistent history: For version history (like Google Docs' revision history), the operation log needs to be persisted to a database with pagination support.
  • Distributed server: The single-process server is a bottleneck. Scaling requires partitioning rooms across server instances with a message broker (Redis Pub/Sub) for cross-instance broadcasting.

Takeaways

Operational Transform is deceptively simple at the conceptual level. Two operations, one transform function, revision numbers. But the implementation details are where the difficulty lies: handling all four transform cases correctly, managing the client state machine, adjusting cursor positions through transforms, and recovering from stale clients. One bug in the transform function causes permanent document divergence with no recovery path except forcing all clients to rejoin.

The biggest lesson: collaborative editing is not a WebSocket problem, it is a distributed systems problem. The WebSocket is just the transport. The real challenges are consistency (convergence guarantee), availability (optimistic local edits), and partition tolerance (handling disconnects). Building it from scratch makes you viscerally understand why production systems invest so heavily in testing OT/CRDT implementations.

Get more posts like this

I write about system design, backend engineering, and building npm packages from scratch. Follow along on Substack for new posts.

Subscribe on Substack →