How Multi-site Event Distribution Works

In a multi-site configuration, GemFire XD distributes table DML operations between remote distributed systems, with a minimum impact on each system's performance. DML events are distributed only for the tables that you configure for replication by using a gateway receiver.

The DML event contains the originating site's distributed system ID, and the ID of each site to which it sends the event. This information ensures that no site re-forwards the event to a site that has already received the event. When a GemFire XD site receives an event for a table DML operation, it forwards the event to the other sites it knows about (sites that correspond to configured gateway senders for the table), but it excludes those sites that it knows have already received the event for the table.

Note: A DML event does not record the ID of the sites that receive and apply the DML event. This means that unsupported WAN topologies can result in more than one copy of an event being applied to a single site. This can lead to duplicate data or failed DML execution in the table, and must be avoided. See the discussion of supported and unsupported topologies under Multi-site Topologies.

How a Gateway Processes Its Queue

Each primary gateway sender contains a processor thread that reads messages from its queue, batches them, and distributes them to the connected site. To process the queue, the thread takes the following actions:
  1. Reads messages from the queue.
  2. Creates a batch of the messages.
  3. Synchronously distributes the batch to the remote site and waits for a reply.
  4. Removes the batch from the queue after the remote site has successfully replied.

Because the batch is not removed from the queue until after the remote site has replied, the message cannot get lost. However, this also means that a message can be processed more than once in certain failure scenarios. If a site goes offline in the middle of processing a batch of messages, that same batch is sent again after the site is back online. See About High Availability for WAN Deployments.

The batch is configurable via the batch size and batch time interval settings, which you can specify when you create a gateway sender (see CREATE GATEWAYSENDER). A batch of messages is processed by the queue when either the batch size or the time interval is reached. In an active network, it is likely that the batch size will be reached before the time interval. In an idle network, the time interval will most likely be reached before the batch size. This may result in some network latency that corresponds to the time interval.

How a Gateway Handles Batch Processing Failure

These types of exceptions can occur during batch processing:
  • A gateway receiver fails with acknowledgment. If processing fails while the receiving gateway is processing a batch, it replies with a failure acknowledgment containing the exception, including the identity of the message that failed, and the ID of the last message successfully processed. The sending gateway removes the successfully processed messages and the failed message from the queue and logs the exception and the failed message information. The sender then continues processing the messages remaining in the queue.
  • The receiving gateway fails without acknowledgment. If the receiving gateway does not acknowledge a sent batch, the sender does not know which messages were successfully processed, so it re-sends the entire batch.
  • No remote gateways are available. If a batch processing exception occurs because no remote gateways are available, the batch remains on the queue. The sending gateway waits for a time, then attempts to re-send the batch. The time periods between attempts is five seconds. The existing server monitor continuously attempts to connect to the remote gateway, so eventually a connection is made and queue processing continues. Messages build up in the queue and possibly overflow to disk in the meantime.