Troubleshooting Common Problems

Solve common problems that may be encountered when using GemFire XD.

ClassNotFoundException after Upgrade

After upgrading a GemFire XD member, you may receive a ClassNotFoundException when starting the member, referencing package names that begin with com.vmware.sqlfire or com.pivotal.sqlfire. These exceptions occur if you failed to stop and remove older procedure or listener implementations that use package names from the SQLFire product. Stop and remove these objects before upgrading to GemFire XD. See Performing a Manual Upgrade Using the RPM Distribution or Performing a Manual Upgrade Using the ZIP Distribution for more information.
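For example, you might drop an old procedure that still references SQLFire-era classes before upgrading, and re-create it afterward against the new com.pivotal.gemfirexd packages. A sketch in the gfxd interactive shell (the connection address and object name are hypothetical):

```shell
gfxd
gfxd> connect client 'localhost:1527';
gfxd> -- Hypothetical procedure that wraps a SQLFire-era implementation class.
gfxd> drop procedure app.old_sqlfire_listener;
gfxd> -- After the upgrade completes, re-create the procedure using classes
gfxd> -- from the com.pivotal.gemfirexd packages.
```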

Member Startup Problems

When you start GemFire XD members, startup delays can occur if specific disk store files on other members are unavailable. This is part of the normal startup behavior and is designed to help ensure data consistency. For example, consider the following startup message for a locator ("locator2"):
GemFire XD Locator pid: 23537 status: waiting
Waiting for DataDictionary (DiskId: 531fc5bb-1720-4836-a468-3d738a21af63, Location: /pivotal/locator2/./datadictionary) on: 
 [DiskId: aa77785a-0f03-4441-84f7-6eb6547d7833, Location: /pivotal/server1/./datadictionary]
 [DiskId: f417704b-fff4-4b99-81a2-75576d673547, Location: /pivotal/locator1/./datadictionary]
Here, the startup messages indicate that locator2 is waiting for the persistent data dictionary files on locator1 and server1 to become available. GemFire XD always persists the data dictionary for the indexes and tables that you create, even if you do not configure those tables to persist their stored data. The startup messages above indicate that locator1 or server1 might store a newer copy of the data dictionary for the distributed system.
Continuing the startup by booting the server1 data store yields:
Starting GemFire XD Server using locators for peer discovery: localhost[10337],localhost[10338]
Starting network server for GemFire XD Server at address localhost/[1529]
Logs generated in /pivotal/server1/gfxdserver.log
The server is still starting. 15 seconds have elapsed since the last log message: 
 Region /_DDL_STMTS_META_REGION has potentially stale data. It is waiting for another member to recover the latest data.
My persistent id:

  DiskStore ID: aa77785a-0f03-4441-84f7-6eb6547d7833
  Location: /

Members with potentially new data:
  DiskStore ID: f417704b-fff4-4b99-81a2-75576d673547
  Location: /
Use the "gfxd list-missing-disk-stores" command to see all disk stores that are being waited on by other members.
The data store startup messages indicate that locator1 has "potentially new data" for the data dictionary. In this case, both locator2 and server1 were shut down before locator1 in the system, so those members are waiting on locator1 to ensure that they have the latest version of the data dictionary.

The above messages for data stores and locators may indicate that some members were not started. If the indicated disk store persistence files are available on the missing member, simply start that member and allow the running members to recover. In the example above, you would start locator1 and allow locator2 and server1 to synchronize their data.
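For the example above, that recovery amounts to starting the one missing member; a sketch using the paths and ports from the earlier startup messages (treat the exact options as illustrative):

```shell
# Start the member whose disk store the others are waiting on. The -dir
# value matches the Location shown in the waiting members' messages.
gfxd locator start -dir=/pivotal/locator1 -peer-discovery-port=10337
# locator2 and server1 then synchronize their data dictionaries and
# finish starting on their own.
```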

To avoid this type of delayed startup and recovery:
  1. When possible, shut down data store members (shut-down-all) after disk stores have been synchronized in the system. Shut down remaining locator members after the data stores have stopped.
  2. Make sure that all persistent members are restarted properly. See Recovering from a ConflictingPersistentDataException for more information.
  3. If a member cannot be restarted and it is preventing other data stores from starting, use revoke-missing-disk-store to revoke the disk stores that are preventing startup. Revoking a disk store can cause some loss of data if the revoked disk store actually contains recent changes to the data dictionary or to table data. Revoked disk stores cannot be added back to the system later, and if you revoke a disk store on a member, you must delete the associated disk files from that member in order to start it again. Use the revoke-missing-disk-store command only as a last resort.
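The disk store commands referenced in these steps take the system's locator address; a sketch using the addresses and disk store ID from the earlier example (treat the exact option names as illustrative):

```shell
# Show disk stores that running members are still waiting on.
gfxd list-missing-disk-stores -locators=localhost[10337]

# Last resort: revoke a missing disk store by its reported ID. Any recent
# data in the revoked store is lost, and its files must be deleted before
# that member can be started again.
gfxd revoke-missing-disk-store -id=f417704b-fff4-4b99-81a2-75576d673547 -locators=localhost[10337]
```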

Recovering from a ConflictingPersistentDataException

If you receive a ConflictingPersistentDataException during startup, it indicates that you have multiple copies of some persistent data and GemFire XD cannot determine which copy to use. Normally GemFire XD uses metadata to automatically determine which copy of persistent data to use. Each member persists, along with the data dictionary or table data, a list of other members that have the data and whether their data is up to date.

A ConflictingPersistentDataException happens when two members compare their metadata and find that it is inconsistent—they either don’t know about each other, or they both believe that the other member has stale data. The following are some scenarios that can cause a ConflictingPersistentDataException.

Independently-created copies

Trying to merge two independently-created distributed systems into a single distributed system causes a ConflictingPersistentDataException. There are a few ways to end up with independently-created systems.

Pointing all members of independent systems to the same set of locators in an attempt to merge them results in a ConflictingPersistentDataException.

GemFire XD cannot merge independently-created data for the same table. Instead, you need to export the data from one of the systems and import it into the other system. See Exporting and Importing Data with GemFire XD.

Starting new members first

Starting a brand new member with no persistent data before starting older members that have persistent data can cause a ConflictingPersistentDataException.

This can happen by accident if you shut down the system, then add a new member to the startup scripts, and finally start all members in parallel. In this case, the new member may start first. If this occurs, the new member creates an empty, independent copy of the data before the older members start up. When the older members start, the situation is similar to that described above in “Independently-created copies.”

In this case, the fix is simply to shut down the new member, move aside or delete its (empty) persistence files, and then restart the older members. After the older members have fully recovered, restart the new member.
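The fix can be sketched as follows (the member name server3 and its directory are hypothetical):

```shell
# Stop the accidentally-started new member, then set its empty
# persistence files aside so they cannot conflict on restart.
gfxd server stop -dir=/pivotal/server3
mv /pivotal/server3/datadictionary /pivotal/server3/datadictionary.stale

# Restart the older members first. Once they have fully recovered,
# start server3 again so it joins with a fresh copy of the data.
gfxd server start -dir=/pivotal/server3 -locators=localhost[10337]
```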

A network split, with enable-network-partition-detection set to false

With enable-network-partition-detection set to true, GemFire XD detects a network partition and shuts down members to prevent a "split brain." In this case no conflicts should occur when the system is restored.

However, if enable-network-partition-detection is false, GemFire XD cannot prevent a "split brain" after a network partition. Instead, each side of the network partition records that the other side of the partition has stale data. When the partition is healed and persistent members are restarted, they find a conflict because each side believes the other side's members are stale.

In some cases it may be possible to choose between sides of the network partition and keep only the data from one side of the partition. Otherwise you may need to salvage data and import it into a fresh system.

Resolving a ConflictingPersistentDataException

If you receive a ConflictingPersistentDataException, you will not be able to start all of your members and have them join the same distributed system.

First, determine if there is one part of the system that you can recover. For example, if you just added some new members to the system, try to start up without including those members. For the remaining members, use the data extractor tool to extract data from the persistence files and import it into a running system. See Recovering Data from Disk Stores.

Connection Problems

The following sections describe common problems that occur when connecting to a GemFire XD distributed system.

WAN Replication Problems

In WAN deployments, tables may fail to synchronize between two GemFire XD distributed systems if the tables are not identical to one another (see Create Tables with Gateway Senders). If you have configured WAN replication between sites but a table fails to synchronize because of schema differences, follow these steps to correct the situation:

  1. Stop the gateway senders and gateway receivers in each GemFire XD distributed system. See Start and Stop Gateway Senders.
  2. Use ALTER TABLE to add or drop columns on the problem table, to ensure that both tables have the same column definitions. Compare the output of the describe command for each table to ensure that the tables are the same. Or, use write-schema-to-sql in each distributed system to compare the DDL statements used to create each table.
  3. Use SYS.GET_TABLE_VERSION to verify that both tables have the same version in the data dictionary of each GemFire XD cluster. If the versions do not match, use SYS.INCREMENT_TABLE_VERSION on the table with the smaller version to make both table versions equal.
  4. Restart gateway senders and gateway receivers for the distributed systems. See Start and Stop Gateway Senders.
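In the gfxd shell, steps 2 and 3 might look like the following (the schema, table, and column names are hypothetical; run the ALTER TABLE and the increment on whichever site is behind):

```shell
gfxd> -- Step 2: make the column definitions identical on both sites.
gfxd> alter table app.orders add column region_code varchar(10);
gfxd> describe app.orders;
gfxd> -- Step 3: compare the data dictionary version on each site...
gfxd> values sys.get_table_version('APP', 'ORDERS');
gfxd> -- ...and increment the smaller one until the versions match.
gfxd> call sys.increment_table_version('APP', 'ORDERS', 1);
```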

Preventing Disk Full Errors

It is important to monitor the disk usage of GemFire XD members. If a member lacks sufficient disk space for a disk store, the member attempts to shut down the disk store and its associated tables, and logs an error message. After you make sufficient disk space available to the member, you can restart it. (See Member Startup Problems.) A shutdown caused by a member running out of disk space can result in loss of data, data file corruption, log file corruption, and other error conditions that can negatively impact your applications.

You can help prevent disk full errors by using the following techniques:
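For example, one such technique is a periodic disk-usage check on each member's disk store directory. A minimal sketch, assuming a directory and warning threshold for your deployment (run it from cron or your monitoring system):

```shell
#!/bin/sh
# Minimal sketch of a disk-usage check for a GemFire XD disk store
# directory. The directory and threshold are assumptions; point DIR at
# each member's disk store location in a real deployment.
DIR="${1:-.}"          # disk store directory to watch
THRESHOLD="${2:-90}"   # warn when usage exceeds this percentage
USED=$(df -P "$DIR" | awk 'NR==2 { gsub("%", "", $5); print $5 }')
if [ "$USED" -ge "$THRESHOLD" ]; then
    echo "WARNING: $DIR is ${USED}% full (threshold ${THRESHOLD}%)"
else
    echo "OK: $DIR is ${USED}% full"
fi
```

Alerting well before the disk actually fills gives you time to add capacity without the member shutting down its disk stores.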

Recovering from Disk Full Errors

If a member of your GemFire XD distributed system fails due to a disk full condition, make additional disk capacity available and attempt to restart the member normally. If the member does not restart and a redundant copy of its tables exists in a disk store on another member, you can restore the member using the following steps:

  1. Delete or move the disk store files from the failed member.
  2. Use the list-missing-disk-stores gfxd command to identify any missing data. You may need to manually restore this data.
  3. Revoke the member's missing disk store using the revoke-missing-disk-store command.
  4. Restart the member.
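Assuming a redundant copy exists on another member, the steps above might look like this (the server directory and disk store ID are hypothetical, and the option names are illustrative):

```shell
# 1. Set aside the failed member's disk store files.
mv /pivotal/server2/datadictionary /pivotal/server2/datadictionary.bad

# 2. Identify any data the system is now missing.
gfxd list-missing-disk-stores -locators=localhost[10337]

# 3. Revoke the failed member's old disk store by its reported ID.
gfxd revoke-missing-disk-store -id=<disk-store-id> -locators=localhost[10337]

# 4. Restart the member.
gfxd server start -dir=/pivotal/server2 -locators=localhost[10337]
```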

See Handling Missing Disk Stores.