Troubleshooting Common Problems

Solve common problems that you may encounter when using GemFire XD.

ClassNotFoundException after Upgrade

After upgrading a GemFire XD member, you may receive a ClassNotFoundException when starting the member, referencing package names that begin with com.vmware.sqlfire or com.pivotal.sqlfire. These exceptions occur if you did not stop and remove older procedure or listener implementations that use package names from the SQLFire product. Stop and remove these objects before upgrading to GemFire XD. See RHEL: Upgrade GemFire XD from RPM or Upgrade GemFire XD from a ZIP File for more information.

Member Startup Problems

When you start GemFire XD members, startup delays can occur if specific disk store files on other members are unavailable. This can occur in a healthy system, depending on the order in which members are started. For example, consider the following startup message for a locator ("locator2"):
GemFire XD Locator pid: 23537 status: waiting
Waiting for DataDictionary (DiskId: 531fc5bb-1720-4836-a468-3d738a21af63, Location: /pivotal/locator2/./datadictionary) on: 
 [DiskId: aa77785a-0f03-4441-84f7-6eb6547d7833, Location: /pivotal/server1/./datadictionary]
 [DiskId: f417704b-fff4-4b99-81a2-75576d673547, Location: /pivotal/locator1/./datadictionary]
Here, the startup messages indicate that locator2 is waiting for the persistent datadictionary files on locator1 and server1 to become available. GemFire XD always persists the data dictionary for indexes and tables that you create, even if you do not configure those tables to persist their stored data. The startup messages above indicate that locator2 was shut down before the distributed system consisting of itself, locator1, and server1 could shut down gracefully, and that locator1 or server1 might store a newer copy of the data dictionary for the distributed system.
Continuing the startup by booting the server1 data store yields:
Starting GemFire XD Server using locators for peer discovery: localhost[10337],localhost[10338]
Starting network server for GemFire XD Server at address localhost/127.0.0.1[1529]
Logs generated in /pivotal/server1/gfxdserver.log
The server is still starting. 15 seconds have elapsed since the last log message: 
 Region /_DDL_STMTS_META_REGION has potentially stale data. It is waiting for another member to recover the latest data.
My persistent id:

  DiskStore ID: aa77785a-0f03-4441-84f7-6eb6547d7833
  Name: 
  Location: /10.0.1.31:/pivotal/server1/./datadictionary

Members with potentially new data:
[
  DiskStore ID: f417704b-fff4-4b99-81a2-75576d673547
  Name: 
  Location: /10.0.1.31:/pivotal/locator1/./datadictionary
]
Use the "gfxd list-missing-disk-stores" command to see all disk stores that are being waited on by other members.
The data store startup messages indicate that locator1 has "potentially new data" for the data dictionary. In this case, both locator2 and server1 were shut down before locator1 in the system, so those members are waiting on locator1 to ensure that they have the latest version of the data dictionary.
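When several members are waiting at once, it can help to pull the awaited disk stores out of the status output programmatically. The following is a minimal sketch that parses the locator2 message quoted above. The function name and the regular expression are illustrative assumptions based only on the message format shown here, not part of GemFire XD.

```python
import re

# Matches "DiskId: <id>, Location: <path>" pairs as they appear in the
# locator status output quoted above (format assumed from that sample).
DISK_ENTRY = re.compile(r"DiskId:\s*([0-9a-f-]+),\s*Location:\s*([^\])\s]+)")


def parse_waiting_message(text):
    """Return (waiting_store, awaited_stores) as (disk_id, location) tuples."""
    entries = DISK_ENTRY.findall(text)
    if not entries:
        return None, []
    # The first entry is the member's own data dictionary; the remaining
    # entries are the disk stores it is waiting on.
    return entries[0], entries[1:]


status = """Waiting for DataDictionary (DiskId: 531fc5bb-1720-4836-a468-3d738a21af63, Location: /pivotal/locator2/./datadictionary) on:
 [DiskId: aa77785a-0f03-4441-84f7-6eb6547d7833, Location: /pivotal/server1/./datadictionary]
 [DiskId: f417704b-fff4-4b99-81a2-75576d673547, Location: /pivotal/locator1/./datadictionary]"""

local, awaited = parse_waiting_message(status)
for disk_id, location in awaited:
    print(disk_id, location)
```

The locations printed identify which members (here server1 and locator1) should be started so that the waiting member can recover.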

The above messages for data stores and locators can be commonplace when individual members are shut down one-by-one rather than by using gfxd shut-down-all, which allows all members to synchronize and shut down gracefully. However, if the indicated disk store persistence files are available on the missing member, simply start that member and allow the running members to recover. For example, in the above system you would simply start locator1 and allow locator2 and server1 to synchronize their data.

To avoid this type of delayed startup and recovery:
  1. Use shut-down-all to gracefully shut down all data store members after synchronizing disk stores with the available locators.
  2. Use the gfxd locator command (gfxd locator stop) to shut down the remaining locator members after the data stores have stopped.
  3. If a member cannot be restarted and it is preventing other data stores from starting, use revoke-missing-disk-store to revoke the disk stores that are preventing startup. This can cause some loss of data if the revoked disk store actually contains recent changes to the data dictionary or to table data.
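The shutdown order in the first two steps can be sketched as a small helper that assembles the command lines in sequence. This is only an illustration: the -locators and -dir options are assumed from typical gfxd usage and should be checked against the gfxd help output for your release. The locator address and directories reuse the values from the examples above.

```python
def shutdown_plan(locator_address, locator_dirs):
    """Build the ordered gfxd command lines for a graceful full shutdown."""
    # Step 1: stop all data stores together so they can synchronize
    # their disk stores before exiting. (-locators is an assumed option.)
    plan = [["gfxd", "shut-down-all", f"-locators={locator_address}"]]
    # Step 2: stop each locator only after the data stores have stopped.
    # (-dir is an assumed option naming the locator's working directory.)
    for directory in locator_dirs:
        plan.append(["gfxd", "locator", "stop", f"-dir={directory}"])
    return plan


for command in shutdown_plan("localhost[10337]",
                             ["/pivotal/locator1", "/pivotal/locator2"]):
    print(" ".join(command))
```

The sketch only prints the commands. Running them from a script (for example with subprocess.run) would additionally require checking each exit status before moving on to the next step.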
If persistence disk store files for the data dictionary are deleted, moved, or modified, further complications can occur during startup. These problems are generally indicated by a ConflictingPersistentDataException while starting up other members of the system. For example:
ConflictingPersistentDataException: Region /_DDL_STMTS_META_REGION remote member curwen(23695)<v1>:4505 with 
persistent data /10.0.1.31:/pivotal/locator1/./datadictionary created at timestamp 1373667883741 version 0 diskStoreId 
9cf5aea67c6c4374-9d7205f72fecd47c name  was not part of the same distributed system as the local data from 
/10.0.1.31:/pivotal/server1/./datadictionary created at timestamp 1373649392112 version 0 diskStoreId aa77785a0f034441-84f76eb6547d7833 
name  - See log file for details.
If the datadictionary directory is deleted or moved, then GemFire XD creates a new data dictionary upon startup of that member. (Remember, all GemFire XD locators and data stores maintain a persistent data dictionary for the distributed system, even if you do not persist data in the tables.) However, other members in the distributed system may expect the locator or data store to have the previous, deleted version of the datadictionary, in order to recover more recent operations. If this occurs, the newly created data dictionary conflicts with the other members' view of the distributed system, and new members that start up throw a ConflictingPersistentDataException.
To resolve a ConflictingPersistentDataException:
  1. Shut down the member that is causing the exception. In the above example, you would shut down remote member curwen(23695).
  2. Restore the original datadictionary directory in the shut down member, if possible. Then restart the member with the expected data dictionary files.
  3. If you cannot restore the original datadictionary directory, use revoke-missing-disk-store to revoke the missing data dictionary disk store files.
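To decide which member to shut down in step 1, the relevant fields can be extracted from the exception text. The sketch below parses the example message shown above. The regular expressions are assumptions derived from that one sample, and reading the timestamps as milliseconds since the Unix epoch is also an assumption.

```python
import re
from datetime import datetime, timezone

# Patterns assumed from the single sample message quoted above.
REMOTE = re.compile(
    r"remote member (\S+) with persistent data (\S+) created at timestamp (\d+)")
LOCAL = re.compile(r"local data from (\S+) created at timestamp (\d+)")


def parse_conflict(message):
    """Extract the remote member and both data dictionary locations."""
    text = " ".join(message.split())  # undo line wrapping in the log
    remote, local = REMOTE.search(text), LOCAL.search(text)
    if not (remote and local):
        return None
    return {
        "remote_member": remote.group(1),
        "remote_location": remote.group(2),
        "remote_created_ms": int(remote.group(3)),
        "local_location": local.group(1),
        "local_created_ms": int(local.group(2)),
    }


message = """ConflictingPersistentDataException: Region /_DDL_STMTS_META_REGION remote member curwen(23695)<v1>:4505 with
persistent data /10.0.1.31:/pivotal/locator1/./datadictionary created at timestamp 1373667883741 version 0 diskStoreId
9cf5aea67c6c4374-9d7205f72fecd47c name  was not part of the same distributed system as the local data from
/10.0.1.31:/pivotal/server1/./datadictionary created at timestamp 1373649392112 version 0 diskStoreId aa77785a0f034441-84f76eb6547d7833
name  - See log file for details."""

info = parse_conflict(message)
print("Shut down:", info["remote_member"])
for key in ("remote_created_ms", "local_created_ms"):
    # Assumes epoch milliseconds; shown in UTC for comparison.
    print(key, datetime.fromtimestamp(info[key] / 1000, tz=timezone.utc))
```

In this sample the remote data dictionary has the later creation timestamp, which would be consistent with a datadictionary directory that was recreated after the original was deleted or moved.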

If you cannot resolve startup problems associated with missing or conflicting data dictionary files, you can force the GemFire XD member to complete its startup by using the gemfirexd.datadictionary.allow-startup-errors property. This property enables you to start up a GemFire XD member even if the volume or directory in which a disk store was created no longer exists; you can recreate the disk store manually after forcing the member to restart.

Connection Problems

These are common problems that occur when connecting to a GemFire XD distributed system:
  • You receive SQL State 08001 Error: 'Failed after trying all available servers: []'

    This problem can occur if you specify null values for the username and password connection properties in the JDBC connection URL. Some third-party tools include these connection properties with automatically supplied null values if you do not specify user credentials.

    If authentication is disabled in your distributed system, then you can specify any temporary user name and password value when connecting. Connecting to GemFire XD with JDBC Tools provides more details.
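    If you assemble connection properties in a script before handing them to a tool or driver, a small guard can replace null-like credentials with a temporary value. This is a hedged sketch: the property names user and password and the placeholder value are assumptions for illustration, and it is only appropriate when authentication is disabled.

```python
PLACEHOLDER = "guest"  # hypothetical temporary value; any value should do
                       # when authentication is disabled


def sanitize_credentials(props, placeholder=PLACEHOLDER):
    """Return a copy of props with null-like credentials replaced."""
    cleaned = dict(props)
    # "user" and "password" are assumed property names.
    for key in ("user", "password"):
        if key in cleaned and cleaned[key] in (None, "", "null"):
            cleaned[key] = placeholder
    return cleaned


print(sanitize_credentials({"user": None, "password": "null"}))
```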

WAN Replication Problems

In WAN deployments, tables may fail to synchronize between two GemFire XD distributed systems if the tables are not identical to one another (see Create Tables with Gateway Senders). If you have configured WAN replication between sites but a table fails to synchronize because of schema differences, follow these steps to correct the situation:
  1. Stop the gateway senders and gateway receivers in each GemFire XD distributed system. See Start and Stop Gateway Senders.
  2. Use ALTER TABLE to add or drop columns on the problem table, to ensure that both tables have the same column definitions. Compare the output of the describe command for each table to ensure that the tables are the same. Or, use write-schema-to-sql in each distributed system to compare the DDL statements used to create each table.
  3. Use SYS.GET_TABLE_VERSION to verify that both tables have the same version in the data dictionary of each GemFire XD cluster. If the versions do not match, use SYS.INCREMENT_TABLE_VERSION with the table having the smaller version to make both table versions equal.
  4. Restart gateway senders and gateway receivers for the distributed systems. See Start and Stop Gateway Senders.
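The comparisons in steps 2 and 3 can be sketched as follows, assuming you have already collected each site's column definitions (for example from the describe output) and table versions. The column names, types, site labels, and version numbers here are hypothetical, and the sketch does not compare column order.

```python
def column_differences(site_a, site_b):
    """Compare two {column_name: type} mappings, one per site."""
    only_a = sorted(set(site_a) - set(site_b))
    only_b = sorted(set(site_b) - set(site_a))
    mismatched = sorted(
        name for name in set(site_a) & set(site_b)
        if site_a[name] != site_b[name])
    return only_a, only_b, mismatched


def version_gap(version_a, version_b):
    """Return (site_to_increment, amount), or None if versions match."""
    if version_a == version_b:
        return None
    # The table with the smaller version is the one to increment.
    if version_a < version_b:
        return ("A", version_b - version_a)
    return ("B", version_a - version_b)


# Hypothetical column definitions gathered from each site.
site_a = {"ID": "INTEGER", "NAME": "VARCHAR(50)"}
site_b = {"ID": "INTEGER", "NAME": "VARCHAR(100)", "CREATED": "TIMESTAMP"}

print(column_differences(site_a, site_b))
print(version_gap(3, 5))
```

Empty difference lists and a version gap of None would suggest that the tables are ready for the gateway senders and receivers to be restarted.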

Preventing Disk Full Errors

It is important to monitor the disk usage of GemFire XD members. If a member lacks sufficient disk space for a disk store, the member attempts to shut down the disk store and its associated tables, and logs an error message. After you make sufficient disk space available to the member, you can restart the member. (See Member Startup Problems.) A shutdown due to a member running out of disk space can cause loss of data, data file corruption, log file corruption and other error conditions that can negatively impact your applications.

You can prevent disk full errors using the following techniques:
  • Use default pre-allocation for disk store files and disk store metadata files. Pre-allocation reserves disk space for these files and leaves the member in a healthy state when the disk store is shut down, allowing you to restart the member once sufficient disk space has been made available. Pre-allocation is configured by default.
    Pre-allocation is governed by the following system properties:
    Note: Pivotal recommends using ext4 filesystems on Linux platforms, because ext4 supports preallocation, which speeds disk startup performance. If you are using ext3 filesystems in latency-sensitive environments with high write throughput, you can improve disk startup performance by setting the MAXLOGSIZE property of a disk store to a value lower than the default 1 GB. See CREATE DISKSTORE.
  • Monitor GemFire XD logs for low disk space warnings. GemFire XD logs disk space warnings in the following situations:
    • Log file directory—logs a warning if the available space is less than 100 MB.
    • Disk store directory—logs a warning if the usable space is less than 1.15 times the space required to create a new oplog file.
    • Data dictionary—logs a warning if the remaining space is less than 50 MB.

    You can configure the log message frequency with the gemfire.DISKSPACE_WARNING_INTERVAL system property.
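    The thresholds above can also be checked from an external monitoring script before GemFire XD itself logs a warning. The following is a minimal sketch under two assumptions: that the MB thresholds are binary megabytes, and that the space required for a new oplog file equals the disk store's MAXLOGSIZE (1 GB by default). The function name and labels are illustrative.

```python
import shutil

MB = 1024 * 1024  # assumption: thresholds are binary megabytes


def low_space_warnings(log_free, store_usable, oplog_size, dictionary_free):
    """Return the areas whose free space is below the documented thresholds."""
    warnings = []
    if log_free < 100 * MB:
        warnings.append("log file directory")
    if store_usable < 1.15 * oplog_size:
        warnings.append("disk store directory")
    if dictionary_free < 50 * MB:
        warnings.append("data dictionary")
    return warnings


# Example: check the current directory as if it held all three areas.
free = shutil.disk_usage(".").free
print(low_space_warnings(free, free, 1024 * MB, free))
```

In practice each argument would come from shutil.disk_usage on the member's actual log, disk store, and datadictionary directories.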

Recovering from Disk Full Errors

If a member of your GemFire XD distributed system fails due to a disk full error condition, add or make additional disk capacity available and attempt to restart the member normally. If the member does not restart and there is a redundant copy of its tables in a disk store on another member, you can restore the member using the following steps:

  1. Delete or move the disk store files from the failed member.
  2. Use the list-missing-disk-stores gfxd command to identify any missing data. You may need to manually restore this data.
  3. Revoke the member's missing disk store using the revoke-missing-disk-store command.
  4. Restart the member.

See Handling Missing Disk Stores.
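
For step 1 of the recovery procedure, moving the files aside is generally safer than deleting them, because the originals remain available if you later need to restore data manually. The sketch below moves every file from a disk store directory into a backup directory. The function name, the directory layout, and the sample file name are hypothetical, and the demonstration runs against a temporary directory rather than a real member.

```python
import shutil
import tempfile
from pathlib import Path


def move_disk_store_files(store_dir, backup_dir):
    """Move every file in store_dir to backup_dir; return the moved names."""
    store_dir, backup_dir = Path(store_dir), Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)
    moved = []
    for entry in sorted(store_dir.iterdir()):
        if entry.is_file():
            shutil.move(str(entry), str(backup_dir / entry.name))
            moved.append(entry.name)
    return moved


# Demonstration against throwaway directories; the file name is hypothetical.
with tempfile.TemporaryDirectory() as root:
    store = Path(root) / "store"
    store.mkdir()
    (store / "example-oplog.dat").write_text("placeholder")
    moved = move_disk_store_files(store, Path(root) / "backup")
    print(moved)
```

Run a script like this only while the failed member is stopped, and ideally place the backup directory on a volume other than the one that filled up.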