CREATE HDFSSTORE

Creates a connection to a Hadoop name node in order to persist one or more tables to HDFS. Each connection defines the HDFS NameNode and directory to use for persisting data, as well as GemFire XD-specific options to configure the queue used to persist table events, enable persistence for the connection, compact the HDFS operational logs, and so forth.

Syntax

CREATE HDFSSTORE store-name
  NAMENODE 'url'
  [ HomeDir 'directory-name' ]

  [ BatchSize integer-constant ]
  [ BatchTimeInterval integer-constant { MILLISECONDS | SECONDS | MINUTES | HOURS | DAYS } ]
  [ MaxQueueMemory integer-constant ]
  [ DispatcherThreads integer-constant ]
 
  [ QueuePersistent boolean-constant ] 
  [ DiskSynchronous boolean-constant ]
  [ DiskStoreName store-name ]

  [ MinorCompact boolean-constant ]
  [ MaxInputFileSize integer-constant ]
  [ MinInputFileCount integer-constant ]
  [ MaxInputFileCount integer-constant ]
  [ MinorCompactionThreads integer-constant ]
 
  [ MajorCompact boolean-constant ]
  [ MajorCompactionInterval integer-constant { MILLISECONDS | SECONDS | MINUTES | HOURS | DAYS } ]
  [ MajorCompactionThreads integer-constant ]

  [ MaxWriteOnlyFileSize integer-constant ]
  [ WriteOnlyFileRolloverInterval  integer-constant { MILLISECONDS | SECONDS | MINUTES | HOURS | DAYS } ]

  [ BlockCacheSize integer-constant ]
  [ ClientConfigFile 'file-name' ]

  [ PurgeInterval integer-constant { MILLISECONDS | SECONDS | MINUTES | HOURS | DAYS } ]
Note: An HDFSSTORE can persist table data using either the HDFS write-only model or the HDFS read/write model; you specify the HDFS persistence model using the CREATE TABLE statement. Although multiple tables can use the same HDFSSTORE for persistence, you will generally need to create multiple HDFSSTORE configurations to modify the queue and compaction behavior for each table.
Note: All authentication and authorization for HDFS access is tied to the user that executes the GemFire XD member process. For example, when you create an HDFS store in GemFire XD, the default directory structure for the store is created under the /user/user-name directory in HDFS, where user-name is the process owner for GemFire XD. (You can override this default by specifying an absolute path beginning with the "/" character, or by setting an alternate HDFS root directory for servers using the hdfs-root-dir property.) The GemFire XD user must be able to authenticate with Kerberos, and must have read/write permission on the HDFS store directory on each Hadoop cluster node. See Configuration Requirements for Secure HDFS.
store-name
(Required.) A unique identifier for the HDFS store configuration.
NAMENODE
(Required.) The URL of the Hadoop NameNode for your Pivotal HD cluster (for example, hdfs://server-name:8020).
Note: Because Pivotal Command Center installs namenode HA by default, you can specify the logical name for the namenode (specified in hdfs-site.xml) instead of the namenode URL.
Note: You cannot use ALTER HDFSSTORE to change this property value after the store is created.
HomeDir
The Hadoop directory in which GemFire XD stores GemFire XD table persistence files for this store. The value must not contain the Hadoop NameNode URL. The owner of the GemFire XD JVM process must have read and write access to this directory in Hadoop.

If you provide a directory name or relative path for HomeDir, then GemFire XD creates the directory in HDFS relative to the Hadoop root directory specified by the hdfs-root-dir property. (If you do not specify a value for hdfs-root-dir, then the Hadoop root directory is /user/user_name, where user-name is the process owner for GemFire XD.)

If you omit the HomeDir option from the CREATE HDFSSTORE statement, then by default GemFire XD creates a directory of the same name as the HDFS store (store-name) in the HDFS root directory. If you omit HomeDir and use the default hdfs-root-dir property, this corresponds to /user/user_name/store-name.

As a best practice, always create HDFS store directories relative to a single HDFS root directory. As an alternative, you can specify an absolute path beginning with the "/" character to override the default root location.

Note: You cannot use ALTER HDFSSTORE to change this property value after the store is created.
BatchSize
GemFire XD creates an internal queue where it stores the table events that are eventually written to HDFS log files. Multiple events are written to HDFS in batches, and the BATCHSIZE parameter defines the maximum size (in megabytes) of each batch that is written to the Hadoop directory. The default size of a batch is 32 MB. This parameter, along with BatchTimeInterval, determine the frequency with which GemFire XD writes queued events.
BatchTimeInterval
The maximum time that can elapse between writing batches to HDFS. The default is 60000 milliseconds (1 minute).
MaxQueueMemory
The maximum amount of memory in megabytes that the queue can consume before overflowing to disk. The default is 100 MB. GemFire XD creates a single HDFS queue per set of colocated tables that are persisted to HDFS.
Note: You cannot use ALTER HDFSSTORE to change this property value after the store is created.
DispatcherThreads
The maximum number of threads used to write batches of events from the queue to HDFS. The default is 5 threads. GemFire XD creates a single HDFS queue per set of colocated tables that are persisted to HDFS. If you have a large number of clients that perform HDFS table operations, then you may need to increase the number of dispatcher threads in the queue to avoid bottlenecks (queues overflowing to GemFire XD operational log files) when writing data to HDFS.
Note: You cannot use ALTER HDFSSTORE to change this property value after the store is created.
QueuePersistent
Include this option to persist the event queue that GemFire XD uses to send table data to HDFS. (The queue is persisted to a local GemFire XD disk store.) By default an HDFS store queue is not persistent.
Note: You cannot use ALTER HDFSSTORE to change this property value after the store is created.
DiskSynchronous
If you enable persistence with QueuePersistent, you can include the DiskSynchronous option to enable or disable asynchronous writes to the local GemFire XD disk store. Specify FALSE to enable asynchronous writes to the disk store. By default (TRUE) a persistent event queue performs synchronous writes to the local GemFire XD disk store.
Note: You cannot use ALTER HDFSSTORE to change this property value after the store is created.
DiskStoreName
The named disk store to use for storing the queue overflow, or for persisting the queue (if QueuePersistent is specified). If you specify a value, the named disk store must exist. If you specify a null value or you omit this option, GemFire XD uses the default disk store for overflow and queue persistence.
MinorCompact
Specify TRUE to enable automatic minor compaction for the HDFS read/write log files. Minor compaction reduces the number of files in HDFS in order to avoid performance degradation in HDFS and the GemFire XD cluster.
Note: Do not disable minor compaction unless you tune other HDFS parameters to avoid severe performance degradation. Turning off minor compaction can cause a very large number of HDFS log files to be created, which can potentially exhaust HDFS receiver threads and/or client sockets. To offset these problems, increase the BatchTimeInterval and BatchSize options to create a fewer number of HDFS log files. As a best practice, leave minor compaction enabled unless compaction causes excessive I/O overhead in HDFS that cannot be resolved by tuning compaction behavior.
MaxInputFileSize
The maximum size of a file (in megabytes) that GemFire XD will consider for minor compaction cycles. Files larger than this value are only affected during major compaction. The default is 512 MB.
MinInputFileCount
The minimum number of input files per bucket that can be created before GemFire XD begins to automatically compact HDSF log files. GemFire XD performs no minor compaction until this number of files have been created for a given bucket, after which files that are smaller than MAXINPUTFILESIZE may be compacted. The default is 4.
Note: Use caution when increasing the MinInputFileCount value, as it applies to each bucket persisted by the HDFS store, rather than to the HDFS store as a whole. As more tables target the HDFS store, additional HDFS file handles are required to manage the number of open files. A large number of buckets combined with a high MinInputFileCount can result in thousands of files opened in HDFS. Ensure that you have configured your operating system to support large numbers of file descriptors, as described in Supported Configurations and System Requirements.
MaxInputFileCount
The maximum number of input files per bucket to include in a minor compaction cycle. The default is 10.
MinorCompactionThreads
The maximum number of threads that GemFire XD uses to perform minor compaction in this HDFS store. Within a given bucket, only one compaction cycle (minor or major) can run at a given time. You can increase the number of threads used for compactions on different buckets as necessary in order to fully utilize the performance of your HDFS cluster and its disks. By default GemFire XD uses 10 threads for minor compaction and 2 threads for major compaction.
MajorCompact
Specify TRUE to enable automatic major compaction for the HDFS read/write log files. Major compaction removes deleted events from the HDFS log files, which can save space in HDFS and improve performance when reading from HDFS log files. GemFire XD performs major compaction by default. As major compaction process can be long-running and I/O-intensive, tune the performance of major compaction using MajorCompactionInterval and MajorCompactionThreads.
MajorCompactionInterval
The amount of time after which GemFire XD performs the next major compaction cycle. The default is 720 minutes. The minimum compaction interval is 1 minute. GemFire XD converts and stores the specified MajorCompactionInterval in minutes.
MajorCompactionThreads
The maximum number of threads that GemFire XD uses to perform major compaction in this HDFS store. Within a given bucket, only one compaction cycle (minor or major) can run at a given time. You can increase the number of threads used for compactions on different buckets as necessary in order to fully utilize the performance of your HDFS cluster and its disks. By default GemFire XD uses 10 threads for minor compaction and 2 threads for major compaction.
MaxWriteOnlyFileSize
For HDFS write-only tables, this defines the maximum size (in megabytes) that an HDFS log file can reach before GemFire XD closes the file and begins writing to a new file. This clause is ignored for HDFS read/write tables. Keep in mind that the operational logs files are not available for MapReduce processing until the file is closed; you can also set WriteOnlyFileRolloverInterval to specify the maximum amount of time an HDFS log file remains open. The default is 256 MB.
WriteOnlyFileRolloverInterval
For HDFS write-only tables, this defines the maximum time that can elapse before GemFire XD closes an HDFS log file and begins writing to a new file. This clause is ignored for HDFS read/write tables. The default is 3600 seconds. The minimum value is 1 second.
BlockCacheSize
The size of the block cache as a percentage (a float in the range 0 to 100) of the available heap space. The default is 10.
Note: You cannot use ALTER HDFSSTORE to change this property value after the store is created.
ClientConfigFile
The full path to the HDFS client configuration file to use for authenticating the GemFire XD service principal to Kerberos. You can omit this option if you installed GemFire XD as a component of Pivotal HD using the Pivotal Command Center CLI. If you installed GemFire XD as a standalone product and your Pivotal HD installation uses secure HDFS, then you must specify the path to a valid configuration file, and each GemFire XD member that writes to HDFS must have a valid configuration file at that same path. See Configuration Requirements for Secure HDFS.
Note: You cannot use ALTER HDFSSTORE to change this property value after the store is created.
PurgeInterval
Defines the amount of time that GemFire XD allows expired HDFS log files to remain available for MapReduce jobs. After this interval has passed, GemFire XD deletes the expired files. The default is 720 minutes (12 hours). The minimum purge interval is 1 minute. GemFire XD converts and stores the specified PurgeInterval in minutes.

Example

Create a persistent connection to a Hadoop directory, storing HDFS log files in the hdfsstore1 subdirectory of the root directory defined by hdfs-root-dir. The HDFS event queue is also persisted using the default GemFire XD disk store:

CREATE HDFSSTORE hdfsstore1
  NAMENODE 'hdfs://gfxd1:8020'
  QueuePersistent true;
Store HDFS log files in the stream-tables directory under the HDFS root directory defined by hdfs-root-dir. This command configures an HDFS queue where data is written to the HDFS store in batches as large as 10 megabytes, and the queue is flushed to the store at least once every 2000 milliseconds. Eight threads are used to write batches from the queue to HDFS log files:
CREATE HDFSSTORE streamingstore
  NAMENODE 'hdfs://gfxd1:8020'
  HomeDir 'stream-tables' 
  BatchSize 10
  BatchTimeInterval 2000 milliseconds
  DispatcherThreads 8
  QueuePersistent true;
Configure an HDFSSTORE with compaction settings for HDFS read/write persistence. Here, GemFire XD performs both minor and major compaction on the resulting HDFS store persistence files. Minor compaction is performed on files up to 12 MB in size, and can involve as many as 8 files at a time. Any files larger than 12 MB are compacted during the major compaction cycle, which occurs every 10 minutes. A maximum of 3 threads are used in either compaction cycle:
CREATE HDFSSTORE readwritestore
  NAMENODE 'hdfs://gfxd1:8020'
  BatchSize 10
  BatchTimeInterval 2000 milliseconds
  QueuePersistent true
  MinorCompact true
  MajorCompact true
  MaxInputFileSize 12
  MinInputFileCount 1
  MaxInputFileCount 8
  MinorCompactionThreads 3
  MajorCompactionInterval 10 minutes
  MajorCompactionThreads 3;