Persist Tables to Hadoop

Up to this point, the tutorial clusters have managed all table primary keys and indexes in memory, while persisting table data to local GemFire XD disk store files. In this tutorial, you will create a table that persists data to HDFS log files.

Prerequisites

Note: This tutorial and the next tutorial require that you use GemFire XD with Pivotal HD Enterprise for Hadoop services. The simplest way to complete these tutorials is to download and run the Pivotal HD Single Node VM, available at http://www.pivotal.io/big-data/pivotal-hd. The tutorial assumes that you perform the steps from within the downloaded virtual machine.

If you are not using the virtual machine, ensure that you have followed the installation instructions to copy the necessary Pivotal HD library files to the /ext-lib subdirectory of your GemFire XD directory, and that the user running your local GemFire XD members has permission to access HDFS nodes. See Installing GemFire XD and Configuration Requirements for Secure HDFS for more information.
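
If you are installing the library files manually, the copy step looks roughly like the following sketch. The source path is a placeholder and the GemFire XD location shown is the one used on the Single Node VM; substitute the paths for your own installation and see Installing GemFire XD for the exact list of required jar files:

    # Sketch only -- the source path is a placeholder; consult Installing GemFire XD for the required jars
    $ cp <path-to-PHD-client-libraries>/*.jar /usr/lib/gphd/gfxd/ext-lib/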

Procedure
  1. Start the virtual machine and open a terminal window.
  2. Execute the following script to start the GemFire XD distributed system along with the required Pivotal HD services:
    $ ~/Desktop/start_gfxd.sh
    Note: It can take several minutes or more to start all services on the virtual machine, depending on the configuration of your host computer.
  3. To ensure that you can access the Pivotal HD namenode, open a new terminal window and use the hadoop fs command to list the contents of the Pivotal HD filesystem:
    $ hadoop fs -ls hdfs://pivhdsne:8020/
    Found 7 items
    drwxr-xr-x   - hdfs    hadoop          0 2014-03-10 18:39 hdfs://pivhdsne:8020/apps
    drwxr-xr-x   - gpadmin hadoop          0 2014-03-10 18:44 hdfs://pivhdsne:8020/hawq_data
    drwxr-xr-x   - hdfs    hadoop          0 2014-03-10 18:40 hdfs://pivhdsne:8020/hive
    drwxr-xr-x   - mapred  hadoop          0 2014-03-10 18:38 hdfs://pivhdsne:8020/mapred
    drwxrwxrwx   - hdfs    hadoop          0 2014-03-10 18:38 hdfs://pivhdsne:8020/tmp
    drwxrwxrwx   - hdfs    hadoop          0 2014-03-10 18:45 hdfs://pivhdsne:8020/user
    drwxr-xr-x   - hdfs    hadoop          0 2014-03-10 18:39 hdfs://pivhdsne:8020/yarn

    (If you are not using the Pivotal HD Single Node VM, substitute the correct URL for your HDFS namenode.) Keep this second terminal window open so that you can view results in HDFS.

  4. Change to the GemFire XD quickstart directory and connect to the GemFire XD distributed system as a client using gfxd:
    $ cd /usr/lib/gphd/gfxd/quickstart
    $ gfxd
    gfxd version 1.3.1
    gfxd> connect client 'localhost:1527';
  5. In the gfxd session terminal, create an HDFS store to use for persisting table data to Hadoop:
    gfxd> create hdfsstore flightstore namenode 'hdfs://pivhdsne:8020' homedir '/flights';
    0 rows inserted/updated/deleted
    Note: The CREATE HDFSSTORE command has many additional properties that configure the queues used to persist table data to HDFS and the compaction behavior for read/write HDFS files. This example does not specify those options, so default values are used. See CREATE HDFSSTORE for more information about the available options and defaults.
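
    For example, an HDFS store that persists its dispatch queue to disk and uses a larger batch size might look like the following sketch. The store name, home directory, and option names shown here (QUEUEPERSISTENT, BATCHSIZE) are illustrative assumptions; verify the exact option syntax against the CREATE HDFSSTORE reference before using it:
    gfxd> create hdfsstore flightstore2 namenode 'hdfs://pivhdsne:8020' homedir '/flights2'
          queuepersistent true
          batchsize 32;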
  6. GemFire XD creates the new HDFS store under the /flights directory in HDFS (the home directory that you specified). You can verify this using the hadoop command from the terminal prompt:
    $ hadoop fs -ls hdfs://pivhdsne:8020/flights
    Found 2 items
    drwxr-xr-x   - gpadmin hadoop          0 2014-03-29 05:35 hdfs://pivhdsne:8020/flights/.hopmeta
    -rw-r--r--   3 gpadmin hadoop          8 2014-03-29 05:35 hdfs://pivhdsne:8020/flights/cleanUpInterval

    Because no tables are using the HDFS store yet, there's only a single directory available for storing metadata. The metadata at this point consists only of the DDL command that was used to create the HDFS store itself. As you create tables that use the HDFS store, the DDL for those tables is added to the metadata directory, and is used by MapReduce jobs to understand the schema of the table data stored in HDFS log files.
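
    If you want to inspect this metadata yourself, list the .hopmeta directory from the second terminal window (the exact file names you see may differ):
    $ hadoop fs -ls hdfs://pivhdsne:8020/flights/.hopmeta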

  7. The table that we will configure to use the HDFS store is the FLIGHTAVAILABILITY table. First use the gfxd terminal window to drop the current version of this table from the schema if it already exists:
    gfxd> drop table if exists app.flightavailability;
    0 rows inserted/updated/deleted
    
  8. Create a new version of the table, using both the EVICTION BY CRITERIA clause and the HDFSSTORE clause, as follows:
    gfxd> CREATE TABLE FLIGHTAVAILABILITY
        (
           FLIGHT_ID VARCHAR(6) NOT NULL ,
           SEGMENT_NUMBER INTEGER NOT NULL ,
           FLIGHT_DATE DATE NOT NULL ,
           ECONOMY_SEATS_TAKEN INTEGER DEFAULT 0,
           BUSINESS_SEATS_TAKEN INTEGER DEFAULT 0,
           FIRSTCLASS_SEATS_TAKEN INTEGER DEFAULT 0,
           PRIMARY KEY (FLIGHT_ID, SEGMENT_NUMBER, FLIGHT_DATE)
        )
        PARTITION BY PRIMARY KEY
        EVICTION BY CRITERIA (FIRSTCLASS_SEATS_TAKEN = 0)
        EVICT INCOMING
        PERSISTENT
        HDFSSTORE (flightstore);
    0 rows inserted/updated/deleted
    

    In this CREATE TABLE command, the HDFSSTORE clause specifies the HDFS store that GemFire XD will use to persist data for the table. The EVICTION BY CRITERIA clause defines the in-memory, operational data set for the table. (As an alternative, you can configure the table to evict matching data periodically using the EVICTION FREQUENCY clause; a sketch of that variant follows the note below.) Each time a new row is inserted or updated with the column value firstclass_seats_taken=0, GemFire XD evicts that row from memory. The row remains available in HDFS for processing historical data with MapReduce jobs or with HAWQ. You'll see the results of this eviction behavior in the next few steps.

    Note: The PERSISTENT clause in this statement persists only the in-memory data for the table, using GemFire XD operational log files. This ensures that the operational data set remains available through cluster restarts. See Persist Tables to Disk for a tutorial showing basic GemFire XD persistence to local disk stores.
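
    If you prefer to evict matching rows periodically rather than as they arrive, replace the EVICT INCOMING clause with the EVICTION FREQUENCY clause. The following sketch shows the general shape of such a statement; the frequency syntax shown is an assumption, so confirm it against the CREATE TABLE reference before using it:
    gfxd> CREATE TABLE FLIGHTAVAILABILITY
        (
           -- same column and primary key definitions as above
        )
        PARTITION BY PRIMARY KEY
        EVICTION BY CRITERIA (FIRSTCLASS_SEATS_TAKEN = 0)
        EVICTION FREQUENCY 180 SECONDS
        PERSISTENT
        HDFSSTORE (flightstore);
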
  9. Load data into the table by running the following SQL scripts, which are installed to the quickstart subdirectory:
    gfxd> run 'loadFLIGHTAVAILABILITY1.sql';
    gfxd> run 'loadFLIGHTAVAILABILITY2.sql';
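
    The scripts consist of ordinary INSERT statements against the table schema. A single statement might look like the following sketch (the values are hypothetical, not taken from the scripts, and you do not need to run it). Note that a row inserted with firstclass_seats_taken=0 is evicted from memory immediately because of the EVICT INCOMING clause:
    gfxd> insert into flightavailability
          (flight_id, segment_number, flight_date, economy_seats_taken,
           business_seats_taken, firstclass_seats_taken)
          values ('AA1116', 1, '2014-04-18', 12, 2, 0);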
  10. In the Linux terminal window, re-run the hadoop command to list the contents of the HDFS store:
    $ hadoop fs -ls hdfs://pivhdsne:8020/flights
    Found 3 items
    drwxr-xr-x   - gpadmin hadoop          0 2014-03-29 05:37 hdfs://pivhdsne:8020/flights/.hopmeta
    drwxr-xr-x   - gpadmin hadoop          0 2014-03-29 05:38 hdfs://pivhdsne:8020/flights/APP_FLIGHTAVAILABILITY
    -rw-r--r--   3 gpadmin hadoop          8 2014-03-29 05:35 hdfs://pivhdsne:8020/flights/cleanUpInterval
    GemFire XD created a new directory, APP_FLIGHTAVAILABILITY, for the table data. Individual log files for the persisted data are stored in subdirectories of the table directory:
    $ hadoop fs -ls hdfs://pivhdsne:8020/flights/APP_FLIGHTAVAILABILITY
    Found 112 items
    drwxr-xr-x   - gpadmin hadoop          0 2014-03-29 05:38 hdfs://pivhdsne:8020/flights/APP_FLIGHTAVAILABILITY/0
    drwxr-xr-x   - gpadmin hadoop          0 2014-03-29 05:38 hdfs://pivhdsne:8020/flights/APP_FLIGHTAVAILABILITY/1
    drwxr-xr-x   - gpadmin hadoop          0 2014-03-29 05:38 hdfs://pivhdsne:8020/flights/APP_FLIGHTAVAILABILITY/10
    drwxr-xr-x   - gpadmin hadoop          0 2014-03-29 05:38 hdfs://pivhdsne:8020/flights/APP_FLIGHTAVAILABILITY/100
    drwxr-xr-x   - gpadmin hadoop          0 2014-03-29 05:38 hdfs://pivhdsne:8020/flights/APP_FLIGHTAVAILABILITY/101
    drwxr-xr-x   - gpadmin hadoop          0 2014-03-29 05:38 hdfs://pivhdsne:8020/flights/APP_FLIGHTAVAILABILITY/102
    drwxr-xr-x   - gpadmin hadoop          0 2014-03-29 05:38 hdfs://pivhdsne:8020/flights/APP_FLIGHTAVAILABILITY/103
    drwxr-xr-x   - gpadmin hadoop          0 2014-03-29 05:38 hdfs://pivhdsne:8020/flights/APP_FLIGHTAVAILABILITY/104
    drwxr-xr-x   - gpadmin hadoop          0 2014-03-29 05:38 hdfs://pivhdsne:8020/flights/APP_FLIGHTAVAILABILITY/105
    drwxr-xr-x   - gpadmin hadoop          0 2014-03-29 05:38 hdfs://pivhdsne:8020/flights/APP_FLIGHTAVAILABILITY/106
    [...]

    A subdirectory is created for each bucket of the partitioned table. See How GemFire XD Manages HDFS Data for more information about the different types of HDFS log files that GemFire XD creates.

    After GemFire XD creates the log files, they remain available in HDFS for processing with MapReduce jobs or HAWQ until you remove the files. For the type of read/write table that you created in this tutorial, GemFire XD also performs automatic compaction to combine multiple log files into a smaller number of larger log files. See Compaction for HDFS Log Files for more information.
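
    To see the individual HDFS log files, list one of the bucket subdirectories from the Linux terminal window (the file names vary from run to run):
    $ hadoop fs -ls hdfs://pivhdsne:8020/flights/APP_FLIGHTAVAILABILITY/0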

  11. In the gfxd session, execute the following queries on the new table:
    gfxd> select count(*) from flightavailability;
    1          
    -----------
    400        
    
    1 row selected
    gfxd> select count(*) from flightavailability where firstclass_seats_taken=0;
    1          
    -----------
    0          
    
    1 row selected
    By default, queries against HDFS tables access only the in-memory, operational data for the table. As you can see from the second query, there are no in-memory rows where the value of the firstclass_seats_taken column is 0. This is because, in the CREATE TABLE statement, you specified that all rows that meet this criterion should be evicted from memory. To query the full table data (both in-memory, operational data and data persisted to HDFS), you must include the queryHDFS hint in your query:
    gfxd> select count(*) from flightavailability --GEMFIREXD-PROPERTIES queryHDFS=true \n ;
    1          
    -----------
    518        
    
    1 row selected
    gfxd> select count(*) from flightavailability --GEMFIREXD-PROPERTIES queryHDFS=true \n where firstclass_seats_taken=0;
    1          
    -----------
    118        
    
    1 row selected
    
    Note: Because the queryHDFS hint is specified in a SQL comment, it must appear at the end of a line. In order to continue the SQL command after the comment, include the \n characters (or a return) after defining the hint and before terminating the SQL command. See Overriding Optimizer Choices for more information.

    With the queryHDFS hint specified, the queries now operate against the full HDFS data and return the complete data set, including rows that were evicted to HDFS.
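
    If you want every query on a connection to include HDFS data without adding the hint to each statement, you can also set queryHDFS as a connection property when you connect. The property name format shown below is an assumption; check the queryHDFS documentation for the exact spelling:
    gfxd> connect client 'localhost:1527;query-HDFS=true';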

  12. Using count(*) with queryHDFS=true is fine for the small table we just created. But executing this statement on a large table could take a very long time, because GemFire XD has to iterate over all of the table's HDFS data. For this reason, GemFire XD provides a convenience function, COUNT_ESTIMATE, to quickly return an estimate of the total number of rows for an HDFS read/write table:
    gfxd> select count(*) as operational, count_estimate('app.flightavailability') as estimate from flightavailability;
    OPERATIONAL|ESTIMATE            
    --------------------------------
    400        |513                 
    
    1 row selected
    
  13. Exit the gfxd session:
    gfxd> exit;