Getting Started with GemFire XD / QuickStart Tutorials |
Up to this point, the tutorial clusters have managed all table primary keys and indexes in memory, while persisting table data to local GemFire XD disk store files. In this tutorial, you will create a table that persists data to HDFS log files.
Prerequisites
If you are not using the virtual machine, ensure that you have followed the installation instructions to copy the necessary Pivotal HD library files to the /ext-lib subdirectory of your GemFire XD directory, and that the user running your local GemFire XD members has permission to access HDFS nodes. See Installing GemFire XD and Configuration Requirements for Secure HDFS for more information.
$ ~/Desktop/start_gfxd.sh
$ hadoop fs -ls hdfs://pivhdsne:8020/ Found 7 items drwxr-xr-x - hdfs hadoop 0 2014-03-10 18:39 hdfs://pivhdsne:8020/apps drwxr-xr-x - gpadmin hadoop 0 2014-03-10 18:44 hdfs://pivhdsne:8020/hawq_data drwxr-xr-x - hdfs hadoop 0 2014-03-10 18:40 hdfs://pivhdsne:8020/hive drwxr-xr-x - mapred hadoop 0 2014-03-10 18:38 hdfs://pivhdsne:8020/mapred drwxrwxrwx - hdfs hadoop 0 2014-03-10 18:38 hdfs://pivhdsne:8020/tmp drwxrwxrwx - hdfs hadoop 0 2014-03-10 18:45 hdfs://pivhdsne:8020/user drwxr-xr-x - hdfs hadoop 0 2014-03-10 18:39 hdfs://pivhdsne:8020/yarn
(If you are not using the Pivotal HD Single Node VM, substitute the correct URL for your HDFS filesystem browser.) Keep this second terminal window open in order to view results in HDFS.
$ cd /usr/lib/gphd/gfxd/quickstart $ gfxd gfxd version 1.4.0 gfxd> connect client 'localhost:1527';
gfxd> create hdfsstore flightstore namenode 'hdfs://pivhdsne:8020' homedir '/flights'; 0 rows inserted/updated/deleted
$ hadoop fs -ls hdfs://pivhdsne:8020/flights Found 2 items drwxr-xr-x - gpadmin hadoop 0 2014-03-29 05:35 hdfs://pivhdsne:8020/flights/.hopmeta -rw-r--r-- 3 gpadmin hadoop 8 2014-03-29 05:35 hdfs://pivhdsne:8020/flights/cleanUpInterval
Because no tables are using the HDFS store yet, there's only a single directory available for storing metadata. The metadata at this point consists only of the DDL command that used to create the HDFS store itself. As you create tables that use the HDFS store, DDL for those tables is added to the metadata directory, and is used by MapReduce jobs to understand the schema of the table data stored in HDFS log files.
gfxd> drop table if exists app.flightavailability; 0 rows inserted/updated/deleted
gfxd> CREATE TABLE FLIGHTAVAILABILITY ( FLIGHT_ID VARCHAR(6) NOT NULL , SEGMENT_NUMBER INTEGER NOT NULL , FLIGHT_DATE DATE NOT NULL , ECONOMY_SEATS_TAKEN INTEGER DEFAULT 0, BUSINESS_SEATS_TAKEN INTEGER DEFAULT 0, FIRSTCLASS_SEATS_TAKEN INTEGER DEFAULT 0, PRIMARY KEY (FLIGHT_ID, SEGMENT_NUMBER, FLIGHT_DATE) ) PARTITION BY PRIMARY KEY eviction by criteria (firstclass_seats_taken=0) evict incoming persistent hdfsstore (flightstore); 0 rows inserted/updated/deleted
In this create table command, the HDFSSTORE clause specifies the HDFS store that GemFire XD will use to perist data for the table. The EVICTION BY CRITERIA clause defines the in-memory, operational data set for the table. (As an alternative, you can configure the table to evict matching data periodically, using the EVICTION FREQUENCY clause.) Each time a new row is inserted or updated with the column value firstclass_seats_taken=0, GemFire XD evicts that row from memory. The row remains available in HDFS for processing historical data using MapReduce jobs or using HAWQ. You'll see the results of this eviction behavior in the next few steps.
If you do not include the PERSISTENT clause then no in-memory data for the table is restored. Although that data still resides in HDFS log files, GemFire XD does not provide a mechanism for restoring the data to heap or off-heap memory.
gfxd> run 'loadFLIGHTAVAILABILITY1.sql'; gfxd> run 'loadFLIGHTAVAILABILITY2.sql';
$ hadoop fs -ls hdfs://pivhdsne:8020/flights Found 3 items drwxr-xr-x - gpadmin hadoop 0 2014-03-29 05:37 hdfs://pivhdsne:8020/flights/.hopmeta drwxr-xr-x - gpadmin hadoop 0 2014-03-29 05:38 hdfs://pivhdsne:8020/flights/APP_FLIGHTAVAILABILITY -rw-r--r-- 3 gpadmin hadoop 8 2014-03-29 05:35 hdfs://pivhdsne:8020/flights/cleanUpInterval
$ hadoop fs -ls hdfs://pivhdsne:8020/flights/APP_FLIGHTAVAILABILITY Found 112 items drwxr-xr-x - gpadmin hadoop 0 2014-03-29 05:38 hdfs://pivhdsne:8020/flights/APP_FLIGHTAVAILABILITY/0 drwxr-xr-x - gpadmin hadoop 0 2014-03-29 05:38 hdfs://pivhdsne:8020/flights/APP_FLIGHTAVAILABILITY/1 drwxr-xr-x - gpadmin hadoop 0 2014-03-29 05:38 hdfs://pivhdsne:8020/flights/APP_FLIGHTAVAILABILITY/10 drwxr-xr-x - gpadmin hadoop 0 2014-03-29 05:38 hdfs://pivhdsne:8020/flights/APP_FLIGHTAVAILABILITY/100 drwxr-xr-x - gpadmin hadoop 0 2014-03-29 05:38 hdfs://pivhdsne:8020/flights/APP_FLIGHTAVAILABILITY/101 drwxr-xr-x - gpadmin hadoop 0 2014-03-29 05:38 hdfs://pivhdsne:8020/flights/APP_FLIGHTAVAILABILITY/102 drwxr-xr-x - gpadmin hadoop 0 2014-03-29 05:38 hdfs://pivhdsne:8020/flights/APP_FLIGHTAVAILABILITY/103 drwxr-xr-x - gpadmin hadoop 0 2014-03-29 05:38 hdfs://pivhdsne:8020/flights/APP_FLIGHTAVAILABILITY/104 drwxr-xr-x - gpadmin hadoop 0 2014-03-29 05:38 hdfs://pivhdsne:8020/flights/APP_FLIGHTAVAILABILITY/105 drwxr-xr-x - gpadmin hadoop 0 2014-03-29 05:38 hdfs://pivhdsne:8020/flights/APP_FLIGHTAVAILABILITY/106 [...]
A subdirectory is created for each bucket of the partitioned table. See How GemFire XD Manages HDFS Data for more information abou the different types of HDFS log files that GemFire XD creates.
After GemFire XD creates the log files, they remain available in HDFS for processing with MapReduce jobs or HAWQ until you remove the files. For the type of read/write table that you created in this tutorial, GemFire XD also performs automatic compaction to combine multiple log files into a smaller number of larger log files. See Compaction for HDFS Log Files for more information.
gfxd> select count(*) from flightavailability; 1 ----------- 400 1 row selected gfxd> select count(*) from flightavailability where firstclass_seats_taken=0; 1 ----------- 0 1 row selected
gfxd> select count(*) from flightavailability --GEMFIREXD-PROPERTIES queryHDFS=true \n ; 1 ----------- 518 1 row selected gfxd> select count(*) from flightavailability --GEMFIREXD-PROPERTIES queryHDFS=true \n where firstclass_seats_taken=0; 1 ----------- 118 1 row selected
With the queryHDFS hint specified, the queries now operate against the full HDFS data and return the complete data set, including rows that were evicted to HDFS.
gfxd> select count(*) as operational, count_estimate('app.flightavailability') as estimate from flightavailability; OPERATIONAL|ESTIMATE -------------------------------- 400 |513 1 row selected
gfxd> exit;