Query HDFS Tables with HAWQ

GemFire XD supports a PXF driver to enable you to query HDFS table data using HAWQ. (The PXF driver is installed with HAWQ.) In this tutorial, you will map the GemFire XD table that you created in the previous tutorial as a HAWQ external table and use HAWQ to query the data persisted in Hadoop.

Prerequisites

This tutorial requires that you first complete the previous tutorial, Persist Tables to Hadoop.

Procedure
  1. Verify that you have created the necessary HDFS log files for the previous tutorial:
    $ hadoop fs -ls hdfs://pivhdsne:8020/flights/APP_FLIGHTAVAILABILITY
    Found 112 items
    drwxr-xr-x   - gpadmin hadoop          0 2014-03-29 05:38 hdfs://pivhdsne:8020/flights/APP_FLIGHTAVAILABILITY/0
    drwxr-xr-x   - gpadmin hadoop          0 2014-03-29 05:38 hdfs://pivhdsne:8020/flights/APP_FLIGHTAVAILABILITY/1
    drwxr-xr-x   - gpadmin hadoop          0 2014-03-29 05:38 hdfs://pivhdsne:8020/flights/APP_FLIGHTAVAILABILITY/10
    drwxr-xr-x   - gpadmin hadoop          0 2014-03-29 05:38 hdfs://pivhdsne:8020/flights/APP_FLIGHTAVAILABILITY/100
    drwxr-xr-x   - gpadmin hadoop          0 2014-03-29 05:38 hdfs://pivhdsne:8020/flights/APP_FLIGHTAVAILABILITY/101
    drwxr-xr-x   - gpadmin hadoop          0 2014-03-29 05:38 hdfs://pivhdsne:8020/flights/APP_FLIGHTAVAILABILITY/102
    drwxr-xr-x   - gpadmin hadoop          0 2014-03-29 05:38 hdfs://pivhdsne:8020/flights/APP_FLIGHTAVAILABILITY/103
    drwxr-xr-x   - gpadmin hadoop          0 2014-03-29 05:38 hdfs://pivhdsne:8020/flights/APP_FLIGHTAVAILABILITY/104
    drwxr-xr-x   - gpadmin hadoop          0 2014-03-29 05:38 hdfs://pivhdsne:8020/flights/APP_FLIGHTAVAILABILITY/105
    drwxr-xr-x   - gpadmin hadoop          0 2014-03-29 05:38 hdfs://pivhdsne:8020/flights/APP_FLIGHTAVAILABILITY/106
    [...]

    Complete the previous tutorial if no table data exists.

  2. Open a new terminal window and start the HAWQ interactive command prompt:
    $ psql
    psql (8.2.15)
    Type "help" for help.
    
    gpadmin=# 
  3. Map the GemFire XD HDFS table data as a HAWQ external table using the command:
    gpadmin=# create external table hawq_flightavailability 
      (flight_id varchar(6), segment_number integer, firstclass_seats_taken integer) 
      location ('pxf://pivhdsne:50070/flights/APP.FLIGHTAVAILABILITY?PROFILE=GemFireXD&CHECKPOINT=false') 
      FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');
    CREATE EXTERNAL TABLE

    The CREATE EXTERNAL TABLE command maps only two of the available columns from FLIGHTAVAILABILITY into HAWQ. As a best practice, you should only map the column values that you intend to query in the external table, to limit the amount data transferred between HDFS and HAWQ. Also, keep in mind that only certain column data types can be mapped between GemFire XD and HAWQ. See Mapping HDFS Tables in HAWQ for more information.

    The &CHECKPOINT=false option indicates that the HAWQ table will query information from all of the table's available HDFS log files. Setting &CHECKPOINT=true or omitting the option entirely would result in querying only the latest available values for the table rows. This would require GemFire XD to first perform a major compaction operation on the available log files to remove intermediate table operations and create compacted log files containing a snapshot of column values. See Compaction for HDFS Log Files for more information.

  4. While still in the HAWQ interactive command prompt, execute a query against the external table:
    gpadmin=# select count(*) from hawq_flightavailability;
     count 
    -------
       518
    (1 row)

    Notice that HAWQ queries the full HDFS data for the table. Whereas GemFire XD tables only query the in-memory, operational data by default, HAWQ external tables query only the persisted HDFS data. You limit the amount of data provided in the table by defining only those columns you want to query, and by defining the range of HDFS log files to include in the external table definition.

  5. HAWQ external tables can also be queried when the GemFire XD distributed system is unavailable. In a terminal window, shut down the GemFire XD cluster using the commands:
    $ gfxd shut-down-all -locators=localhost[10101]
    Connecting to distributed system: locators=localhost[10101]
    Successfully shut down 2 members
    $ gfxd locator stop -dir=/data/gfxd/locator
    The GemFireXD Locator has stopped.
  6. Execute another query against the table using the HAWQ interactive command prompt:
    gpadmin=# select count(*) from hawq_flightavailability where firstclass_seats_taken <> 0;
     count 
    -------
       400
    (1 row)
    

    Because the external table uses only the persistent HDFS data, a connection to the GemFire XD system is not needed for querying historical data.

  7. Exit the HAWQ interactive command prompt:
    gpadmin=# \q
    [pivhdsne:quickstart]$ 
  8. To stop the remaining processes in the virtual machine, execute the script:
    $ ~/Desktop/stop_gfxd.sh
    Note: Because you shut down the GemFire XD distributed system earlier in this tutorial, the stop_gfxd.sh script will display some errors when trying to connect to and shut down the running system.