Pradeep Mishra

INTRODUCTION

Cloudera Data Platform (CDP) is a cloud data platform for businesses. It provides integrated, multifunctional self-service tools to centralize and analyze data, and it brings enterprise-level security and governance, all of which can be hosted on public, private, and multi-cloud deployments. CDP is the successor to Cloudera’s two previous Hadoop distributions: Cloudera Distribution of Hadoop (CDH) and Hortonworks Data Platform (HDP).

After the upgrade, you need to know where your tables are located and which property changes the upgrade process makes. You also need to perform some post-migration tasks and handle semantic changes before using Hive tables.

Understanding the major design features of Apache Hive 3, such as ACID transaction processing by default, can help you use Hive to address the growing needs of enterprise data warehouse systems. CDP comes with Hive 3, which can create ACID (atomic, consistent, isolated, and durable) tables either for unlimited transactions or for insert-only transactions. These tables are Hive managed tables. You can access ACID tables for unlimited transactions from Hive, but not from Impala.
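
The following is a minimal sketch (table and column names are hypothetical) of the two managed table flavors mentioned above; in CDP, the transactional properties shown on the first table are applied by default even if you omit them:

-- Full ACID (CRUD) managed table; ORC is required for this type
CREATE TABLE sales_acid (id INT, amount DOUBLE)
STORED AS ORC
TBLPROPERTIES ('transactional'='true');

-- Insert-only ACID managed table; any storage format, INSERT but no UPDATE/DELETE
CREATE TABLE sales_insert_only (id INT, amount DOUBLE)
STORED AS TEXTFILE
TBLPROPERTIES ('transactional'='true', 'transactional_properties'='insert_only');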

You can create an external table for non-transactional use. Because Hive has only weak control over external tables, they are not ACID compliant. The following diagram depicts the Hive table types.

The following matrix includes the types of tables you can create using Hive, whether or not ACID properties are supported, the required storage format, and the key SQL operations:

| Table Type | ACID | File Format | INSERT | UPDATE/DELETE |
|---|---|---|---|---|
| Managed: CRUD transactional | Yes | ORC | Yes | Yes |
| Managed: Insert-only transactional | Yes | Any | Yes | No |
| External | No | Any | Yes | No |

Hive Table Location

New tables that you create in CDP are stored in either the Hive warehouse for managed tables or the Hive warehouse for external tables.

The following default warehouse locations are in the HDFS file system:

  • /warehouse/tablespace/managed/hive
  • /warehouse/tablespace/external/hive
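
These defaults come from Hive configuration; as a quick sanity check (a sketch, run from Beeline or the Hive shell), you can print the configured warehouse roots:

-- Warehouse root for managed tables
SET hive.metastore.warehouse.dir;
-- Warehouse root for external tables (introduced with Hive 3)
SET hive.metastore.warehouse.external.dir;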

In CDP, Hive does not allow the LOCATION clause in a query that creates a managed table. You can use this clause to specify a location only when creating external tables. For example:

CREATE EXTERNAL TABLE my_external_table (a string, b string)  
ROW FORMAT SERDE 'com.mytables.MySerDe' 
WITH SERDEPROPERTIES ( "input.regex" = "*.csv")
LOCATION '/warehouse/tablespace/external/hive/marketing';

In CDP, Hive has been enhanced to include a MANAGEDLOCATION clause in the CREATE DATABASE statement, as shown in the following syntax:

CREATE (DATABASE|SCHEMA) [IF NOT EXISTS] database_name
  [COMMENT database_comment]
  [LOCATION external_table_path]
  [MANAGEDLOCATION managed_table_directory_path]
  [WITH DBPROPERTIES (property_name=property_value, ...)];

In the MANAGEDLOCATION clause, you specify a top-level directory for managed tables when creating a Hive database. Do not set LOCATION and MANAGEDLOCATION to the same HDFS path.
To determine whether a table is managed or external, you can run the DESCRIBE EXTENDED table_name command.
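
A minimal sketch (the database name, paths, and table name are hypothetical) that keeps the external and managed data of one database under separate top-level directories, and then checks a table's type:

CREATE DATABASE IF NOT EXISTS marketing_db
  COMMENT 'Marketing data'
  LOCATION '/warehouse/tablespace/external/hive/marketing_db.db'
  MANAGEDLOCATION '/warehouse/tablespace/managed/hive/marketing_db.db'
  WITH DBPROPERTIES ('owner'='marketing');

-- For an existing table, the Detailed Table Information section of the output
-- shows tableType as MANAGED_TABLE or EXTERNAL_TABLE
DESCRIBE EXTENDED marketing_db.campaigns;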

Hive Table Changes

To improve usability and functionality, Hive 3 significantly changed table creation in the following ways (a short sketch follows the list):

  • Creates ACID-compliant table, which is the default in CDP
  • Supports simple writes and inserts
  • Writes to multiple partitions
  • Inserts multiple data updates in a single SELECT statement
  • Eliminates the need for bucketing
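
As a sketch of these behaviors (table, column, and partition names are hypothetical), the CREATE TABLE below defaults to a full ACID table, a single INSERT ... SELECT writes to multiple partitions, and row-level updates and deletes work without any bucketing clause:

-- Defaults to a full ACID, ORC-backed managed table in CDP
CREATE TABLE page_views (user_id STRING, url STRING)
PARTITIONED BY (view_date STRING);

-- May be required before dynamic partition inserts, depending on cluster settings
SET hive.exec.dynamic.partition.mode=nonstrict;

-- One INSERT ... SELECT writing into multiple partitions
INSERT INTO page_views PARTITION (view_date)
SELECT user_id, url, view_date FROM staging_page_views;

-- Row-level updates and deletes on the ACID table, no bucketing needed
UPDATE page_views SET url = 'https://example.com/landing' WHERE user_id = 'u123';
DELETE FROM page_views WHERE view_date < '2020-01-01';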

Before Upgrade to CDP

In CDH and HDP 2.6.5, by default CREATE TABLE created a non-ACID table.

After Upgrade to CDP

In CDP, by default CREATE TABLE creates a full ACID transactional table in ORC format.
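
To confirm what the default CREATE TABLE produced (a sketch; the table name is hypothetical), inspect the table metadata:

-- Expect Table Type: MANAGED_TABLE and the table parameter transactional=true
DESCRIBE FORMATTED db_name.table_name;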

For example, I had to change the below script from

CREATE TABLE IF NOT EXISTS db_name.table_name
(key String, value String)
ROW FORMAT DELIMITED FIELDS TERMINATED BY "\t"
LINES TERMINATED BY "\n" STORED AS TEXTFILE;

To

CREATE EXTERNAL TABLE IF NOT EXISTS db_name.table_name
(key String, value String)
ROW FORMAT DELIMITED FIELDS TERMINATED BY "\t"
LINES TERMINATED BY "\n" STORED AS TEXTFILE
LOCATION '/warehouse/tablespace/external/hive/marketing'
TBLPROPERTIES ('external.table.purge'='true');

after the CDP upgrade, so that I don’t have to make more changes on the Spark side, which uses these tables. Access to Hive 3 managed (ACID) tables from Spark requires the HWC (Hive Warehouse Connector); Spark Shell, PySpark, and spark-submit are HWC-supported applications. External tables, by contrast, can be read by Spark directly.
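
A note on the external.table.purge property used above: when it is set to true, dropping the external table also deletes the table's data files, mirroring the old managed-table DROP behavior. A minimal sketch (database and table names hypothetical):

-- Enable purge-on-drop for an existing external table
ALTER TABLE db_name.table_name SET TBLPROPERTIES ('external.table.purge'='true');

-- With the property set, DROP TABLE also removes files under the table's LOCATION
DROP TABLE db_name.table_name;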

For more information, see Hive Warehouse Connector for accessing Apache Spark data.

I hope you have enjoyed this post and that it helped you understand the table structure changes that come with Hive 3 in a CDP upgrade. Please like and share, and feel free to comment if you have any suggestions or feedback.
