TheCodersStop - SQOOP : Data transfer between Hadoop and RDBMS

Apache Sqoop is a tool in Hadoop ecosystem which is designed to transfer data between HDFS (Hadoop storage) and RDBMS (relational database) servers like SQLite, Oracle, MySQL, Netezza, Teradata, Postgres etc. Apache Sqoop imports data from relational databases to HDFS, and exports data from HDFS to relational databases. It efficiently transfers bulk data between Hadoop and external data stores such as enterprise data warehouses, relational databases, etc.

This is how Sqoop got its name – “SQL to Hadoop & Hadoop to SQL”.

Additionally, Sqoop is used to import data from external datastores into Hadoop ecosystem’s tools like Hive & HBase.

An in-depth introduction to SQOOP architecture — **SQOOP Architecture**

Why Sqoop?

When the data residing in the relational database management systems need to be transferred to HDFS. The task of writing MapReduce code for importing and exporting data from the relational database to HDFS is uninteresting & tedious. This is where Apache Sqoop comes to rescue and removes their pain. It automates the process of importing & exporting the data.

Sqoop makes the life of developers easy by providing CLI for importing and exporting data. They just have to provide basic information like database authentication, source, destination, operations etc. It takes care of the remaining part.

Sqoop internally converts the command into MapReduce tasks, which are then executed over HDFS. It uses YARN framework to import and export the data, which provides fault tolerance on top of parallelism.

Mostly two sqoop commands are used by programmers which are sqoop import and export. Let’s discuss them one by one.

SQOOP IMPORT Command

Sqoop import command imports a table from an RDBMS to HDFS. In our case, we are going to import tables from MySQL databases to HDFS.

Syntax
$ sqoop import (generic-args) (import-args)
$ sqoop-import (generic-args) (import-args)

As you can see in the below image, we have student table in the mydb database which we will be importing into HDFS.

The command for importing table is:

sqoop import --connect jdbc:mysql://localhost/mydb --username root --password root --table root -m 1 --target-dir /sqoop_import_01

As you can see in the below images, after executing this command SQL statement and Map tasks will be executed at the back end.

After the command is executed, you can check the HDFS target folder (in our case sqoop_import_01) where the data is imported.

Import control arguments:

Argument	Description
`--append`	Append data to an existing dataset in HDFS
`--as-avrodatafile`	Imports data to Avro Data Files
`--as-sequencefile`	Imports data to SequenceFiles
`--as-textfile`	Imports data as plain text (default)
`--as-parquetfile`	Imports data to Parquet Files
`--boundary-query <statement>`	Boundary query to use for creating splits
`--columns <col,col,col…>`	Columns to import from table
`--delete-target-dir`	Delete the import target directory if it exists
`--direct`	Use direct connector if exists for the database
`--fetch-size <n>`	Number of entries to read from database at once.
`--inline-lob-limit <n>`	Set the maximum size for an inline LOB
`-m,--num-mappers <n>`	Use n map tasks to import in parallel
`-e,--query <statement>`	Import the results of `statement`.
`--split-by <column-name>`	Column of the table used to split work units. Cannot be used with `--autoreset-to-one-mapper` option.
`--split-limit <n>`	Upper Limit for each split size. This only applies to Integer and Date columns. For date or timestamp fields it is calculated in seconds.
`--autoreset-to-one-mapper`	Import should use one mapper if a table has no primary key and no split-by column is provided. Cannot be used with `--split-by <col>` option.
`--table <table-name>`	Table to read
`--target-dir <dir>`	HDFS destination dir
`--temporary-rootdir <dir>`	HDFS directory for temporary files created during import (overrides default “_sqoop”)
`--warehouse-dir <dir>`	HDFS parent for table destination
`--where <where clause>`	WHERE clause to use during import
`-z,--compress`	Enable compression
`--compression-codec <c>`	Use Hadoop codec (default gzip)
`--null-string <null-string>`	The string to be written for a null value for string columns
`--null-non-string <null-string>`	The string to be written for a null value for non-string columns

Now, try to increase the no. of map task from 1 to 2 in the query to run it in 2 parallel threads and let’s see the result.

sqoop import --connect jdbc:mysql://localhost/mydb --username root --password root --table root -m 2 --target-dir /sqoop_import_02

So, as you can see, if more than 1 map task (-m 2) is used in the import command than table should have a primary key or specify –split-by in your command.

There are lots of other import arguments available which you can use as per your requirement.

SQOOP Export Command

The sqoop export command exports a set of files from HDFS back to an RDBMS. The target table must already exist in the database. The input files are read and parsed into a set of records according to the user-specified delimiters.

The default operation is to transform these into a set of INSERT statements that inject the records into the database. In “update mode,” Sqoop will generate UPDATE statements that replace existing records in the database, and in “call mode” Sqoop will make a stored procedure call for each record.

Syntax
$ sqoop export (generic-args) (export-args) 
$ sqoop-export (generic-args) (export-args)

So, first we are creating an empty table, where we will export our data.

Creating Table for Sqoop Export - Apache Sqoop Tutorial - Edureka

The command to export data from HDFS to the relational database is:

sqoop export --connect jdbc:mysql://localhost/mydb --username root --password root --table emp --export-dir /sqoop_export_01

Data in Table after Sqoop Export - Apache Sqoop Tutorial - Edureka

Export control arguments:

Argument	Description
`--columns <col,col,col…>`	Columns to export to table
`--direct`	Use direct export fast path
`--export-dir <dir>`	HDFS source path for the export
`-m,--num-mappers <n>`	Use n map tasks to export in parallel
`--table <table-name>`	Table to populate
`--call <stored-proc-name>`	Stored Procedure to call
`--update-key <col-name>`	Anchor column to use for updates. Use a comma separated list of columns if there are more than one column.
`--update-mode <mode>`	Specify how updates are performed when new rows are found with non-matching keys in database.
	Legal values for `mode` include `updateonly` (default) and `allowinsert`.
`--input-null-string <null-string>`	The string to be interpreted as null for string columns
`--input-null-non-string <null-string>`	The string to be interpreted as null for non-string columns
`--staging-table <staging-table-name>`	The table in which data will be staged before being inserted into the destination table.
`--clear-staging-table`	Indicates that any data present in the staging table can be deleted.
`--batch`	Use batch mode for underlying statement execution.

To import or export, the order of columns in both MySQL and Hive should be the same.

I hope you have enjoyed this post and it helped you to understand in sqoop import and export command. Please like and share and feel free to comment if you have any suggestions or feedback.

Pradeep Mishra

April 4, 2021

Argument	Description
`--append`	Append data to an existing dataset in HDFS
`--as-avrodatafile`	Imports data to Avro Data Files
`--as-sequencefile`	Imports data to SequenceFiles
`--as-textfile`	Imports data as plain text (default)
`--as-parquetfile`	Imports data to Parquet Files
`--boundary-query <statement>`	Boundary query to use for creating splits
`--columns <col,col,col…>`	Columns to import from table
`--delete-target-dir`	Delete the import target directory if it exists
`--direct`	Use direct connector if exists for the database
`--fetch-size <n>`	Number of entries to read from database at once.
`--inline-lob-limit <n>`	Set the maximum size for an inline LOB
`-m,--num-mappers <n>`	Use n map tasks to import in parallel
`-e,--query <statement>`	Import the results of `statement`.
`--split-by <column-name>`	Column of the table used to split work units. Cannot be used with `--autoreset-to-one-mapper` option.
`--split-limit <n>`	Upper Limit for each split size. This only applies to Integer and Date columns. For date or timestamp fields it is calculated in seconds.
`--autoreset-to-one-mapper`	Import should use one mapper if a table has no primary key and no split-by column is provided. Cannot be used with `--split-by <col>` option.
`--table <table-name>`	Table to read
`--target-dir <dir>`	HDFS destination dir
`--temporary-rootdir <dir>`	HDFS directory for temporary files created during import (overrides default “_sqoop”)
`--warehouse-dir <dir>`	HDFS parent for table destination
`--where <where clause>`	WHERE clause to use during import
`-z,--compress`	Enable compression
`--compression-codec <c>`	Use Hadoop codec (default gzip)
`--null-string <null-string>`	The string to be written for a null value for string columns
`--null-non-string <null-string>`	The string to be written for a null value for non-string columns

Argument	Description
`--columns <col,col,col…>`	Columns to export to table
`--direct`	Use direct export fast path
`--export-dir <dir>`	HDFS source path for the export
`-m,--num-mappers <n>`	Use n map tasks to export in parallel
`--table <table-name>`	Table to populate
`--call <stored-proc-name>`	Stored Procedure to call
`--update-key <col-name>`	Anchor column to use for updates. Use a comma separated list of columns if there are more than one column.
`--update-mode <mode>`	Specify how updates are performed when new rows are found with non-matching keys in database.
	Legal values for `mode` include `updateonly` (default) and `allowinsert`.
`--input-null-string <null-string>`	The string to be interpreted as null for string columns
`--input-null-non-string <null-string>`	The string to be interpreted as null for non-string columns
`--staging-table <staging-table-name>`	The table in which data will be staged before being inserted into the destination table.
`--clear-staging-table`	Indicates that any data present in the staging table can be deleted.
`--batch`	Use batch mode for underlying statement execution.

Generate Sequential and Unique IDs in a Spark Dataframe

Simple and Powerful ReverseProxy in Go

◀ Previous Post

Next Post ▶

by Pradeep Mishra

February 9, 2022

Apache Hive 3 Changes in CDP Upgrade: Part-2

In continuation of the previous blog Part-1, I am going to mention some other Hive changes that have been introduced in the Hadoop CDP (Cloudera Data Platform) upgrade. Beeline Hive. read more…

February 7, 2022

Apache Hive 3 Changes in CDP Upgrade: Part-1

INTRODUCTION Cloudera Data Platform (CDP) is a cloud computing platform for businesses. It provides integrated and multifunctional self-service tools in order to analyze and centralize data. It brings security and governance. read more…

SQOOP : Data transfer between Hadoop and RDBMS

Pradeep Mishra

Share post:

Why Sqoop?

SQOOP IMPORT Command

SQOOP Export Command

Like this:

Apache Hive 3 Changes in CDP Upgrade: Part-2

Like this:

Apache Hive 3 Changes in CDP Upgrade: Part-1

Like this:

Leave a ReplyCancel reply

Recent Posts

Categories

Recent Posts

Categories

Find Us

SQOOP : Data transfer between Hadoop and RDBMS

Pradeep Mishra

Share post:

Why Sqoop?

SQOOP IMPORT Command

SQOOP Export Command

Share this:

Like this:

Apache Hive 3 Changes in CDP Upgrade: Part-2

Share this:

Like this:

Apache Hive 3 Changes in CDP Upgrade: Part-1

Share this:

Like this:

Leave a ReplyCancel reply

Recent Posts

Categories

Tags

Recent Posts

Categories

Tags

Find Us