SQOOP : Data transfer between Hadoop and RDBMS

Apache Sqoop is a tool in Hadoop ecosystem which is designed to transfer data between HDFS (Hadoop storage) and RDBMS (relational database) servers like SQLite, Oracle, MySQL, Netezza, Teradata, Postgres etc. Apache Sqoop imports data from relational databases to HDFS, and exports data from HDFS to relational databases. It efficiently transfers bulk data between Hadoop and external data stores such as enterprise data warehouses, relational databases, etc.

This is how Sqoop got its name – “SQL to Hadoop & Hadoop to SQL”.

Additionally, Sqoop is used to import data from external datastores into Hadoop ecosystem’s tools like Hive & HBase.

An in-depth introduction to SQOOP architecture
SQOOP Architecture
Why Sqoop?

When the data residing in the relational database management systems need to be transferred to HDFS. The task of writing MapReduce code for importing and exporting data from the relational database to HDFS is uninteresting & tedious. This is where Apache Sqoop comes to rescue and removes their pain. It automates the process of importing & exporting the data.

Sqoop makes the life of developers easy by providing CLI for importing and exporting data. They just have to provide basic information like database authentication, source, destination, operations etc. It takes care of the remaining part.

Sqoop internally converts the command into MapReduce tasks, which are then executed over HDFS. It uses YARN framework to import and export the data, which provides fault tolerance on top of parallelism.

Mostly two sqoop commands are used by programmers which are sqoop import and export. Let’s discuss them one by one.

SQOOP IMPORT Command

Sqoop import command imports a table from an RDBMS to HDFS. In our case, we are going to import tables from MySQL databases to HDFS.

Syntax
$ sqoop import (generic-args) (import-args)
$ sqoop-import (generic-args) (import-args)

As you can see in the below image, we have student table in the mydb database which we will be importing into HDFS.  

The command for importing table is:

sqoop import --connect jdbc:mysql://localhost/mydb --username root --password root --table root -m 1 --target-dir /sqoop_import_01

As you can see in the below images,  after executing this command SQL statement and Map tasks will be executed at the back end. 

After the command is executed, you can check the HDFS target folder (in our case sqoop_import_01) where the data is imported.

Import control arguments:

ArgumentDescription
--appendAppend data to an existing dataset in HDFS
--as-avrodatafileImports data to Avro Data Files
--as-sequencefileImports data to SequenceFiles
--as-textfileImports data as plain text (default)
--as-parquetfileImports data to Parquet Files
--boundary-query <statement>Boundary query to use for creating splits
--columns <col,col,col…>Columns to import from table
--delete-target-dirDelete the import target directory if it exists
--directUse direct connector if exists for the database
--fetch-size <n>Number of entries to read from database at once.
--inline-lob-limit <n>Set the maximum size for an inline LOB
-m,--num-mappers <n>Use n map tasks to import in parallel
-e,--query <statement>Import the results of statement.
--split-by <column-name>Column of the table used to split work units. Cannot be used with --autoreset-to-one-mapper option.
--split-limit <n>Upper Limit for each split size. This only applies to Integer and Date columns. For date or timestamp fields it is calculated in seconds.
--autoreset-to-one-mapperImport should use one mapper if a table has no primary key and no split-by column is provided. Cannot be used with --split-by <col> option.
--table <table-name>Table to read
--target-dir <dir>HDFS destination dir
--temporary-rootdir <dir>HDFS directory for temporary files created during import (overrides default “_sqoop”)
--warehouse-dir <dir>HDFS parent for table destination
--where <where clause>WHERE clause to use during import
-z,--compressEnable compression
--compression-codec <c>Use Hadoop codec (default gzip)
--null-string <null-string>The string to be written for a null value for string columns
--null-non-string <null-string>The string to be written for a null value for non-string columns

Now, try to increase the no. of map task from 1 to 2 in the query to run it in 2 parallel threads and let’s see the result.

sqoop import --connect jdbc:mysql://localhost/mydb --username root --password root --table root -m 2 --target-dir /sqoop_import_02

So, as you can see, if more than 1 map task (-m 2) is used in the import command than table should have a primary key or specify –split-by in your command.

There are lots of other import arguments available which you can use as per your requirement.

SQOOP Export Command

The sqoop export command exports a set of files from HDFS back to an RDBMS. The target table must already exist in the database. The input files are read and parsed into a set of records according to the user-specified delimiters.

The default operation is to transform these into a set of INSERT statements that inject the records into the database. In “update mode,” Sqoop will generate UPDATE statements that replace existing records in the database, and in “call mode” Sqoop will make a stored procedure call for each record.

Syntax
$ sqoop export (generic-args) (export-args) 
$ sqoop-export (generic-args) (export-args)

So, first we are creating an empty table, where we will export our data.

Creating Table for Sqoop Export - Apache Sqoop Tutorial - Edureka

The command to export data from HDFS to the relational database is:

sqoop export --connect jdbc:mysql://localhost/mydb --username root --password root --table emp --export-dir /sqoop_export_01
Data in Table after Sqoop Export - Apache Sqoop Tutorial - Edureka

Export control arguments:

ArgumentDescription
--columns <col,col,col…>Columns to export to table
--directUse direct export fast path
--export-dir <dir>HDFS source path for the export
-m,--num-mappers <n>Use n map tasks to export in parallel
--table <table-name>Table to populate
--call <stored-proc-name>Stored Procedure to call
--update-key <col-name>Anchor column to use for updates. Use a comma separated list of columns if there are more than one column.
--update-mode <mode>Specify how updates are performed when new rows are found with non-matching keys in database.
Legal values for mode include updateonly (default) and allowinsert.
--input-null-string <null-string>The string to be interpreted as null for string columns
--input-null-non-string <null-string>The string to be interpreted as null for non-string columns
--staging-table <staging-table-name>The table in which data will be staged before being inserted into the destination table.
--clear-staging-tableIndicates that any data present in the staging table can be deleted.
--batchUse batch mode for underlying statement execution.

To import or export, the order of columns in both MySQL and Hive should be the same.

I hope you have enjoyed this post and it helped you to understand in sqoop import and export command. Please like and share and feel free to comment if you have any suggestions or feedback.

Tagged : / / / / / /

Leave a Reply

%d bloggers like this: