Thursday 17 December 2015

Working with Sqoop


  • Used to import data from traditional RDBMS to HDFS/Hive/HBase etc and vice-versa
  • Best approach for filtering :
    • Run Query in RDBMS -> Create a temp table there -> Import this temp table using Sqoop.
  • Password passed in Sqoop Query?
    • Use -P : Prompts user to enter password.
    • Save password in a file -> in query mention: -- password-file
  • Default, outputs CSV file in HDFS after import:
    • Avro support : -- as-avrodatafile
    • SequenceFile : -- as-sequencefile
  • Compression Support :
    • --compress --compression-codec..............
    • Splittable : Bzip2, LZO
    • Not Splittable: GZip,Snappy
  • For faster transfer:
    • --direct : Supported for MySql, PostGreSql
  • -- map-column-java col1=String,col2=Float  (Change Col type while importing from RDBMS)
  • CSV output file, does not handle BLANK values well, so.
    • If colType = VARCHAR,CHAR,NCHAR,TEXT 
      • --null-string '\\N'
    • If any other colType
      • --null-non-string '\\N'
  • Import all tables from a DB? 
    • sqoop import-all-tables 
    • Tables imported in sequential order
    • option of --exclude-tables
    • Cannot use --target-dir instead --warehouse-dir is fine.
  • Incremental in Sqoop:
    • When getting new rows and existing is not change:
      • Use --incemental append --check-column id --last-value 0
    • When data is changed:
      • Use --incremental lastmodified --check-column  --last-value
  • Create Sqoop job for automatic pickup of last-value:
    • sqoop job --create name_of_job --import --connect...............
    • sqoop job --list
    • sqoop job --exec name_of_job
      • Sqoop will searialise last imported value back to metastore after each successful incremental job
  • Use boundary query for optimization:
    • --username sqoop --password sqoop --query 'SELECT normcities.id, countries.country, normcities.city FROM normcities JOIN countries USING(country_id) WHERE $CONDITIONS' --split-by id --target-dir cities --boundary-query "select min(id), max(id) from normcities"

No comments:

Post a Comment