Thursday 18 February 2016

Create an identity column in Hive for dumping into HBase


  • We need to write a UDF for this.

package org.apache.hadoop.hive.contrib.udf;

import org.apache.hadoop.hive.ql.exec.Description;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.hive.ql.udf.UDFType;
import org.apache.hadoop.io.LongWritable;

@Description(name = "row_sequence",
    value = "_FUNC_() - Returns a generated row sequence number starting from 1")
// Not deterministic (every call returns a new value) and stateful
// (the counter persists across rows within a task).
@UDFType(deterministic = false, stateful = true)
public class UDFRowSequence extends UDF
{
  private LongWritable result = new LongWritable();

  public UDFRowSequence() {
    result.set(0);
  }

  // Increment and return the per-task counter on every call.
  public LongWritable evaluate() {
    result.set(result.get() + 1);
    return result;
  }
}


  • Compile the class and package it into a jar, as sketched below.
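
A minimal build sketch, assuming the source file sits in the current directory and $HIVE_HOME points at the Hive install (the output name customhiveudf.jar is just a placeholder):

# Compile against the Hadoop and Hive classes.
mkdir -p build
javac -cp "$(hadoop classpath):$HIVE_HOME/lib/*" -d build UDFRowSequence.java
# Package the compiled class (with its package directories) into a jar.
jar cf customhiveudf.jar -C build .
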
  • Create a Hive-managed HBase table, with id as the column that will hold the generated sequence.

create table ga_tblagetest (
  id String, userAgeBracket String, date String, sessions String,
  percentNewSessions String, newUsers String, bounceRate String,
  pageviewsPerSession String, avgSessionDuration String)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" =
  ":key,cf1:userAgeBracket,cf1:date,cf1:sessions,cf1:percentNewSessions,cf1:newUsers,cf1:bounceRate,cf1:pageviewsPerSession,cf1:avgSessionDuration")
TBLPROPERTIES ("hbase.table.name" = "ga_tblagetest");
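
Once this runs, the storage handler should have created the backing HBase table as well; a quick sanity check from the HBase shell:

# Confirm the backing HBase table exists.
echo "exists 'ga_tblagetest'" | hbase shell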


  • In Hive, add the jar: ADD JAR /path/customhiveudf.jar
  • create temporary function incrementrow as 'org.apache.hadoop.hive.contrib.udf.UDFRowSequence';
  • CREATE TEMPORARY TABLE default.tmp2 AS SELECT incrementrow(),* FROM socialmedia.tblage;
  • INSERT OVERWRITE TABLE ga_tblagetest SELECT * FROM default.tmp2;
Once the insert finishes, the data can be queried by row key in HBase. One caveat: the sequence is only unique within a single task, so the insert should run in a single mapper, otherwise duplicate ids are possible.
  • scan 'ga_tblagetest'

Friday 12 February 2016

Script for getting the name and row count of every Hive table into a text file

tables=$(hive -e 'use oxi_consolidated; show tables;')
for t in $tables; do
  hive -e "select '$t', count(*) from oxi_consolidated.$t"
done > countoxi.txt

Execute all .hql files inside a directory:

oxi_array=( $(find /home/oxigen/HiveStartup/ -type f -name '*.hql' -follow) )
for element in "${oxi_array[@]}"
do
  hive -f "${element}"
done
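
If the .hql files all sit at the top level of the directory (and the names contain no spaces), a plain glob does the same job:

# Run every .hql file in the directory through the Hive CLI.
for f in /home/oxigen/HiveStartup/*.hql; do
  hive -f "$f"
done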

Sunday 7 February 2016

Connect a Linux HDInsight cluster to Power BI/Tableau


  1. Download and install the Microsoft Hive ODBC Driver:
     https://www.microsoft.com/en-us/download/confirmation.aspx?id=40886
  2. Open the ODBC Data Source Administrator and add a new Hive DSN.
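
When creating the DSN, the fields typically look like the following; treat these as assumptions for a generic Linux HDInsight cluster, not values from the original setup:

Host:              <clustername>.azurehdinsight.net
Port:              443
Hive Server Type:  Hive Server 2
Authentication:    Windows Azure HDInsight Service (the cluster's HTTP/admin credentials)
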
  3. Connected!



Thursday 4 February 2016

Shell script for automating the Google Analytics data load

todaysdate=$(date +"%Y-%m-%d")
path="/home/brillio/GoogleAnalytics/CSVData/"
dir="/home/brillio/GoogleAnalytics"
jarname="googleanalytics"
hdfspath="/socialmedia/googleanalytics/"

# Clear out the previous run's CSVs, if any.
if test -d "$path"; then
  sudo rm -r "$path"*
else
  echo "FileNotExist"
fi

# Run the GA extractor and make its output readable.
java -jar "$jarname.jar"
sudo chmod -R 777 "$dir"

tablename=( "tblAge" "tblBehaviourAndNetwork" "tblGender" "tblGeo" "tblTechnologyBrowser" "tblTechnologyOperatingSystem" "tblLanguage" )
for table in "${tablename[@]}"
do
  # Remove today's file from HDFS if it already exists, then copy in the fresh one.
  if hadoop fs -test -f "$hdfspath$table/${table}_$todaysdate.csv"; then
    hadoop fs -rm -r "$hdfspath$table/${table}_$todaysdate.csv"
  else
    echo "FileNotExist"
  fi
  hadoop fs -copyFromLocal "$path$table"_* "$hdfspath$table/"
done
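
To run this automatically, it can be scheduled with cron; a sketch, assuming the script above is saved as /home/brillio/GoogleAnalytics/ga_load.sh (a hypothetical path):

# Run the GA load daily at 01:30, appending output to a log.
30 1 * * * /home/brillio/GoogleAnalytics/ga_load.sh >> /home/brillio/ga_load.log 2>&1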

Tuesday 2 February 2016

Monitor Spark Processes (list running Spark applications)

If you are running Spark on a YARN cluster, you can list the running Spark applications:

yarn application -list
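
To narrow the output, -appStates filters by state and grep by name, e.g. to see only running Spark applications:

yarn application -list -appStates RUNNING | grep -i spark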

To kill a running application (e.g. a Spark Streaming process):

yarn application -kill <application_id>


Copy startup files from Azure Blob to local every time the cluster is created, using PowerShell

$servername = "xxxxxxxxxxxxxxxxxxxxx"
$username = "xxxxxxxxx"
$password = "xxxxxxxxxxxxx"
$port = 22
$foldername = @("HueScripts","completereload")

# Open SSH sessions to the cluster nodes (SSH-Sessions module).
Import-Module SSH-Sessions
New-SshSession -ComputerName $servername -Username $username -Password $password -Port $port

foreach ($f in $foldername) {
   # Remove any stale copy of the folder, then recreate it with open permissions.
   $command1 = "test -d `"/home/brillio/$f`" && sudo rm -R /home/brillio/$f || echo Does Not Exist Proceeding..."
   $SshResults1 = Invoke-SshCommand -InvokeOnAll -Command $command1
   $command2 = "sudo mkdir /home/brillio/$f"
   $SshResults2 = Invoke-SshCommand -InvokeOnAll -Command $command2
   $command3 = "sudo chmod -R 777 /home/brillio/$f"
   $SshResults3 = Invoke-SshCommand -InvokeOnAll -Command $command3
   # Copy the folder's contents down from blob storage, then open permissions on the result.
   $command4 = "hadoop fs -copyToLocal wasb://xxxxxxx@xxxxxxxxxxx/$f/* /home/brillio/$f/"
   $SshResults4 = Invoke-SshCommand -InvokeOnAll -Command $command4
   $command5 = "sudo chmod -R 777 /home/brillio/$f"
   $SshResults5 = Invoke-SshCommand -InvokeOnAll -Command $command5
   Write-Host "Done $f"
}

# Run the startup script, then close all SSH sessions.
$command6 = "/home/brillio/hueurl.sh"
$SshResults6 = Invoke-SshCommand -InvokeOnAll -Command $command6

Remove-SshSession -RemoveAll

How to pass a parameter in Hive HQL

Hive substitutes ${db} from the variable set with --define; the single quotes keep the shell from expanding it first.

hive -e 'create database ${db}' --define db='twitter_stream'


OR

If you are calling the Hive HQL from a shell script:

hive -e 'create database ${db}' --define db=$1
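
The same substitution works when the HQL lives in a file; here createdb.hql is a hypothetical script containing create database ${db};

hive -f createdb.hql --define db=$1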

Monday 1 February 2016

Capture Sqoop Logs


To send output only to the log file and not to the console:

/home/brillio/sqoop-jobs/test.sh > /home/brillio/logsqoop.txt 2>&1

To send output to both the console and the log file:

/home/brillio/sqoop-jobs/test.sh 2>&1 | tee /home/brillio/logsqoop.txt
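
To append to the log across runs instead of overwriting it, use tee -a:

/home/brillio/sqoop-jobs/test.sh 2>&1 | tee -a /home/brillio/logsqoop.txt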