Thursday 18 February 2016

Create an identity column in Hive for dumping into HBase


  • We need to write a UDF for this.

package org.apache.hadoop.hive.contrib.udf;

import org.apache.hadoop.hive.ql.exec.Description;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.hive.ql.udf.UDFType;
import org.apache.hadoop.io.LongWritable;

@Description(name = "row_sequence",
    value = "_FUNC_() - Returns a generated row sequence number starting from 1")
// Not deterministic (every call returns a new value) and stateful
// (the counter persists across rows within a task).
@UDFType(deterministic = false, stateful = true)
public class UDFRowSequence extends UDF
{
  private LongWritable result = new LongWritable();

  public UDFRowSequence() {
    result.set(0);
  }

  // Increment and return the per-task counter on every call.
  public LongWritable evaluate() {
    result.set(result.get() + 1);
    return result;
  }
}


  • Compile the class and package it into a jar, as sketched below.
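
A minimal build sketch, assuming the source file sits in the current directory and $HIVE_HOME points at the Hive install (the output name customhiveudf.jar is just a placeholder):

# Compile against the Hadoop and Hive classes.
mkdir -p build
javac -cp "$(hadoop classpath):$HIVE_HOME/lib/*" -d build UDFRowSequence.java
# Package the compiled class (with its package directories) into a jar.
jar cf customhiveudf.jar -C build .
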
  • Create a Hive-managed HBase table, with id as the column that will hold the generated sequence.

create table ga_tblagetest (
  id String, userAgeBracket String, date String, sessions String,
  percentNewSessions String, newUsers String, bounceRate String,
  pageviewsPerSession String, avgSessionDuration String)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" =
  ":key,cf1:userAgeBracket,cf1:date,cf1:sessions,cf1:percentNewSessions,cf1:newUsers,cf1:bounceRate,cf1:pageviewsPerSession,cf1:avgSessionDuration")
TBLPROPERTIES ("hbase.table.name" = "ga_tblagetest");
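
Once this runs, the storage handler should have created the backing HBase table as well; a quick sanity check from the HBase shell:

# Confirm the backing HBase table exists.
echo "exists 'ga_tblagetest'" | hbase shell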


  • In Hive, add the jar: ADD JAR /path/customhiveudf.jar
  • create temporary function incrementrow as 'org.apache.hadoop.hive.contrib.udf.UDFRowSequence';
  • CREATE TEMPORARY TABLE default.tmp2 AS SELECT incrementrow(),* FROM socialmedia.tblage;
  • INSERT OVERWRITE TABLE ga_tblagetest SELECT * FROM default.tmp2;
Once the insert finishes, the data can be queried by row key in HBase. One caveat: the sequence is only unique within a single task, so the insert should run in a single mapper, otherwise duplicate ids are possible.
  • scan 'ga_tblagetest'

Friday 12 February 2016

Script for getting the name and row count of every Hive table into a text file

tables=$(hive -e 'use oxi_consolidated; show tables;')
for t in $tables; do
  hive -e "select '$t', count(*) from oxi_consolidated.$t"
done > countoxi.txt

Execute all .hql files inside a directory:

oxi_array=( $(find /home/oxigen/HiveStartup/ -type f -name '*.hql' -follow) )
for element in "${oxi_array[@]}"
do
  hive -f "${element}"
done
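
If the .hql files all sit at the top level of the directory (and the names contain no spaces), a plain glob does the same job:

# Run every .hql file in the directory through the Hive CLI.
for f in /home/oxigen/HiveStartup/*.hql; do
  hive -f "$f"
done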

Sunday 7 February 2016

Connect a Linux HDInsight cluster to Power BI/Tableau


  1. Download and install the Microsoft Hive ODBC Driver:
     https://www.microsoft.com/en-us/download/confirmation.aspx?id=40886
  2. Open the ODBC Data Source Administrator and add a new Hive DSN.
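
When creating the DSN, the fields typically look like the following; treat these as assumptions for a generic Linux HDInsight cluster, not values from the original setup:

Host:              <clustername>.azurehdinsight.net
Port:              443
Hive Server Type:  Hive Server 2
Authentication:    Windows Azure HDInsight Service (the cluster's HTTP/admin credentials)
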
  3. Connected!



Thursday 4 February 2016

Shell script for automating the Google Analytics data load

todaysdate=$(date +"%Y-%m-%d")
path="/home/brillio/GoogleAnalytics/CSVData/"
dir="/home/brillio/GoogleAnalytics"
jarname="googleanalytics"
hdfspath="/socialmedia/googleanalytics/"

# Clear out the previous run's CSVs, if any.
if test -d "$path"; then
  sudo rm -r "$path"*
else
  echo "FileNotExist"
fi

# Run the GA extractor and make its output readable.
java -jar "$jarname.jar"
sudo chmod -R 777 "$dir"

tablename=( "tblAge" "tblBehaviourAndNetwork" "tblGender" "tblGeo" "tblTechnologyBrowser" "tblTechnologyOperatingSystem" "tblLanguage" )
for table in "${tablename[@]}"
do
  # Remove today's file from HDFS if it already exists, then copy in the fresh one.
  if hadoop fs -test -f "$hdfspath$table/${table}_$todaysdate.csv"; then
    hadoop fs -rm -r "$hdfspath$table/${table}_$todaysdate.csv"
  else
    echo "FileNotExist"
  fi
  hadoop fs -copyFromLocal "$path$table"_* "$hdfspath$table/"
done
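
To run this automatically, it can be scheduled with cron; a sketch, assuming the script above is saved as /home/brillio/GoogleAnalytics/ga_load.sh (a hypothetical path):

# Run the GA load daily at 01:30, appending output to a log.
30 1 * * * /home/brillio/GoogleAnalytics/ga_load.sh >> /home/brillio/ga_load.log 2>&1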

Tuesday 2 February 2016

Monitor Spark Processes (list running Spark applications)

If you are running Spark on a YARN cluster, you can list the running Spark applications:

yarn application -list
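
To narrow the output, -appStates filters by state and grep by name, e.g. to see only running Spark applications:

yarn application -list -appStates RUNNING | grep -i spark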

To kill a running application (e.g. a Spark Streaming process):

yarn application -kill <application_id>


Copy startup files from Azure Blob to local every time the cluster is created, using PowerShell

$servername = "xxxxxxxxxxxxxxxxxxxxx"
$username = "xxxxxxxxx"
$password = "xxxxxxxxxxxxx"
$port = 22
$foldername = @("HueScripts","completereload")

# Open SSH sessions to the cluster nodes (SSH-Sessions module).
Import-Module SSH-Sessions
New-SshSession -ComputerName $servername -Username $username -Password $password -Port $port

foreach ($f in $foldername) {
   # Remove any stale copy of the folder, then recreate it with open permissions.
   $command1 = "test -d `"/home/brillio/$f`" && sudo rm -R /home/brillio/$f || echo Does Not Exist Proceeding..."
   $SshResults1 = Invoke-SshCommand -InvokeOnAll -Command $command1
   $command2 = "sudo mkdir /home/brillio/$f"
   $SshResults2 = Invoke-SshCommand -InvokeOnAll -Command $command2
   $command3 = "sudo chmod -R 777 /home/brillio/$f"
   $SshResults3 = Invoke-SshCommand -InvokeOnAll -Command $command3
   # Copy the folder's contents down from blob storage, then open permissions on the result.
   $command4 = "hadoop fs -copyToLocal wasb://xxxxxxx@xxxxxxxxxxx/$f/* /home/brillio/$f/"
   $SshResults4 = Invoke-SshCommand -InvokeOnAll -Command $command4
   $command5 = "sudo chmod -R 777 /home/brillio/$f"
   $SshResults5 = Invoke-SshCommand -InvokeOnAll -Command $command5
   Write-Host "Done $f"
}

# Run the startup script, then close all SSH sessions.
$command6 = "/home/brillio/hueurl.sh"
$SshResults6 = Invoke-SshCommand -InvokeOnAll -Command $command6

Remove-SshSession -RemoveAll

How to pass a parameter in Hive HQL

Hive substitutes ${db} from the variable set with --define; the single quotes keep the shell from expanding it first.

hive -e 'create database ${db}' --define db='twitter_stream'


OR

If you are calling the Hive HQL from a shell script:

hive -e 'create database ${db}' --define db=$1
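
The same substitution works when the HQL lives in a file; here createdb.hql is a hypothetical script containing create database ${db};

hive -f createdb.hql --define db=$1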

Monday 1 February 2016

Capture Sqoop Logs


To send output only to the log file and not to the console:

/home/brillio/sqoop-jobs/test.sh > /home/brillio/logsqoop.txt 2>&1

To send output to both the console and the log file:

/home/brillio/sqoop-jobs/test.sh 2>&1 | tee /home/brillio/logsqoop.txt
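
To append to the log across runs instead of overwriting it, use tee -a:

/home/brillio/sqoop-jobs/test.sh 2>&1 | tee -a /home/brillio/logsqoop.txt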