Saturday, 26 December 2015

Running a Process in the Background in Linux for an Indefinite Time

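Sometimes a process needs to keep running in the background after we start it, even once the terminal session is closed.
Linux provides the nohup command for this. nohup makes the process ignore the hangup signal, and the trailing & sends it to the background.

Example:

nohup command-to-execute &

nohup jupyter-notebook &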
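
A similar effect can be had from a Python script. The sketch below is only an illustration, not a replacement for nohup: it starts a command in its own session so it is not tied to the launching terminal, and sends its output to a log file much as nohup writes to nohup.out. The function name run_detached and the log file name are assumptions made for this example.

```python
import subprocess


def run_detached(args, log_path):
    """Start args in a new session, appending stdout and stderr to log_path."""
    with open(log_path, "ab") as log:
        return subprocess.Popen(
            args,
            stdin=subprocess.DEVNULL,    # no terminal input
            stdout=log,                  # output goes to the log file
            stderr=subprocess.STDOUT,
            start_new_session=True,      # POSIX only: detach from the terminal's session
        )
```

For example, run_detached(["jupyter-notebook"], "notebook.log") would start the notebook server and return immediately.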

Thursday, 24 December 2015

How to Create an External Table in Hive with Partitions and Load Data

Start with the CREATE TABLE statement:


Create EXTERNAL TABLE Countries(
Id TINYINT,
Country String,
udate String,
UPDATE_DT String,
ACTIVE_FLAG String)
PARTITIONED BY (INSERT_DT String)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','

Location '/training/test/';

The table now exists in Hive, but it does not contain any data yet.

Data can be loaded into a partitioned table in two ways:

1) Static partition insert
2) Dynamic partition insert

1. Static Partition Insert

Load data into the partitioned table by pointing a partition at the path of the files:
ALTER TABLE Countries ADD PARTITION(INSERT_DT='25-12-2015')
LOCATION '/training/test/25-12-2015';
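
When there are many dated directories, writing one ALTER TABLE statement per partition by hand gets tedious. The sketch below generates the statements from a list of dates. It is a minimal illustration: the helper name add_partition_sql is hypothetical, and it assumes each partition's files live in a sub-directory named after the partition value, as in the example above.

```python
def add_partition_sql(table, partition_col, value, base_dir):
    """Build one ALTER TABLE ... ADD PARTITION statement for a dated directory."""
    location = base_dir.rstrip("/") + "/" + value
    return (
        f"ALTER TABLE {table} ADD PARTITION({partition_col}='{value}') "
        f"LOCATION '{location}';"
    )


# Print one statement per date; paste the output into the Hive shell.
for day in ["25-12-2015", "26-12-2015"]:
    print(add_partition_sql("Countries", "INSERT_DT", day, "/training/test/"))
```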


In case you want to point an existing partition at a different location:

ALTER TABLE Countries PARTITION(INSERT_DT='25-12-2015')
SET LOCATION '/training/test/25th';

Note: This won't delete the existing data. It simply changes the location of the partition's data.

2. Dynamic Partition Insert

Create a new Hive table:

Create EXTERNAL TABLE Countries_dynamic(
Id TINYINT,
Country String,
udate String,
UPDATE_DT String,
ACTIVE_FLAG String)
PARTITIONED BY (INSERT_DT String)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','

Location '/training/test_dynamic/';

Set the following properties:

set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.max.dynamic.partitions.pernode=1000;

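Now, to load data into the partitions, run the following query. The column named in the PARTITION clause must be the table's partition column (INSERT_DT), and it must come last in the SELECT list:

INSERT OVERWRITE TABLE Countries_dynamic
PARTITION (INSERT_DT)
SELECT Id, Country, udate, UPDATE_DT, ACTIVE_FLAG, INSERT_DT
FROM Countries;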
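
To picture what a dynamic partition insert does, the sketch below groups rows by the value of the partition column and assigns each group a directory of the form column=value, which is the layout Hive uses under the table location. This is only a simulation in plain Python, not Hive code. The function name route_to_partitions and the sample rows are made up for illustration, and the lowercase directory name assumes Hive's usual lowercasing of column names.

```python
from collections import defaultdict


def route_to_partitions(rows, partition_col, base_dir):
    """Group rows by partition value, mimicking Hive's <col>=<value> directories."""
    partitions = defaultdict(list)
    for row in rows:
        # The partition value is stored in the directory name, not in the data files.
        data = {k: v for k, v in row.items() if k != partition_col}
        path = f"{base_dir.rstrip('/')}/{partition_col.lower()}={row[partition_col]}"
        partitions[path].append(data)
    return dict(partitions)


sample_rows = [
    {"Id": 1, "Country": "India", "INSERT_DT": "25-12-2015"},
    {"Id": 2, "Country": "Peru", "INSERT_DT": "26-12-2015"},
    {"Id": 3, "Country": "Chile", "INSERT_DT": "25-12-2015"},
]
for directory, data in route_to_partitions(sample_rows, "INSERT_DT", "/training/test_dynamic/").items():
    print(directory, len(data))
```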





Tuesday, 22 December 2015

Execute a Sqoop Job Using Windows PowerShell

#region - provide the following values

$subscriptionID = "XXXXXXXXXXXXXXXXXXXXXXXXXXX"

#region - variables

# Resource group variables
$resourceGroupName = "XXXXXXXXXXX"
$location = "XXXXXXXXXXX" # used by all Azure services defined in this tutorial


# HDInsight variables
$hdinsightClusterName = "XXXXXXXXXXXXXXXXX"
$defaultStorageAccountName = "XXXXXXXXXXXXXXXXXXX"
$defaultBlobContainerName = "XXXXXXXXXXXXXXXXXXXX"
$defaultStorageAccountKey=Get-AzureRmStorageAccountKey -ResourceGroupName $resourceGroupName -Name $defaultStorageAccountName | %{ $_.Key1 }


$username = "XXXXXXXXXX"
$password = "XXXXXXXXXX" | ConvertTo-SecureString -AsPlainText -Force
$httpCredential = New-Object -TypeName System.Management.Automation.PSCredential -ArgumentList $username, $password
#endregion

#region - Connect to Azure subscription
Write-Host "`nConnecting to your Azure subscription ..." -ForegroundColor Green
try{Get-AzureRmContext}
catch{Login-AzureRmAccount}
#endregion

#region - Create Azure resource group
Write-Host "`nCreating an Azure resource group ..." -ForegroundColor Green
try{
    Get-AzureRmResourceGroup -Name $resourceGroupName
}
catch{
    New-AzureRmResourceGroup -Name $resourceGroupName -Location $location
}
#endregion

# Enter Table Name
$tableName_log4j = "XXXXXXXXXXXXXX"

# Connection string for Remote SQL Database.

$connectionString = "jdbc:sqlserver://sqlserver-vm1.cloudapp.net:1433;database=XXXXXXXXXXXXX;username=XXXXXXX;password=XXXXXXXX"

# Submit a Sqoop job
$sqoopDef = New-AzureRmHDInsightSqoopJobDefinition `
    -Command "import --connect $connectionString --table $tableName_log4j --num-mappers 32 --null-string '\\N' --null-non-string '\\N' --target-dir  /sqlserverdump/XXXXXXXXXXX"
$sqoopJob = Start-AzureRmHDInsightJob `
                -ClusterName $hdinsightClusterName `
                -HttpCredential $httpCredential `
                -JobDefinition $sqoopDef #-Debug -Verbose
Wait-AzureRmHDInsightJob `
    -ResourceGroupName $resourceGroupName `
    -ClusterName $hdinsightClusterName `
    -HttpCredential $httpCredential `
    -JobId $sqoopJob.JobId

Write-Host "Standard Error" -BackgroundColor Green
Get-AzureRmHDInsightJobOutput -ResourceGroupName $resourceGroupName -ClusterName $hdinsightClusterName -DefaultStorageAccountName $defaultStorageAccountName -DefaultStorageAccountKey $defaultStorageAccountKey -DefaultContainer $defaultBlobContainerName -HttpCredential $httpCredential -JobId $sqoopJob.JobId -DisplayOutputType StandardError
Write-Host "Standard Output" -BackgroundColor Green
Get-AzureRmHDInsightJobOutput -ResourceGroupName $resourceGroupName -ClusterName $hdinsightClusterName -DefaultStorageAccountName $defaultStorageAccountName -DefaultStorageAccountKey $defaultStorageAccountKey -DefaultContainer $defaultBlobContainerName -HttpCredential $httpCredential -JobId $sqoopJob.JobId -DisplayOutputType StandardOutput

#endregion

Thursday, 17 December 2015

Working with Sqoop


  • Used to import data from traditional RDBMS to HDFS/Hive/HBase etc and vice-versa
  • Best approach for filtering :
    • Run Query in RDBMS -> Create a temp table there -> Import this temp table using Sqoop.
  • How to pass the password in a Sqoop command?
    • Use -P : prompts the user to enter the password.
    • Save the password in a file and reference it with --password-file
  • By default, an import writes CSV files to HDFS. Other formats:
    • Avro : --as-avrodatafile
    • SequenceFile : --as-sequencefile
  • Compression Support :
    • --compress --compression-codec ...
    • Splittable : Bzip2, LZO (once indexed)
    • Not splittable : Gzip, Snappy
  • For faster transfer:
    • --direct : supported for MySQL and PostgreSQL
  • --map-column-java col1=String,col2=Float  (changes the Java column type while importing from the RDBMS)
  • The CSV output does not handle NULL (blank) values well, so:
    • If colType = VARCHAR,CHAR,NCHAR,TEXT 
      • --null-string '\\N'
    • If any other colType
      • --null-non-string '\\N'
  • Import all tables from a DB?
    • sqoop import-all-tables
    • Tables are imported in sequential order
    • Use --exclude-tables to skip some tables
    • --target-dir cannot be used; use --warehouse-dir instead.
  • Incremental imports in Sqoop:
    • When new rows are added and existing rows do not change:
      • Use --incremental append --check-column id --last-value 0
    • When existing rows can change:
      • Use --incremental lastmodified --check-column  --last-value
  • Create a Sqoop job so that last-value is picked up automatically:
    • sqoop job --create name_of_job -- import --connect...............
    • sqoop job --list
    • sqoop job --exec name_of_job
      • Sqoop serializes the last imported value back to the metastore after each successful incremental job
  • Use a boundary query for optimization:
    • --username sqoop --password sqoop --query 'SELECT normcities.id, countries.country, normcities.city FROM normcities JOIN countries USING(country_id) WHERE $CONDITIONS' --split-by id --target-dir cities --boundary-query "select min(id), max(id) from normcities"
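
To see why the boundary query matters: Sqoop uses the minimum and maximum of the --split-by column to hand each mapper its own slice of rows. The sketch below shows the general idea for an integer column. It is a simplified illustration and not Sqoop's actual splitter code; the function name split_ranges is hypothetical.

```python
def split_ranges(min_id, max_id, num_mappers, column="id"):
    """Divide [min_id, max_id] into contiguous WHERE clauses, one per mapper."""
    if num_mappers < 1 or max_id < min_id:
        raise ValueError("need num_mappers >= 1 and max_id >= min_id")
    span = max_id - min_id
    bounds = [min_id + (span * i) // num_mappers for i in range(num_mappers)]
    bounds.append(max_id)
    clauses = []
    for i in range(num_mappers):
        lo, hi = bounds[i], bounds[i + 1]
        last = i == num_mappers - 1
        if lo == hi and not last:
            continue  # range too small for this many mappers
        op = "<=" if last else "<"  # only the last slice includes the upper bound
        clauses.append(f"{column} >= {lo} AND {column} {op} {hi}")
    return clauses


for clause in split_ranges(0, 100, 4):
    print(clause)
```

A cheap boundary query (for example one that reads min and max from an indexed column of a single table) saves Sqoop from running the full join just to find these two numbers.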

Monday, 14 December 2015

Load CSV data to Hbase Using Pig


  1. Open the HBase shell.
  2. Create a table:
    • create 'mydata1','mycf'
  3. Open the Pig shell.
  4. Load the CSV file:
    • A = LOAD '/lokesh/hbasetest.txt' USING PigStorage(',') as (strdata:chararray, intdata:long);
  5. Store it in HBase. The first field (strdata) becomes the row key and intdata is written to mycf:intdata:
    • STORE A INTO 'hbase://mydata1' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('mycf:intdata');
  6. Done!!!!!!
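
The mapping that HBaseStorage applies can be pictured with a few lines of Python: the first field of each record becomes the row key and the remaining fields are paired, in order, with the column names passed to HBaseStorage. This is a sketch of the mapping only and does not talk to HBase; the function name to_hbase_cells and the sample data are assumptions for illustration.

```python
import csv
import io


def to_hbase_cells(csv_text, columns):
    """Yield (row_key, {column: value}) pairs; the first CSV field is the row key."""
    for fields in csv.reader(io.StringIO(csv_text)):
        if not fields:
            continue  # skip empty lines
        row_key, values = fields[0], fields[1:]
        yield row_key, dict(zip(columns, values))


for key, cells in to_hbase_cells("alpha,10\nbeta,20\n", ["mycf:intdata"]):
    print(key, cells)
```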


Wednesday, 9 December 2015

Load data from MYSQL and dump to S3 using SPARK


RDBMS IMPORT



***** Using Python *******

pyspark --jars /mnt/resource/lokeshtest/guava-12.0.1.jar,/mnt/resource/lokeshtest/hadoop-aws-2.6.0.jar,/mnt/resource/lokeshtest/aws-java-sdk-1.7.3.jar,/mnt/resource/lokeshtest/mysql-connector-java-5.1.38/mysql-connector-java-5.1.38/mysql-connector-java-5.1.38-bin.jar --packages com.databricks:spark-csv_2.10:1.2.0

from pyspark.sql import SQLContext

sqlcontext=SQLContext(sc)

dataframe_mysql = sqlcontext.read.format("jdbc").options(url="jdbc:mysql://YOUR_PUBLIC_IP:3306/DB_NAME",driver = "com.mysql.jdbc.Driver",dbtable = "TBL_NAME",user="sqluser",password="sqluser").load()

dataframe_mysql.show()
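
The JDBC options are easy to get wrong when typed inline (a stray space in the URL is enough to break the connection), so it can help to build them in one place. The helper below is a hypothetical convenience, not part of Spark: mysql_jdbc_options is a name invented here, and the host used in the usage line is a placeholder address.

```python
def mysql_jdbc_options(host, database, table, user, password, port=3306):
    """Return keyword options for sqlcontext.read.format("jdbc").options(**opts)."""
    if any(ch.isspace() for ch in host + database):
        raise ValueError("host and database must not contain whitespace")
    return {
        "url": f"jdbc:mysql://{host}:{port}/{database}",
        "driver": "com.mysql.jdbc.Driver",
        "dbtable": table,
        "user": user,
        "password": password,
    }


opts = mysql_jdbc_options("10.0.0.5", "DB_NAME", "TBL_NAME", "sqluser", "sqluser")
print(opts["url"])
```

Inside the pyspark shell the dictionary would then be passed as sqlcontext.read.format("jdbc").options(**opts).load().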



****** Using Scala *******

sudo -u root spark-shell --jars /mnt/resource/lokeshtest/guava-12.0.1.jar,/mnt/resource/lokeshtest/hadoop-aws-2.6.0.jar,/mnt/resource/lokeshtest/aws-java-sdk-1.7.3.jar,/mnt/resource/lokeshtest/mysql-connector-java-5.1.38/mysql-connector-java-5.1.38/mysql-connector-java-5.1.38-bin.jar --packages com.databricks:spark-csv_2.10:1.2.0

import org.apache.spark.sql.SQLContext

val sqlcontext = new org.apache.spark.sql.SQLContext(sc)

val dataframe_mysql = sqlcontext.read.format("jdbc").option("url", "jdbc:mysql://YOUR_PUBLIC_IP:3306/DB_NAME").option("driver", "com.mysql.jdbc.Driver").option("dbtable", "TBL_NAME").option("user", "sqluser").option("password", "sqluser").load()

dataframe_mysql.show()


**********************************************************************************************************************************************************

****** Using Scala *******

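Persist in memory cache:

dataframe_mysql.cache()

Perform some transformation or filter on the DataFrame using filter, map, etc.

val filter_gta = dataframe_mysql.filter(dataframe_mysql("date") === "20151129")

Optional: repartition the data into a single partition so that one output file is written. DataFrames are immutable, so keep the returned DataFrame (if you skip this step, call write on filter_gta instead):

val filter_gta_single = filter_gta.repartition(1)

Save to S3 as CSV:

filter_gta_single.write.format("com.databricks.spark.csv").option("header","true").save("s3n://YOUR_KEY:YOUR_SECRET@BUCKET_NAME/resources/spark-csv/mysqlimport1.csv")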


************************************************************************************************************************************************************

Spark and AWS Sample Code

Load CSV data to Amazon S3

***** Using Python *******

pyspark --jars /mnt/resource/lokeshtest/guava-12.0.1.jar,/mnt/resource/lokeshtest/hadoop-aws-2.6.0.jar,/mnt/resource/lokeshtest/aws-java-sdk-1.7.3.jar --packages com.databricks:spark-csv_2.10:1.2.0

df = sqlContext.read.format('com.databricks.spark.csv').options(header='true', inferschema='true').load('s3n://YOUR_KEY:YOUR_SECRET@BUCKET_NAME/resources/spark-csv/sparksamplecsv.csv')

df.show()



****** Using Scala *******

sudo -u root spark-shell --jars /mnt/resource/lokeshtest/guava-12.0.1.jar,/mnt/resource/lokeshtest/hadoop-aws-2.6.0.jar,/mnt/resource/lokeshtest/aws-java-sdk-1.7.3.jar --packages com.databricks:spark-csv_2.10:1.2.0

val df = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").option("inferSchema", "true").load("s3n://YOUR_KEY:YOUR_SECRET@BUCKET_NAME/resources/spark-csv/sparksamplecsv.csv")

df.show()


**********************************************************************************************************************************************************

Monday, 7 December 2015

For using AWS in Spark

Add the following jars to the Spark classpath at runtime:

1. hadoop-aws-2.6.0.jar
2. aws-java-sdk-1.7.3.jar
3. guava-12.0.1.jar

spark-shell --jars  hadoop-aws-2.6.0.jar,aws-java-sdk-1.7.3.jar,guava-12.0.1.jar  (for spark with scala)
pyspark --jars  hadoop-aws-2.6.0.jar,aws-java-sdk-1.7.3.jar,guava-12.0.1.jar  (for spark with python)


Monday, 9 November 2015

How to Turn Off Log4j Warnings


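Set this in your code (it needs the imports org.apache.log4j.Logger and org.apache.log4j.Level). Note that Level.OFF silences all log4j output, not only warnings:
Logger.getRootLogger().setLevel(Level.OFF);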

Monday, 26 October 2015

HCatalog Basics


  • HCatalog is an extension of Hive that exposes the Hive metadata to other tools and frameworks.
  • To define an HCatalog schema, one simply needs to define a table in Hive.
  • HCatalog is useful when the schema needs to be exposed outside of Hive, i.e. to other frameworks such as Pig.
  • To load a table student, managed by HCatalog:
    • stu_table= LOAD 'student' USING org.apache.hcatalog.pig.HCatLoader();
      • the schema of stu_table is whatever the schema of student is.
  • Similarly, to store we use :
    • STORE stu_table INTO 'student' USING org.apache.hcatalog.pig.HCatStorer();


  • From the Pig shell, we can run Hive DDL commands:

grunt> sql create table movies (
    title string,
    rating string,
    length double)
partitioned by (genre string)
stored as ORC;

Thursday, 22 October 2015

Twitter Kafka Integration Using Hortonworks

KAFKA-TWITTER IN HORTONWORKS

- Make sure Kafka and ZooKeeper are running (check in Ambari at yourhostname.cloudapp.net:8080)
- Check that the ZooKeeper port is 2181 and the Kafka port is 6667

- Create a Topic :

/usr/hdp/current/kafka-broker/bin/kafka-topics.sh --create --zookeeper yourhostname.cloudapp.net:2181 --replication-factor 1 --partitions 1 --topic twitter-topic

- Verify if topic is created

/usr/hdp/current/kafka-broker/bin/kafka-topics.sh --list --zookeeper yourhostname.cloudapp.net:2181

- Create a Java Maven project:

package SampleTwitterKafka;

import java.util.Properties;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

import kafka.producer.KeyedMessage;
import kafka.producer.ProducerConfig;

import com.google.common.collect.Lists;
import com.twitter.hbc.ClientBuilder;
import com.twitter.hbc.core.Client;
import com.twitter.hbc.core.Constants;
import com.twitter.hbc.core.endpoint.StatusesFilterEndpoint;
import com.twitter.hbc.core.processor.StringDelimitedProcessor;
import com.twitter.hbc.httpclient.auth.Authentication;
import com.twitter.hbc.httpclient.auth.OAuth1;

public class TwitterKafkaProducer {

private static final String topic = "twitter-topic";

public static void run(String consumerKey, String consumerSecret,
String token, String secret) throws InterruptedException {

Properties properties = new Properties();
properties.put("metadata.broker.list", "yourhostname.cloudapp.net:6667");
properties.put("serializer.class", "kafka.serializer.StringEncoder");
properties.put("client.id","camus");
ProducerConfig producerConfig = new ProducerConfig(properties);
kafka.javaapi.producer.Producer<String, String> producer = new kafka.javaapi.producer.Producer<String, String>(
producerConfig);

BlockingQueue<String> queue = new LinkedBlockingQueue<String>(10000);
StatusesFilterEndpoint endpoint = new StatusesFilterEndpoint();
// add some track terms
endpoint.trackTerms(Lists.newArrayList("#ALDUB14thWeeksary",
"#MagpasikatAnneKimEruption", "#happydussehra", "ItsShowtime DARREN"));

Authentication auth = new OAuth1(consumerKey, consumerSecret, token,
secret);
// Authentication auth = new BasicAuth(username, password);

// Create a new BasicClient. By default gzip is enabled.
Client client = new ClientBuilder().hosts(Constants.STREAM_HOST)
.endpoint(endpoint).authentication(auth)
.processor(new StringDelimitedProcessor(queue)).build();

// Establish a connection
client.connect();

// Forward the first 1000 messages from the queue to Kafka
for (int msgRead = 0; msgRead < 1000; msgRead++) {
// take() blocks until a tweet is available. If the thread is interrupted, the
// exception propagates to the caller instead of sending a null message.
KeyedMessage<String, String> message = new KeyedMessage<String, String>(topic, queue.take());
producer.send(message);
}
producer.close();
client.stop();

}

public static void main(String[] args) {
try {
TwitterKafkaProducer.run("XXXXXXXXXXXXXX", "XXXXXXXXXXXXXX", "XXXXXXXXXXXXXX", "XXXXXXXXXXXXXX");
} catch (InterruptedException e) {
System.out.println(e);
}
}
}
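
The core of the producer above is a bounded blocking queue sitting between the Twitter client and Kafka: one side fills the queue, the other takes a fixed number of messages and forwards them. The sketch below shows that pattern with Python's standard library only. It does not use the Twitter or Kafka APIs; forward_messages and feed are hypothetical names, and any callable can stand in for the Kafka send.

```python
import queue
import threading


def forward_messages(source, send, limit):
    """Take `limit` messages from a blocking queue and pass each one to send()."""
    sent = 0
    while sent < limit:
        send(source.get())  # get() blocks until a message is available
        sent += 1
    return sent


def feed(target, messages):
    """Put messages on the queue, blocking whenever the queue is full."""
    for message in messages:
        target.put(message)


buffer = queue.Queue(maxsize=10000)  # bounded, like LinkedBlockingQueue(10000)
received = []
feeder = threading.Thread(target=feed, args=(buffer, ["tweet-1", "tweet-2", "tweet-3"]))
feeder.start()
forward_messages(buffer, received.append, limit=3)
feeder.join()
print(received)
```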


***********************************************************************************

POM File :

<dependencies>
<dependency>
<groupId>com.twitter</groupId>
<artifactId>hbc-core</artifactId> <!-- or hbc-twitter4j -->
<version>2.2.0</version> <!-- or whatever the latest version is -->
</dependency>
<dependency>
<groupId>org.apache.kafka</groupId>
<artifactId>kafka_2.8.0</artifactId>
<version>0.8.1.1</version>
</dependency>
<dependency>
<groupId>log4j</groupId>
<artifactId>log4j</artifactId>
<version>1.2.16</version>
<exclusions>
<exclusion>
<groupId>javax.jms</groupId>
<artifactId>jms</artifactId>
</exclusion>
</exclusions>
</dependency>
<dependency>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-simple</artifactId>
<version>1.6.4</version>
</dependency>
<dependency>
<groupId>com.google.guava</groupId>
<artifactId>guava</artifactId>
<version>18.0</version>
</dependency>

</dependencies>

***********************************************************


- Replace the placeholder values in the Java code (the hostname and the four Twitter OAuth values shown as XXXXXXXXXXXXXX), and create a runnable jar with the main class set as the entry point

- Copy the jar to the linux machine :  yourhostname.cloudapp.net

- Run jar in linux terminal:
   java -jar twitter-snapshotv1.jar

- Check the producer's output with a console consumer in a new terminal:
   /usr/hdp/current/kafka-broker/bin/kafka-console-consumer.sh --zookeeper yourhostname.cloudapp.net:2181 --topic twitter-topic --from-beginning


BINGO !!!

Wednesday, 7 October 2015

GUI Tool for Hive

One of the best tools for connecting to Hive:

http://www.aquafold.com/dbspecific/apache_hive_client.html


The setup is straightforward:

Points:

1. Get the public IP of the machine where Hive is installed.

2. Check the username and password for Hive. You can find these in the Ambari portal if you are using Hortonworks.

3. Give a Database name. (Use default, if no Database is created)

4. Test the connection.

5. Done !




Setup 4 Node Hadoop Cluster using Azure Subscription

If you have an Azure subscription and are planning to set up a Hadoop cluster, I strongly recommend going through the link below:

http://blogs.technet.com/b/oliviaklose/archive/2014/06/17/hadoop-on-linux-on-azure-1.aspx


This is one of the best blogs I have come across, and you can set up your 4-node cluster in less than 4 hours!


Sunday, 20 September 2015

Enable GUI for Ubuntu in Azure

Steps : 

1)   sudo apt-get update

2)   sudo apt-get install ubuntu-desktop (This step will take time, please be patient)

3)   sudo apt-get install xrdp

4)   sudo /etc/init.d/xrdp start

5)   sudo adduser brillio

6)   sudo adduser lokesh

7)   sudo adduser lokesh sudo

8)   Download the RDP file from Azure and connect using the username lokesh (in my case) and the password set in step 6

Monday, 7 September 2015

Python Machine Learning Library installation


sudo easy_install pip
sudo pip install numpy
sudo easy_install gevent
sudo yum install gcc-c++ (for CentOS)
sudo apt-get install g++ (for Ubuntu)
sudo pip install pandas
sudo pip install scikit-learn
sudo pip install nltk
sudo pip install geopy
sudo pip install tweepy
sudo pip install scipy

In case of an error while installing SciPy:

Error on CentOS: raise NotFoundError('no lapack/blas resources found')
sudo yum install python-devel
sudo yum install libevent-devel
sudo yum install scipy

Error on Ubuntu: raise NotFoundError('no lapack/blas resources found')
sudo apt-get install gfortran libopenblas-dev liblapack-dev

Thursday, 3 September 2015

Check for port

# Example: find the port MySQL is listening on

sudo netstat -tlpn | grep mysql

# Example: find the process listening on a specific port

sudo netstat -plnt | grep ':61181'
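
When netstat is not available, or the check has to run from another machine, a port can also be probed from Python by attempting a TCP connection. This is a minimal sketch using only the standard library; the function name is_port_open is made up for this example, and 3306 in the usage line is simply MySQL's default port.

```python
import socket


def is_port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False  # refused, unreachable, or timed out


print(is_port_open("127.0.0.1", 3306))
```

Unlike netstat, this only tells you that something accepts connections on the port, not which process owns it.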

Tuesday, 1 September 2015

Backup Table and Data in HBase

Use the following commands in the HBase shell:

snapshot 'tweets1', 'tweets1-Snapshot'
clone_snapshot 'tweets1-Snapshot', 'newTable'


//tweets1 is the actual table name.
//newTable is the cloned table. Run scan 'newTable' to verify.

Hbase Error:Resolved

su - hbase -c "/usr/hdp/current/hbase-master/bin/hbase-daemon.sh start master; sleep 25"

su - hbase -c "/usr/hdp/current/hbase-regionserver/bin/hbase-daemon.sh start regionserver"

Then check that the master and region server processes are running:
    ps -ef | grep -i hmaster
    ps -ef | grep -i hregion

Copy Files from one Linux Machine to Another in Azure

Use the command below to copy a file from one machine to the other:


scp -C -i myPrivateKey_rsa -r /home/brillio/move.txt brillio@your_target_hostname.cloudapp.net:/home/brillio/


To copy data from one HDFS cluster to another, use distcp:

hadoop distcp hdfs://your_hostname:8020/training/BSA hdfs://your_target_hostname:8020/training/BSA/

Tuesday, 25 August 2015

Automate Thrift Server Start/Stop in Hortonworks Hbase using shell script

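1. Create a .sh file.

2. Copy the following lines into it. The first line kills whatever is listening on port 9090, and cut -d/ -f1 keeps everything before the slash in PID/program so PIDs of any length work:

         sudo kill -9 `sudo netstat -nlp | grep ':9090 ' | awk '{print $7}' | cut -d/ -f1`
         hbase thrift start -threadpool

3. Execute the shell script. -> /home/brillio/xxxxxxx.sh

4. Done.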
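
The PID lookup in the script can also be done in Python, which makes the parsing explicit. The sketch below takes the text printed by netstat -nlp and returns the PIDs listening on a given port by splitting the PID/program field at the slash. It is an illustrative helper only: pids_listening_on is a hypothetical name and the sample output is invented.

```python
def pids_listening_on(netstat_output, port):
    """Extract PIDs of processes listening on `port` from `netstat -nlp` text."""
    pids = set()
    for line in netstat_output.splitlines():
        fields = line.split()
        if len(fields) < 7 or not fields[0].startswith("tcp"):
            continue  # header, UNIX sockets, or lines without a PID column
        local_address, owner = fields[3], fields[6]
        if local_address.rsplit(":", 1)[-1] != str(port):
            continue  # exact port match, so 19090 is not mistaken for 9090
        pid = owner.split("/", 1)[0]  # "1234/java" -> "1234"
        if pid.isdigit():
            pids.add(int(pid))
    return sorted(pids)


sample = (
    "Proto Recv-Q Send-Q Local Address   Foreign Address State  PID/Program name\n"
    "tcp        0      0 0.0.0.0:9090    0.0.0.0:*       LISTEN 734/java\n"
    "tcp        0      0 0.0.0.0:19090   0.0.0.0:*       LISTEN 81234/python\n"
)
print(pids_listening_on(sample, 9090))
```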

Saturday, 22 August 2015

Connect Hbase in Hortonworks and Tableau

Step 1 : Start the REST server on the Hortonworks Linux machine

             /usr/hdp/2.2.6.0-2800/hbase/bin/hbase-daemon.sh start rest -p 9768

Note: 9768 is the port I have selected. Make sure this port is free on your machine, or else use a different port.

Step 2 : Download Simba ODBC Driver (in Windows Machine, where your tableau server is)

             http://www.simba.com/connectors/apache-hbase-odbc

Step 3 : Install it.

Step 4 :

Go to path highlighted in Yellow -> Add your Public Virtual IP ( you can get it from Azure VM -  Dashboard) . Enter port as used in command in step 1.

Step 5 : If you are using Azure , make sure in endpoints you have opened 9768 port.

Step 6 : Test the connection, it should show list of tables.

Step 7 : Connect with Tableau using Other ODBC Connection. Select Simba as the DSN and provide the server.

Step 8 : If it doesn't work, make sure you have copied the Simba licence file, which you received by email, to the installed location. (C:\Program Files\Simba HBase ODBC Driver\lib)

Friday, 21 August 2015

Connect to Hbase using Python Code and happybase

Step 1 : pip install happybase

Step 2 :

import happybase

#Pass public virtual IP, you can find it in Azure VM dashboard
connection = happybase.Connection('137.135.XX.XXX')

#Snippet to create table

#connection.create_table(
#'mytablefrompythoncode',
#{'cf1': dict(max_versions=10),
#'cf2': dict(max_versions=1, block_cache_enabled=False),
#'cf3': dict(), # use defaults
#}
#)

#Print available tables
print(connection.tables())
#Get a handle to the table
table = connection.table('mytablefrompythoncode')

Step 3 : hbase thrift start -threadpool (run this on the Hortonworks Linux machine before executing the script, because happybase connects through the Thrift server)

Step 4 : To run this .py script on Linux, copy it to the Linux machine using WinSCP or any similar tool, and on the command line type:
cd /Path_where_code_is
python xxxxxxxx.py
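
Once the table handle exists, writing and reading a row takes two calls: happybase tables provide put(row, data) and row(row). The sketch below wraps them in a small helper. So that it runs without a cluster, it uses an in-memory stand-in class; InMemoryTable and put_and_fetch are names invented for this example, and the column cf1:greeting is an assumption.

```python
class InMemoryTable:
    """Stand-in for a happybase table so the sketch runs without a cluster."""

    def __init__(self):
        self._rows = {}

    def put(self, row_key, cells):
        self._rows.setdefault(row_key, {}).update(cells)

    def row(self, row_key):
        return dict(self._rows.get(row_key, {}))


def put_and_fetch(table, row_key, cells):
    """Write cells to one row, then read the row back."""
    table.put(row_key, cells)
    return table.row(row_key)


# With a live cluster, pass connection.table('mytablefrompythoncode') instead.
print(put_and_fetch(InMemoryTable(), b"row-1", {b"cf1:greeting": b"hello"}))
```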

Thursday, 20 August 2015

Connecting Hortonworks Hive and Tableau

Connection between Tableau and Hive



Download this for your platform:

Hive ODBC Driver for HDP 2.2 (v1.4.14)
The Hortonworks Hive ODBC Driver allows you to connect popular Business Intelligence (BI) tools to query, analyze and visualize data stored within the Hortonworks Data Platform.


Use this for settings:






username : hive
password : xxxxxxxxxx

You can find the above info from : your_ambari_server.cloudapp.net:8080


Friday, 7 August 2015

HDFS Commands


hadoop fs -ls          

hadoop fs -ls /

hadoop fs -mkdir test     (Make directory)

hadoop fs -mkdir -p test/test2/test3    (Create nested directories in one go)

hadoop fs -rm -R test/test2        (Remove)

cd /root/devph/labs/Lab2.1/
tail data.txt

hadoop fs -put data.txt test/

hadoop fs -cp test/data.txt test/test1/data2.txt

hadoop fs -rm test/test1/data2.txt

hadoop fs -cat test/data.txt

hadoop fs -tail test/data.txt

hadoop fs -get test/data.txt /tmp/

hadoop fs -put /root/devph/labs/demos/small_blocks.txt test/

hadoop fs -getmerge test /tmp/merged.txt        (Merge all files under this dir into one local file)

hadoop fs -D dfs.blocksize=1048576 -put data.txt data.txt   (Set the block size for this upload to 1 MB)

hdfs fsck /user/root/data.txt    (File system check: shows the number of blocks)
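
The number of blocks that fsck reports follows directly from the file size and the block size: it is the file size divided by the block size, rounded up. A small sketch of that arithmetic (the function name block_count is made up for this example, and the default matches the 1 MB block size used above):

```python
def block_count(file_size_bytes, block_size_bytes=1048576):
    """Number of HDFS blocks needed for a file of the given size."""
    if file_size_bytes < 0 or block_size_bytes <= 0:
        raise ValueError("file size must be >= 0 and block size must be > 0")
    # Ceiling division with integers avoids floating-point rounding.
    return -(-file_size_bytes // block_size_bytes)


print(block_count(5 * 1048576 + 1))
```

A file of 5 MB plus one byte therefore needs six 1 MB blocks, the last of which holds a single byte.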

Wednesday, 5 August 2015

Pseudo Node Setup

Create VM in Microsoft Azure.
Open ports:










sudo vi /etc/sysctl.conf
      fs.file-max = 65536

sudo vi /etc/security/limits.conf
     *          soft     nproc          65535
     *          hard     nproc          65535
     *          soft     nofile         65535
     *          hard     nofile         65535

sudo vi /etc/security/limits.d/90-nproc.conf
     *          soft     nproc          65535
     *          hard     nproc          65535
     *          soft     nofile         65535
     *          hard     nofile         65535

sudo reboot

ssh-keygen

cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

chmod 700 ~/.ssh

chmod 600 ~/.ssh/authorized_keys

sudo yum install ntp

sudo chkconfig ntpd on

sudo hostname xxxxxxxxxxxx.cloudapp.net

hostname -f

sudo vi /etc/hosts
     100.73.40.57 xxxxxxxxxxxx.cloudapp.net

sudo vi /etc/sysconfig/network
     NETWORKING_IPV6=yes

sudo chkconfig iptables off

sudo /etc/init.d/iptables stop

sudo setenforce 0

sudo vi /etc/selinux/config
     SELINUX=disabled

sudo vi /etc/profile

wget http://public-repo-1.hortonworks.com/ambari/centos6/2.x/updates/2.0.1/ambari.repo

sudo cp ambari.repo /etc/yum.repos.d/

sudo yum install ambari-server

sudo ambari-server setup

sudo ambari-server start

sudo ambari-server status

sudo chkconfig iptables off

sudo /etc/init.d/iptables stop

Access the Ambari web UI at http://ambari.server.host:8080/
Follow the wizard to create your cluster.
It will ask for the list of nodes that you want to set up; enter their FQDNs.

Create keys for your Linux Machine

Step 1: Install a Linux machine.
Step 2: Type openssl. If the command is not found, install it from the command line:
            sudo yum install openssl
Step 3: Next, generate myCert.pem and myPrivateKey.key:
            openssl req -x509 -nodes -days 365 -newkey rsa:2048 -keyout myPrivateKey.key -out myCert.pem
Step 4: Keep pressing Enter as it prompts for each question.
Step 5: Now type pwd on the command line and note the path.
Step 6: Go to that path; it should contain myCert.pem and myPrivateKey.key.

Step 7: Now change the permissions of myPrivateKey.key(important)
            chmod 600 myPrivateKey.key
Step 8: Now, let us create the corresponding myCert.cer
            openssl  x509 -outform der -in /home/brillio/myCert.pem -out /home/brillio/myCert.cer
Step 9: Now we will create a ppk for PuTTY. For this we first need to create an RSA private key that PuTTYgen (the tool that creates ppk files) can understand.
            openssl rsa -in /home/brillio/myPrivateKey.key -out /home/brillio/myPrivateKey_rsa
            chmod 600 /home/brillio/myPrivateKey_rsa
Step 10: Now download PuTTYgen (this will generate the ppk files we need to connect using PuTTY).
Step 11: Load myPrivateKey_rsa in PuTTYgen and save the corresponding private and public keys.

Step 12: Now you should have the following keys.
Step 13: Next step is to download putty.exe.
Step 14: Open putty and provide hostname or IP Address as shown
Step 15: Click on Auth as shown, and provide the .ppk file just created. Click on open and you should be able to login to Linux machine using Putty.

Tuesday, 4 August 2015

Java Code to Kill Process in Linux (VM in Azure cloud)

Step 1 : The code uses the JSch jar file. Make sure you have downloaded it.
http://www.java2s.com/Code/JarDownload/jsch/jsch-0.1.42.jar.zip
Step 2 : Generate a private key for your Linux machine in case you don't have one. You can follow this link for the same.
http://mydailyfindingsit.blogspot.in/2015/08/create-keys-for-your-linux-machine.html
Step 3 : Copy the code below and modify the values of prvkey, host, user and command to match your environment.
Step 4 : Create a main() method that calls the function below.


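import java.io.IOException;
import java.io.InputStream;

import com.jcraft.jsch.ChannelExec;
import com.jcraft.jsch.JSch;
import com.jcraft.jsch.JSchException;
import com.jcraft.jsch.Session;

// Place this method inside a class of your own.
public boolean killPythonProcess(){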

        int count = 5;
     
        JSch jsch = new JSch();
        String prvkey = "E:\\myPrivateKey_rsa";
        String host = "bdp-hdp.cloudapp.net";
        String user = "lokesh";
        String command = "kill -9 `ps -ef | grep python | grep root | awk '{print $2}'`";
        try{
        jsch.addIdentity(prvkey);
        Session session = jsch.getSession(user, host, 22);
        session.setConfig("StrictHostKeyChecking", "no");
        session.connect();
        com.jcraft.jsch.Channel channel = session.openChannel("exec");
        ((ChannelExec) channel).setCommand(command);
        channel.setInputStream(null);
        channel.setOutputStream(System.out);
        ((ChannelExec) channel).setErrStream(System.err);
        InputStream in = channel.getInputStream();
        InputStream error = channel.getExtInputStream();
        channel.connect();
        byte[] tmp = new byte[1024];
        while (count >= 0) {
        while(error.available() > 0){
                     int i = error.read(tmp, 0, 1024);
                     if (i < 0)
                      break;
                     System.out.print(new String(tmp, 0, i));
               }
               while (in.available() > 0) {
                     int i = in.read(tmp, 0, 1024);
                     if (i < 0)
                            break;
                     System.out.print(new String(tmp, 0, i));
                }
               if (channel.isClosed()) {
                    System.out.println("exit-status: " + channel.getExitStatus());
                     break;
               }
               Thread.sleep(1000);
               count--;
        }
         channel.disconnect();
         session.disconnect();

        }
         catch(JSchException | InterruptedException |IOException j){
        return false;
           }

return true;
}


Step 5 : Bingo ! Done.
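
The remote command in the Java code picks its targets by filtering ps -ef output with grep. The same selection can be sketched in Python, which makes it easier to see which PIDs would be killed before anything is actually killed. This is an illustration only: pids_to_kill is a hypothetical name, the sample output is invented, and unlike the grep pipeline it matches the user strictly against the UID column and skips the grep process itself.

```python
def pids_to_kill(ps_output, user, keyword):
    """Return PIDs from `ps -ef` text owned by `user` whose command mentions keyword."""
    pids = []
    for line in ps_output.splitlines():
        fields = line.split(None, 7)  # UID PID PPID C STIME TTY TIME CMD
        if len(fields) < 8 or not fields[1].isdigit():
            continue  # header or malformed line
        uid, pid, command = fields[0], fields[1], fields[7]
        if uid != user or keyword not in command:
            continue
        if command.startswith("grep "):
            continue  # the grep in the pipeline is not a target
        pids.append(int(pid))
    return pids


sample = (
    "UID        PID  PPID  C STIME TTY          TIME CMD\n"
    "root      2101     1  0 10:00 ?        00:00:05 python /opt/app/stream.py\n"
    "lokesh    2150  2000  0 10:01 pts/0    00:00:00 python notebook.py\n"
    "root      2200  2000  0 10:02 pts/0    00:00:00 grep python\n"
)
print(pids_to_kill(sample, "root", "python"))
```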

HP-R205TU External Harddisk Not recognized

If the external hard disk works fine in other laptops and its red light blinks when you plug it in, then you might want to give this a try.

Step 1: Go to Device Manager and click on USB Root Hub.
Step 2: Right click and go to Properties.
Step 3: Click on the Power Management tab and make sure the first option ("Allow the computer to turn off this device to save power") is unchecked.
Step 4: Repeat steps 2 and 3 for every USB hub listed under Universal Serial Bus controllers in Device Manager.
Step 5: Plug in your external hard disk; it should be recognized now!

Oracle Virtual Machine Does Not Start on Windows 10

The workaround I found is:

Uninstall the previous VirtualBox and download version 4.3.12:


http://download.virtualbox.org/virtualbox/4.3.12/VirtualBox-4.3.12-93733-Win.exe

Install it, and it should work fine. 

Make sure you have saved all your previous work !
Cheers!