Monday 20 July 2020

Terraform Update on Mac

Updating Terraform Version:

Download the latest terraform version:
https://releases.hashicorp.com/terraform/0.12.28/terraform_0.12.28_darwin_amd64.zip

Unzip it and move the terraform binary to /usr/local/bin:

sudo mv Desktop/terraform /usr/local/bin

Tuesday 10 December 2019

Set up Java, Scala, Spark & IntelliJ on Mac

------------------
To Install JAVA:
------------------
    https://download.oracle.com/otn-pub/java/jdk/8u201-b09/42970487e3af4f5aa5bca3f542482c60/jdk-8u201-macosx-x64.dmg
    Install the downloaded .dmg
    Create a bash profile in your user directory
    Open terminal -> $ vim .bash_profile
    export JAVA_HOME=$(/usr/libexec/java_home)

    Open a new terminal (or run source .bash_profile) and enter -> echo $JAVA_HOME
    Type java -version

------------------
To Install Scala:
------------------
    https://downloads.lightbend.com/scala/2.11.12/scala-2.11.12.tgz
    Extract
    Open terminal and enter below commands:
    cd Downloads/

    sudo cp -R scala-2.11.12 /usr/local/scala
    cd
    vi .bash_profile
            export PATH=/usr/local/scala/bin:$PATH
    source .bash_profile
    Type scala

------------------
To Install Spark:
------------------
    https://www.apache.org/dyn/closer.lua/spark/spark-2.4.1/spark-2.4.1-bin-hadoop2.7.tgz
    Extract
    Copy the extracted spark folder to:  xxxxxxx/dev/apache-spark/      (you can choose any path; make sure the version in SPARK_HOME below matches the one you downloaded)
    Open terminal
    vi .bash_profile
      export SPARK_HOME=/Users/lokeshnanda/xxxxxxx/dev/apache-spark/spark-2.4.4-bin-hadoop2.7
      export PATH=$PATH:$SPARK_HOME/bin
    source .bash_profile
    Type spark-shell and it should launch the Spark shell

Now install IntelliJ and add the Scala plugin. Add the sbt dependencies for Spark core (the first sbt resolution can take 10-15 minutes).


Enter the following in build.sbt:

name := "TestSpark"
version := "0.1"
scalaVersion := "2.11.12"
// https://mvnrepository.com/artifact/org.apache.spark/spark-core
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.4.4"

// https://mvnrepository.com/artifact/org.apache.spark/spark-sql
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.4.4"

// https://mvnrepository.com/artifact/org.apache.spark/spark-mllib
libraryDependencies += "org.apache.spark" %% "spark-mllib" % "2.4.4" % "runtime"

// https://mvnrepository.com/artifact/org.apache.spark/spark-streaming
libraryDependencies += "org.apache.spark" %% "spark-streaming" % "2.4.4" % "provided"

Monday 6 August 2018

Create Time-Series Data from a CSV

If your data contains a date column and you want to set it as the index in datetime format, use the code below:

import pandas as pd
import numpy as np
import matplotlib.pylab as plt
%matplotlib inline
from matplotlib.pylab import rcParams
from IPython.display import display
rcParams['figure.figsize'] = 15, 6

dateparse = lambda dates: pd.to_datetime(dates, format='%Y-%m-%d')

data = pd.read_csv('I1974A_LCT_DAL_WTR_HST_1792_2016_2018_modified.txt', sep = '|', names = ['CRN_YR_CMA_LCT_CD','LCT_NBR','CAL_DT','MAX_TPU_NBR','NRM_MAX_TPU_NBR','MIN_TPU_NBR','NRM_MIN_TPU_NBR','PIT_QTY','NRM_PIT_QTY','SNO_QTY','NRM_SNO_QTY','WTR_DES_TXT'], parse_dates=['CAL_DT'], index_col='CAL_DT',date_parser=dateparse)
display(data.head())
display(data.dtypes)

selecteddf = data['MAX_TPU_NBR']
display(selecteddf.head())

plt.plot(selecteddf)
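
With CAL_DT as a DatetimeIndex, date-based slicing and resampling work directly. A minimal sketch using the same selecteddf as above (the date range is just an illustration):

display(selecteddf['2016-01-01':'2016-03-31'].head())   # slice by date labels
display(selecteddf.resample('M').mean().head())          # monthly averages of MAX_TPU_NBR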

Friday 3 August 2018

Basic Data Cleaning Techniques in Python using DataFrames

Get Number of NULLS in a DataFrame

import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import display
%matplotlib inline

train_data = pd.read_csv('./data/train.csv')
null_in_train_csv = train_data.isnull().sum()
null_in_train_csv = null_in_train_csv[null_in_train_csv > 0]
null_in_train_csv.sort_values(inplace=True)
null_in_train_csv.plot.bar()
null_in_train_csv
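
The same counts are often easier to judge as a percentage of rows; a small sketch using the variables defined above (null_pct is just an illustrative name):

null_pct = null_in_train_csv / len(train_data) * 100   # percent of rows that are null, per column
print(null_pct.round(1))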

Filter Columns based on Correlation:

sns.heatmap(train_csv.corr(), vmax=.8, square=True);
arr_train_cor = train_csv.corr()['SalePrice']
idx_train_cor_gt0 = arr_train_cor[arr_train_cor > 0].sort_values(ascending=False).index.tolist()
arr_train_cor[idx_train_cor_gt0]
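
To actually keep only those columns (SalePrice stays in the list since it correlates perfectly with itself), one option:

train_cor_filtered = train_csv[idx_train_cor_gt0]   # keep only positively correlated columns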

Find count of zeroes in a Column:

zero_in_masvnrarea = train_meta['MasVnrArea'][train_meta['MasVnrArea'] == 0].index.tolist()

Replace all NULLs with 0 if the % of zeroes in a column is more than 50%

null_in_masvnrarea = train_meta[train_meta['MasVnrArea'].isnull()].index.tolist()
zero_in_masvnrarea = train_meta['MasVnrArea'][train_meta['MasVnrArea'] == 0].index.tolist()
print("How many null value in MasVnrArea? %d / 1460" % len(null_in_masvnrarea))
print("How many zero value in MasVnrArea? %d / 1460" % len(zero_in_masvnrarea))

train_meta.loc[null_in_masvnrarea, 'MasVnrArea'] = 0
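
The 50% check above is read off the printed counts; a sketch that makes it explicit, using the same column and threshold:

zero_pct = (train_meta['MasVnrArea'] == 0).mean() * 100   # percent of zero values in the column
if zero_pct > 50:
    # more than half the values are 0, so treat the missing values as 0 as well
    train_meta['MasVnrArea'] = train_meta['MasVnrArea'].fillna(0)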

Create a new column that is 0 where the value is 0 and 1 where it is non-zero:

nonzero_in_masvnrarea = train_clean['MasVnrArea'][train_clean['MasVnrArea'] != 0].index.tolist()
train_clean['has_MasVnrArea'] = 0
train_clean.loc[nonzero_in_masvnrarea, 'has_MasVnrArea'] = 1

Create a new binned column:

digitize: np.digitize assigns each value to a bin. First define the bin edges, i.e. the boundaries of each range.
In the example below, values in [-1, 1) map to 1, values in [1, 1004) map to 2, and values in [1004, 4000) map to 3.

When you run this, the output column contains 1/2/3 depending on which range the value falls in (see the quick check after the snippet).

bins_totalbsmtsf = [-1, 1, 1004, 4000]
train_clean['binned_TotalBsmtSF'] = np.digitize(train_clean['TotalBsmtSF'], bins_totalbsmtsf)
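
A quick check of what digitize returns for a few arbitrary sample values:

sample_values = np.array([0, 500, 2000])
print(np.digitize(sample_values, bins_totalbsmtsf))   # -> [1 2 3]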

Display in IPython:

from IPython.display import display

-----------------------------------------------------------------------------

Merge the test and train data (after removing the to-be-predicted column):

concat_set = pd.concat((train_data, pd.read_csv('test.csv'))).reset_index()

Create a random Age list based on the column's mean and std, and use it to fill the missing records:

age_avg = concat_set['Age'].mean()
age_std = concat_set['Age'].std()

age_null_count = concat_set['Age'].isnull().sum()
age_null_random_list = np.random.randint(age_avg - age_std, age_avg + age_std, size = age_null_count)
concat_set.loc[concat_set['Age'].isnull(), 'Age'] = age_null_random_list

concat_set['Age'] = concat_set['Age'].astype(int)

Create a Categorical Column:

concat_set['CategoricalAge'] = pd.cut(concat_set['Age'], 5)
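
To inspect the five equal-width intervals pd.cut produced and how many rows landed in each:

print(concat_set['CategoricalAge'].cat.categories)            # the five interval edges
print(concat_set['CategoricalAge'].value_counts().sort_index())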

Create a new column by splitting an existing column and extracting values from it:

concat_set['Title'] = concat_set['Name'].map(lambda name:name.split(',')[1].split('.')[0].strip())

concat_set['Title'].value_counts()
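
For a single made-up name in the same "Last, Title. First" format, the split works like this:

name = 'Braund, Mr. Owen Harris'                  # made-up example
print(name.split(',')[1].split('.')[0].strip())   # -> Mr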


Fillna example:

concat_set['Fare'].fillna((concat_set['Fare'].median()), inplace=True)

If-else one-liner:

concat_set['IsAlone'] = 0

concat_set.loc[concat_set['FamilySize'] == 1, 'IsAlone'] = 1

Use pandas' Categorical function to convert columns to numeric codes:

for feature in concat_set.keys():
    concat_set[feature] = pd.Categorical(concat_set[feature]).codes
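
Categorical(...).codes replaces each distinct value with an integer code; a tiny standalone example with made-up values:

s = pd.Series(['S', 'C', 'Q', 'S'])   # made-up values
print(pd.Categorical(s).categories)   # Index(['C', 'Q', 'S'], dtype='object')
print(pd.Categorical(s).codes)        # [2 0 1 2]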

Display all the columns in a df as output:

pd.set_option('display.max_columns', None)



Tuesday 15 May 2018

Simple threaded execution to download JSON data

import json
import time

import requests
from datetime import datetime
from multiprocessing.pool import ThreadPool

requests.packages.urllib3.disable_warnings()
#comment_list_path = 'C:\\Users\\lnanda\\Desktop\\Lokesh\\inmoment\\inmoment_main\\inmoment_weekly_surveyids\\commentlist'
comment_list_path = 'C:\\Users\\lnanda\\Desktop\\Lokesh\\inmoment\\inmoment_main\\inmoment_weekly_surveyids\\commentlist_nov2017'
commentidsdeltapath = 'C:\\Users\\lnanda\\Desktop\\Lokesh\\inmoment\\inmoment_main\\inmoment_weekly_surveyids\\commentdeltalist_nov2017'
#commentidsdeltapath = 'C:\\Users\\lnanda\\Desktop\\Lokesh\\inmoment\\inmoment_main\\inmoment_weekly_surveyids\\commentdeltalist'
#raw_comment_path = 'C:\\Users\\lnanda\\Desktop\\Lokesh\\inmoment\\inmoment_main\\inmoment_weekly_data\\comment_2017_2018'
raw_comment_path = 'C:\\Users\\lnanda\\Desktop\\Lokesh\\inmoment\\inmoment_main\\inmoment_weekly_data\\comment_nov2017'
access_token = 'xxxxxxxxxxxxxxxxxxxxxxxxxxxxx'

def read_comment_ids(comment_list_path):
    comment_ids = open(comment_list_path).read().splitlines()
    return comment_ids
    
def get_updated_comment_list(comment_ids_list,comment_ids_delta_list):
    diff_list = list(set(comment_ids_list)-set(comment_ids_delta_list))
    return diff_list

def update_comment_delta(commentidsdeltapath, commentId):
    with open(commentidsdeltapath, "a") as file:
        file.write(str(commentId))
        file.write('\n')

def getDateTime():
    return str(datetime.now()).replace(' ','')

def writeComment(text, commentId):
    with open(raw_comment_path, "a") as file:
        json.dump(text, file)
        file.write('\n')
    update_comment_delta(commentidsdeltapath, commentId)
            
def getComment(commentId):
    headers = {'content-type': 'application/json'}
    headers['Authorization'] = 'OAuth '+access_token
    get_comment = "https://www.xxxxxxxxxxxxt.com/api/rest/1.0/comment/"+str(commentId)
    try:
        r = requests.get(get_comment, headers=headers, verify=False)
    except Exception as err:
        print(str(err))
        time.sleep(30)
        print('Sleeping for 30 sec..........')
        r = requests.get(get_comment, headers=headers, verify=False)
    if(r.status_code == 200):
        val = json.loads(r.text)
        '''
        with open(raw_comment_path, "a") as file:
            json.dump(val, file)
            file.write('\n')
        update_comment_delta(commentidsdeltapath, commentId)
        '''
        return val, commentId
    # return a (None, id) pair on failure so the unpacking loop below does not break
    return None, commentId
            
try:
    comment_ids_list = read_comment_ids(comment_list_path)
except Exception as e:
    print('There is an error reading from ' + str(comment_list_path))
    
try:
    comment_ids_delta_list = read_comment_ids(commentidsdeltapath)
except Exception as e:
    print('There is an error reading from ' + str(commentidsdeltapath)+ ' This is first load')
    comment_ids_delta_list = []

diff_commentid_list_raw = get_updated_comment_list(comment_ids_list,comment_ids_delta_list)
    
count_diff_id = len(diff_commentid_list_raw)
if count_diff_id == 0:
    print("All comment id's done")
else:
    #for commentId in diff_commentid_list_raw:
        #getComment(access_token,commentId)
    results = ThreadPool(20).imap_unordered(getComment, diff_commentid_list_raw)
    for text, commentId in results:
        if text is not None:
            writeComment(text, commentId)
        
try:
    comment_ids_delta_list = read_comment_ids(commentidsdeltapath)
    diff_commentid_list_raw = get_updated_comment_list(comment_ids_list,comment_ids_delta_list)
    count_diff_id = len(diff_commentid_list_raw)
    if count_diff_id == 0:
        print('DOne !!!!!!')
    else:
        print('[main] - Count of diff id is not equal to 0, NEED TO RERUN')
except Exception as e:
    print(str(e))

Monday 19 March 2018

Basic Encryption and Decryption in Python

pip install pycrypto


# -*- coding: utf-8 -*-
'''
Created on Mon Mar 12 16:33:10 2018

@author: lnanda

Script to generate an encrypted text which can be passed to a properties file as a basic security mechanism
NS02HJxCa2rH5sbJEVw7UVZcOIv89eHFM7hIFgYDTD8=

NOTE: Key must be 16 characters long

Sample call:
C:\\Users\\lnanda\\Desktop\\Lokesh\\inmoment\\inmoment_main\\inmoment_codebase>python encrypt_me.py 1102 8050614890805061

'''

from Crypto.Cipher import AES
import base64
import sys

def encrypt_me(secret_msg, key):
    # left-pad the message to 48 characters (a multiple of the 16-byte AES block size)
    msg_text = secret_msg.rjust(48)
    secret_key = key
    # ECB mode with the 16-character key; encode to bytes so it also works with PyCrypto on Python 3
    cipher = AES.new(secret_key.encode("utf-8"), AES.MODE_ECB)
    encoded = base64.b64encode(cipher.encrypt(msg_text.encode("utf-8")))
    return encoded.strip().decode("utf-8")

if __name__ == '__main__':
    secret_msg = sys.argv[1]
    key = sys.argv[2]
    encoded_value = encrypt_me(secret_msg, key)
    print('Please find the encoded value below:')
    print(encoded_value)


-------------------------------------------------------------------

# -*- coding: utf-8 -*-
"""
Created on Mon Mar 12 16:33:10 2018

@author: lnanda

Script to decrypt the encrypted text generated by encrypt_me.py

NOTE: Key must be 16 characters long
"""

from Crypto.Cipher import AES
import base64
import sys

def decrypt_me(encoded_msg, key):
    secret_key = key
    # same 16-character key and ECB mode as encrypt_me.py; encode to bytes for PyCrypto on Python 3
    cipher = AES.new(secret_key.encode("utf-8"), AES.MODE_ECB)
    decoded = cipher.decrypt(base64.b64decode(encoded_msg))
    # strip the left-padding added by rjust() during encryption
    return decoded.strip().decode("utf-8")

if __name__ == '__main__':
    encoded_msg = sys.argv[1]
    key = sys.argv[2]
    decoded_value = decrypt_me(encoded_msg, key)
    print(decoded_value)
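
A quick round-trip check, assuming both functions have been imported from the two scripts (the message and key below are the ones from the sample call):

key = '8050614890805061'            # 16-character key from the sample call
token = encrypt_me('1102', key)
print(token)
print(decrypt_me(token, key))       # -> 1102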

Wednesday 20 September 2017

Work with Spark on Windows

Quick and easy guide to set up Spark on Windows

  • Download and install the JDK: jdk-8u144-windows-x64.exe (Windows x64, 197.78 MB)
  • Open command prompt and type java -version    
  • The above should give you a response. Java installation is done now.


  • Now, create the below folder structure in C:\
C:\Hadoop\bin
Copy the downloaded winutils.exe into the above path.
  • Create a new system env variable - "HADOOP_HOME" and its value "C:\Hadoop" 
  • Now, open command line terminal as administrator, and enter the below commands:
C:\WINDOWS\system32>cd \
C:\>mkdir tmp
C:\>cd tmp
C:\tmp>mkdir hive
C:\tmp>c:\hadoop\bin\winutils chmod 777 \tmp\hive
C:\tmp>

  • Now we need to download spark, https://spark.apache.org/downloads.html
  • Once this is done, extract the downloaded file using 7zip. (You need to extract twice, as it is tar.gz)
  • Now create a folder "spark" in C:\
  • Copy the contents of "C:\Users\Lokesh\Downloads\spark-2.2.0-bin-hadoop2.7\spark-2.2.0-bin-hadoop2.7\" to C:\spark\
  • Now go to system environment variables, and create a variable "SPARK_HOME" and its value as "C:\spark"
  • Now, it is time to test spark, open cmd as administrator and type
    • C:\spark\bin\spark-shell