Simple and effective Workarounds !

Monday, 20 July 2020

Terraform Update in Mac

Updating Terraform Version:

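Download the Terraform version you need (0.12.28 at the time of writing):
https://releases.hashicorp.com/terraform/0.12.28/terraform_0.12.28_darwin_amd64.zip

Unzip the archive (here it was extracted to Desktop), then move the terraform binary to /usr/local/bin:

sudo mv Desktop/terraform /usr/local/bin

Run terraform version to confirm the update.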

Tuesday, 10 December 2019

Setup Java, Scala, Spark & Intellij in Mac

------------------
To Install JAVA:
------------------
https://download.oracle.com/otn-pub/java/jdk/8u201-b09/42970487e3af4f5aa5bca3f542482c60/jdk-8u201-macosx-x64.dmg
Open the downloaded .dmg and run the installer
Create a bash profile in your home directory
Open terminal -> $ vim ~/.bash_profile
Add the line: export JAVA_HOME=$(/usr/libexec/java_home)

Open a new terminal (or run source ~/.bash_profile) and enter -> echo $JAVA_HOME
Type java -version to confirm the installation

------------------
To Install Scala:
------------------
https://downloads.lightbend.com/scala/2.11.12/scala-2.11.12.tgz
Extract
Open terminal and enter below commands:
cd Downloads/

sudo cp -R scala-2.11.12 /usr/local/scala
cd
vi .bash_profile
Add the line: export PATH=/usr/local/scala/bin:$PATH
source .bash_profile
Type scala to open the Scala REPL

------------------
To Install Spark:
------------------
https://www.apache.org/dyn/closer.lua/spark/spark-2.4.1/spark-2.4.1-bin-hadoop2.7.tgz
Extract
Copy the extracted spark folder to: xxxxxxx/dev/apache-spark/ (You can choose any path)
Open terminal
vi .bash_profile
Add the two lines below (the folder name must match the Spark version you downloaded; this example uses 2.4.4):
export SPARK_HOME=/Users/lokeshnanda/xxxxxxx/dev/apache-spark/spark-2.4.4-bin-hadoop2.7
export PATH=$PATH:$SPARK_HOME/bin
source .bash_profile
Type spark-shell and it should open the Spark shell
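
After editing .bash_profile, a quick way to confirm that the three tools are visible on PATH is a small Python check. This is only a sketch: the helper name find_tools is made up for illustration, and it assumes the commands are named java, scala and spark-shell as in the steps above.

```python
import shutil


def find_tools(tools):
    """Map each command name to its resolved path, or None if it is not on PATH."""
    return {tool: shutil.which(tool) for tool in tools}


# Command names taken from the setup steps above
status = find_tools(["java", "scala", "spark-shell"])
for tool, path in status.items():
    print(f"{tool}: {path or 'not found on PATH'}")
```

If a tool shows as not found, open a new terminal or run source .bash_profile again before re-checking.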

Now install IntelliJ and add the Scala plugin. Then add the sbt dependencies for Spark (resolving them can take 10-15 minutes).

Enter below in build.sbt:

name := "TestSpark"
version := "0.1"
scalaVersion := "2.11.12"
// https://mvnrepository.com/artifact/org.apache.spark/spark-core
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.4.4"

// https://mvnrepository.com/artifact/org.apache.spark/spark-sql
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.4.4"

// https://mvnrepository.com/artifact/org.apache.spark/spark-mllib
libraryDependencies += "org.apache.spark" %% "spark-mllib" % "2.4.4" % "runtime"

// https://mvnrepository.com/artifact/org.apache.spark/spark-streaming
libraryDependencies += "org.apache.spark" %% "spark-streaming" % "2.4.4" % "provided"

Monday, 6 August 2018

Create a Time-Series data from CSV

If your data contains a date column and you want to parse it as datetime and use it as the index, use the code below:

import pandas as pd
import numpy as np
import matplotlib.pylab as plt
%matplotlib inline
from matplotlib.pylab import rcParams
from IPython.display import display
rcParams['figure.figsize'] = 15, 6

# Read the pipe-separated file, then parse CAL_DT as datetime and use it as the index.
# (pd.datetime was removed in pandas 2.0 and the date_parser argument is deprecated.)
data = pd.read_csv('I1974A_LCT_DAL_WTR_HST_1792_2016_2018_modified.txt', sep = '|', names = ['CRN_YR_CMA_LCT_CD','LCT_NBR','CAL_DT','MAX_TPU_NBR','NRM_MAX_TPU_NBR','MIN_TPU_NBR','NRM_MIN_TPU_NBR','PIT_QTY','NRM_PIT_QTY','SNO_QTY','NRM_SNO_QTY','WTR_DES_TXT'])
data['CAL_DT'] = pd.to_datetime(data['CAL_DT'], format='%Y-%m-%d')
data = data.set_index('CAL_DT')
display(data.head())
display(data.dtypes)

selecteddf = data['MAX_TPU_NBR']
display(selecteddf.head())

plt.plot(selecteddf)
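
To try the idea without the original file, here is a minimal, self-contained sketch on a made-up three-row sample. The values and the reduced two-column layout are assumptions for illustration only; the column names CAL_DT and MAX_TPU_NBR come from the code above.

```python
import io

import pandas as pd

# Hypothetical pipe-separated sample standing in for the real file
sample = io.StringIO(
    "2016-01-01|55\n"
    "2016-01-02|61\n"
    "2016-01-03|58\n"
)

ts = pd.read_csv(sample, sep="|", names=["CAL_DT", "MAX_TPU_NBR"])
ts["CAL_DT"] = pd.to_datetime(ts["CAL_DT"], format="%Y-%m-%d")
ts = ts.set_index("CAL_DT")

# With a DatetimeIndex, rows can be selected by date strings
first_two_days = ts.loc["2016-01-01":"2016-01-02", "MAX_TPU_NBR"]
print(first_two_days)
```

Once the index is a DatetimeIndex, plotting the selected column puts the dates on the x-axis automatically.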

Friday, 3 August 2018

Basic Data Cleaning Techniques in Python using DataFrames

Get Number of NULLS in a DataFrame

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import display
%matplotlib inline

train_data = pd.read_csv('./data/train.csv')
null_in_train_csv = train_data.isnull().sum()
null_in_train_csv = null_in_train_csv[null_in_train_csv > 0]
null_in_train_csv.sort_values(inplace=True)
null_in_train_csv.plot.bar()
null_in_train_csv
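
The same null count can be tried on a tiny frame. This is a sketch with invented values; the column names are only illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical data: two columns contain missing values, one does not
df = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, np.nan, 60.0],
    "MasVnrArea": [196.0, 0.0, np.nan, 0.0],
    "SalePrice": [208500, 181500, 223500, 140000],
})

null_counts = df.isnull().sum()                            # nulls per column
null_counts = null_counts[null_counts > 0].sort_values()   # keep only columns with nulls
print(null_counts)
```

Columns without any missing values (SalePrice here) drop out of the result, so the bar plot only shows the columns that need attention.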

Filter Columns based on Correlation:

# numeric_only=True skips text columns (needed on recent pandas versions)
corr_matrix = train_data.corr(numeric_only=True)
sns.heatmap(corr_matrix, vmax=.8, square=True);
arr_train_cor = corr_matrix['SalePrice']
idx_train_cor_gt0 = arr_train_cor[arr_train_cor > 0].sort_values(ascending=False).index.tolist()
arr_train_cor[idx_train_cor_gt0]
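
A small sketch of the same filter on invented numbers, where one column rises with the target and one falls. The column names GrLivArea and Age are assumptions for illustration.

```python
import pandas as pd

# Hypothetical data: GrLivArea moves with SalePrice, Age moves against it
df = pd.DataFrame({
    "GrLivArea": [1000, 1500, 2000, 2500],
    "Age": [40, 30, 20, 10],
    "SalePrice": [100, 150, 200, 250],
})

target_corr = df.corr()["SalePrice"]
# Keep only the columns that are positively correlated with the target
positive = target_corr[target_corr > 0].sort_values(ascending=False).index.tolist()
print(positive)
```

The target column itself always appears in the list because its correlation with itself is 1.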

Find count of zeroes in a Column:

zero_in_masvnrarea = train_meta['MasVnrArea'][train_meta['MasVnrArea'] == 0].index.tolist()

Replace all NULLs with 0 if more than 50% of the values in a column are 0:

null_in_masvnrarea = train_meta[train_meta['MasVnrArea'].isnull()].index.tolist()
zero_in_masvnrarea = train_meta['MasVnrArea'][train_meta['MasVnrArea'] == 0].index.tolist()
print("How many null value in MasVnrArea? %d / 1460" % len(null_in_masvnrarea))
print("How many zero value in MasVnrArea? %d / 1460" % len(zero_in_masvnrarea))

# Use .loc to assign on the DataFrame itself and avoid chained-assignment problems
train_meta.loc[null_in_masvnrarea, 'MasVnrArea'] = 0

Create a new column that is 0 where the value is 0 and 1 otherwise:

nonzero_in_masvnrarea = train_clean['MasVnrArea'][train_clean['MasVnrArea'] != 0].index.tolist()
train_clean['has_MasVnrArea'] = 0
train_clean.loc[nonzero_in_masvnrarea, 'has_MasVnrArea'] = 1
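
The same flag can also be derived in one vectorized step. This is an alternative sketch on made-up values, not the approach used above.

```python
import pandas as pd

# Hypothetical values for the MasVnrArea column
df = pd.DataFrame({"MasVnrArea": [196.0, 0.0, 162.0, 0.0]})

# The comparison gives True/False; astype(int) turns that into 1/0
df["has_MasVnrArea"] = (df["MasVnrArea"] != 0).astype(int)
print(df)
```

Note that a missing value also compares as "not equal to 0", so fill the NULLs first if they should count as 0.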

Create a new binned column:

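np.digitize assigns each value to a bin. First define the bin edges, i.e. the boundaries of each range.
In the example below: -1 <= x < 1 -> 1, 1 <= x < 1004 -> 2, 1004 <= x < 4000 -> 3

After running this, the output column holds 1, 2 or 3 depending on which range the value falls in (values below -1 would get 0, values of 4000 or more would get 4).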

bins_totalbsmtsf = [-1, 1, 1004, 4000]

train_clean['binned_TotalBsmtSF'] = np.digitize(train_clean['TotalBsmtSF'], bins_totalbsmtsf)
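
To see how values on the bin edges are handled, here is a short sketch with the same bin edges and a few hand-picked values (the values are invented for illustration).

```python
import numpy as np

bins = [-1, 1, 1004, 4000]
values = np.array([0, 1, 500, 1004, 3999])

# Each value x gets the index i such that bins[i-1] <= x < bins[i]
binned = np.digitize(values, bins)
print(binned.tolist())  # [1, 2, 2, 3, 3]
```

A value that sits exactly on an edge (1 or 1004 here) goes into the higher bin, because the right side of each range is open by default.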

Display in Ipython:

from IPython.display import display

-----------------------------------------------------------------------------

Merge train and test data (rows from test.csv get NaN in the to-be-predicted column):

concat_set = pd.concat((train_data, pd.read_csv('test.csv'))).reset_index()

Fill missing Age values with random integers drawn between (mean - std) and (mean + std):

age_avg = concat_set['Age'].mean()
age_std = concat_set['Age'].std()

age_null_count = concat_set['Age'].isnull().sum()
# randint needs integer bounds, so cast them explicitly
age_null_random_list = np.random.randint(int(age_avg - age_std), int(age_avg + age_std), size = age_null_count)
# Assign through .loc to avoid chained-assignment problems
concat_set.loc[concat_set['Age'].isnull(), 'Age'] = age_null_random_list

concat_set['Age'] = concat_set['Age'].astype(int)

Create a Categorical Column:

concat_set['CategoricalAge'] = pd.cut(concat_set['Age'], 5)

Create a new column by splitting a column and extracting values from it:

concat_set['Title'] = concat_set['Name'].map(lambda name:name.split(',')[1].split('.')[0].strip())

concat_set['Title'].value_counts()
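
A short sketch of what the lambda does, assuming names in the "Surname, Title. First names" layout. The three sample names are only illustrative.

```python
import pandas as pd

# Hypothetical names in "Surname, Title. First names" form
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Cumings, Mrs. John Bradley",
])

# Take the part after the comma, then the part before the first full stop
titles = names.map(lambda name: name.split(",")[1].split(".")[0].strip())
print(titles.tolist())  # ['Mr', 'Miss', 'Mrs']
```

A name without a comma would raise an IndexError here, so this only works when every row follows the same layout.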

Fillna example:

# Assign the result back instead of using inplace=True on a selected column
concat_set['Fare'] = concat_set['Fare'].fillna(concat_set['Fare'].median())

If else one liner:

concat_set['IsAlone'] = 0

concat_set.loc[concat_set['FamilySize'] == 1, 'IsAlone'] = 1

Use pd.Categorical to convert every column to numeric codes:

for feature in concat_set.keys():
    concat_set[feature] = pd.Categorical(concat_set[feature]).codes
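
A tiny sketch of how the codes are assigned, using two invented text columns.

```python
import pandas as pd

# Hypothetical text columns
df = pd.DataFrame({
    "Sex": ["male", "female", "male"],
    "Embarked": ["S", "C", "S"],
})

# Categories are sorted, then numbered from 0: female=0, male=1 and C=0, S=1
for feature in df.keys():
    df[feature] = pd.Categorical(df[feature]).codes
print(df)
```

Missing values get the code -1, so check for them before treating the codes as ordinary categories.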

Display all the columns in a df as output:

pd.set_option('display.max_columns', None)

Tuesday, 15 May 2018

Simple Threading execution to download json data

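import json
import time
from datetime import datetime
from multiprocessing.pool import ThreadPool

import requests

# Suppress the warning raised because the requests below use verify=False
requests.packages.urllib3.disable_warnings()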
#comment_list_path = 'C:\\Users\\lnanda\\Desktop\\Lokesh\\inmoment\\inmoment_main\\inmoment_weekly_surveyids\\commentlist'
comment_list_path = 'C:\\Users\\lnanda\\Desktop\\Lokesh\\inmoment\\inmoment_main\\inmoment_weekly_surveyids\\commentlist_nov2017'
commentidsdeltapath = 'C:\\Users\\lnanda\\Desktop\\Lokesh\\inmoment\\inmoment_main\\inmoment_weekly_surveyids\\commentdeltalist_nov2017'
#commentidsdeltapath = 'C:\\Users\\lnanda\\Desktop\\Lokesh\\inmoment\\inmoment_main\\inmoment_weekly_surveyids\\commentdeltalist'
#raw_comment_path = 'C:\\Users\\lnanda\\Desktop\\Lokesh\\inmoment\\inmoment_main\\inmoment_weekly_data\\comment_2017_2018'
raw_comment_path = 'C:\\Users\\lnanda\\Desktop\\Lokesh\\inmoment\\inmoment_main\\inmoment_weekly_data\\comment_nov2017'
access_token = 'xxxxxxxxxxxxxxxxxxxxxxxxxxxxx'

def read_comment_ids(comment_list_path):
    # The file holds one comment id per line
    comment_ids = open(comment_list_path).read().splitlines()
    return comment_ids

def get_updated_comment_list(comment_ids_list,comment_ids_delta_list):
    # Ids that are in the full list but not yet in the "done" (delta) list
    diff_list = list(set(comment_ids_list)-set(comment_ids_delta_list))
    return diff_list

def update_comment_delta(commentidsdeltapath, commentId):
    # Record the id as downloaded
    with open(commentidsdeltapath, "a") as file:
        file.write(str(commentId))
        file.write('\n')

def getDateTime():
    return str(datetime.now()).replace(' ','')

def writeComment(text, commentId):
    # Append the comment as one JSON line, then mark its id as done
    with open(raw_comment_path, "a") as file:
        json.dump(text, file)
        file.write('\n')
    update_comment_delta(commentidsdeltapath, commentId)

def getComment(commentId):
    headers = {'content-type': 'application/json'}
    headers['Authorization'] = 'OAuth '+access_token
    get_comment = "https://www.xxxxxxxxxxxxt.com/api/rest/1.0/comment/"+str(commentId)
    try:
        r = requests.get(get_comment, headers=headers, verify=False)
    except Exception as err:
        # Retry once after a 30 second pause
        print(str(err))
        print('Sleeping for 30 sec..........')
        time.sleep(30)
        r = requests.get(get_comment, headers=headers, verify=False)
    if(r.status_code == 200):
        val = json.loads(r.text)
        '''
        with open(raw_comment_path, "a") as file:
            json.dump(val, file)
            file.write('\n')
        update_comment_delta(commentidsdeltapath, commentId)
        '''
        return val, commentId
    # Any other status code falls through and returns None

try:
    comment_ids_list = read_comment_ids(comment_list_path)
except Exception as e:
    print('There is an error reading from ' + str(comment_list_path))
    raise  # nothing can be downloaded without the id list

try:
    comment_ids_delta_list = read_comment_ids(commentidsdeltapath)
except Exception as e:
    print('There is an error reading from ' + str(commentidsdeltapath)+ ' This is first load')
    comment_ids_delta_list = []

diff_commentid_list_raw = get_updated_comment_list(comment_ids_list,comment_ids_delta_list)

count_diff_id = len(diff_commentid_list_raw)
if count_diff_id == 0:
    print("All comment id's done")
else:
    #for commentId in diff_commentid_list_raw:
        #getComment(access_token,commentId)
    # Download with 20 worker threads; results arrive in completion order
    results = ThreadPool(20).imap_unordered(getComment, diff_commentid_list_raw)
    for result in results:
        if result is None:
            # No data for this id (e.g. non-200 response); it stays pending for the next run
            continue
        text, commentId = result
        writeComment(text, commentId)

try:
    comment_ids_delta_list = read_comment_ids(commentidsdeltapath)
    diff_commentid_list_raw = get_updated_comment_list(comment_ids_list,comment_ids_delta_list)
    count_diff_id = len(diff_commentid_list_raw)
    if count_diff_id == 0:
        print('Done !!!!!!')
    else:
        print('[main] - Count of diff id is not equal to 0, NEED TO RERUN')
except Exception as e:
    print(str(e))
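
The thread pool pattern can be tried without any network access. In this minimal sketch the fetch function is a made-up stand-in for getComment, and the ids are invented.

```python
import time
from multiprocessing.pool import ThreadPool


def fetch(item_id):
    """Hypothetical stand-in for a network call such as getComment."""
    time.sleep(0.01)  # simulate waiting on the server
    return {"id": item_id}, item_id


pending_ids = ["a1", "a2", "a3", "a4"]
done_ids = []

# Four worker threads; results are yielded as soon as each call finishes
with ThreadPool(4) as pool:
    for payload, item_id in pool.imap_unordered(fetch, pending_ids):
        done_ids.append(item_id)

# Same bookkeeping as the delta file: anything left here needs a rerun
remaining = set(pending_ids) - set(done_ids)
print(sorted(done_ids), remaining)
```

Threads suit this kind of job because the work is mostly waiting on HTTP responses. Writing the results from the main loop, as the script above does, keeps the file writes in a single thread.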

Monday, 19 March 2018

Basic Encryption and Decryption in Python

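pip install pycryptodome

(pycryptodome is a maintained drop-in replacement for the original pycrypto package; the imports below stay the same.)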

# -*- coding: utf-8 -*-
'''
Created on Mon Mar 12 16:33:10 2018

@author: lnanda

Script to encrypt a value so that it can be passed to a properties file as a basic security mechanism
NS02HJxCa2rH5sbJEVw7UVZcOIv89eHFM7hIFgYDTD8=

NOTE: Key must be 16 characters long (AES also accepts 24 or 32)

Sample call:
C:\\Users\\lnanda\\Desktop\\Lokesh\\inmoment\\inmoment_main\\inmoment_codebase>python encrypt_me.py 1102 8050614890805061

'''

from Crypto.Cipher import AES
import base64
import sys

def encrypt_me(secret_msg, key):
    # Left-pad with spaces to 48 characters so the length is a multiple of the 16-byte AES block
    msg_text = secret_msg.rjust(48)
    secret_key = key.encode("utf-8")  # AES.new expects bytes
    cipher = AES.new(secret_key,AES.MODE_ECB)
    encoded = base64.b64encode(cipher.encrypt(msg_text.encode("utf-8")))
    return encoded.strip().decode("utf-8")

if __name__ == '__main__':
    secret_msg = sys.argv[1]
    key = sys.argv[2]
    encoded_value = encrypt_me(secret_msg, key)
    print('Please find the encoded value below:')
    print(encoded_value)

-------------------------------------------------------------------

# -*- coding: utf-8 -*-
"""
Created on Mon Mar 12 16:33:10 2018

@author: lnanda

Script to decrypt a value that was produced by encrypt_me.py

NOTE: Key must be 16 characters long (AES also accepts 24 or 32)
"""

from Crypto.Cipher import AES
import base64
import sys

def decrypt_me(encoded_msg, key):
    secret_key = key.encode("utf-8")  # AES.new expects bytes
    cipher = AES.new(secret_key,AES.MODE_ECB)
    decoded = cipher.decrypt(base64.b64decode(encoded_msg))
    # strip() removes the space padding added before encryption
    return decoded.strip().decode("utf-8")

if __name__ == '__main__':
    encoded_msg = sys.argv[1]
    key = sys.argv[2]
    decoded_value = decrypt_me(encoded_msg, key)
    print(decoded_value)
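
The padding and Base64 steps around the cipher can be looked at on their own. The sketch below uses only the standard library and does no encryption at all; the helper names pad_message and unpad_message are made up for illustration.

```python
import base64

BLOCK_SIZE = 16  # AES works on 16-byte blocks


def pad_message(secret_msg, width=48):
    """Left-pad with spaces so the byte length is a multiple of the block size."""
    return secret_msg.rjust(width).encode("utf-8")


def unpad_message(padded):
    """Remove the space padding again."""
    return padded.strip().decode("utf-8")


padded = pad_message("1102")
# Base64 turns raw bytes into text that is safe to store in a properties file
encoded = base64.b64encode(padded).decode("utf-8")
restored = unpad_message(base64.b64decode(encoded))
print(len(padded), restored)  # 48 1102
```

Because the padding is plain spaces, a message that itself starts or ends with whitespace would lose that whitespace after strip().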

Wednesday, 20 September 2017

Work with Spark in Windows

A quick and easy guide to set up Spark on Windows

Go to http://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html and download the JDK. Below is the sample entry for 64-bit Windows:

Windows x64 | 197.78 MB | jdk-8u144-windows-x64.exe

Open command prompt and type java -version
The above should give you a response. Java installation is done now.

Now open http://www.7-zip.org/, then download and install 7-Zip.

Next, open https://github.com/steveloughran/winutils/blob/master/hadoop-2.6.0/bin/winutils.exe and click on the download button.

Now, create the below folder structure in C:\

C:\Hadoop\bin
Copy the downloaded winutils.exe into the above path.

Create a new system environment variable "HADOOP_HOME" with the value "C:\Hadoop"
Now, open a command line terminal as administrator, and enter the below commands:

C:\WINDOWS\system32>cd \
C:\>mkdir tmp
C:\>cd tmp
C:\tmp>mkdir hive
C:\tmp>c:\hadoop\bin\winutils chmod 777 \tmp\hive
C:\tmp>

Now we need to download Spark from https://spark.apache.org/downloads.html, for example:

https://d3kbcqa49mib13.cloudfront.net/spark-2.2.0-bin-hadoop2.7.tgz

Once this is done, extract the downloaded file using 7-Zip. (You need to extract twice, as it is a tar.gz: first the gzip layer, then the tar archive.)
Now create a folder "spark" in C:\
Copy the contents of "C:\Users\Lokesh\Downloads\spark-2.2.0-bin-hadoop2.7\spark-2.2.0-bin-hadoop2.7\" to C:\spark\
Now go to system environment variables, and create a variable "SPARK_HOME" with the value "C:\spark"
Now it is time to test Spark: open cmd as administrator and type

C:\spark\bin\spark-shell