I am an experienced working professional with five years of overall experience across domains such as e-commerce and non-banking financial institutions.
Business Intelligence Engineer, Wipro (Charles Schwab)
Data Analyst, Embibe
Sales Data Analyst, Skill Lync
Risk Analyst, Amazon
Business Analyst, Shriram Transport Finance Co Ltd

Skills:
MySQL
Python
Tableau
Power BI
Advanced Excel
Microsoft Azure SQL Database
Azure Data Lake Storage Gen2 (ADLS)
Azure Synapse
I am certified in professional data analytics, with more than six years of experience in the data science and analytics domain, and I have worked across multiple industries.

I started my career in the e-commerce segment at Amazon as a business analyst. My job role was to detect fraudulent transactions and protect the company from large losses. I would typically write SQL scripts to fetch the transactional data, manually review the high-value orders, and take action based on that.

After that I moved to Shriram Transport Finance Company as a risk analyst, where my role was again to identify fraud patterns, optimize the portfolio, and generate insights from the portfolio on customer acquisition through referrals, working with the B2B and GTM teams on weekly and monthly reports.

Next I joined Skill Lync as a data analyst, where my role became more technical. I usually worked with large-scale data there, using an AWS-hosted database and AWS Lambda for scripting. We helped management generate more revenue by analyzing user engagement on the leads. I built the weekly, monthly, and quarterly reports on revenue and people's performance, and I tracked user behaviour: which courses they were searching for, how long they spent on the website, where they spent their time, and, if they dropped off, what the pain point behind the drop-off was. Based on that I built drop-off recommendations.

Continuing to grow, I joined Embibe, again as a data analyst, working closely with the data engineering team and helping them with pipeline building and maintenance. The cloud service I used there was Azure. I helped the organization build many dashboards; the main one was a business optimization dashboard with more than 50 KPIs indicating trends from past data and helping stakeholders generate insights on the forecasting side, for example what pattern of users come onto the platform and how long they keep coming back. We also implemented A/B testing, for example placing certain tags on the website and measuring how they and the different events performed. I created a sign-up channel analysis dashboard, which gave the organization near real-time data on the highest-performing banners and the parameters behind them, so that we could optimize our courses according to those click rates, and I also built a drop-off recommendation model. Apart from that, I produced the daily, weekly, and monthly reports for the GTM team, highlighting the customers' pain points.
Clustering models are among the most important ones I have worked with; I have used clustering techniques mainly to optimize marketing campaigns. In Python I generally use the scikit-learn package to import the algorithms (I have also worked with k-nearest neighbours and Naive Bayes from the same library). What clustering does is collect data points based on customer usage. For example, a customer is from a certain location, uses a certain amount of data, visits at a certain time of day, comes through certain channels, and views certain content on our website, so they create a pattern: some visit in the morning, some in the evening; some view content A, some view content B. On top of that we aggregate personal fields such as gender, usage duration, and click rates, along with behaviour like how consistently they scroll through a page. Accumulating all of these data points, we create small segments, that is, groups of similar customers. So essentially we use the scikit-learn package in Python to build the clusters.
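To make the segmentation flow above concrete, here is a minimal sketch, assuming a hypothetical table of per-customer engagement features with made-up column names. It uses scikit-learn's KMeans for the clustering step (k-NN and Naive Bayes from the same library are supervised methods, so they would only come in later, for example to classify new customers into existing segments).

```python
# Minimal customer-segmentation sketch (assumed column names, not the exact
# pipeline used on the project).
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Hypothetical engagement data: one row per customer.
df = pd.DataFrame({
    "usage_mb":            [120, 300, 80, 950, 40, 870],
    "visits_per_week":     [2, 5, 1, 12, 1, 10],
    "avg_session_min":     [4, 9, 3, 25, 2, 22],
    "morning_visit_ratio": [0.1, 0.6, 0.2, 0.8, 0.3, 0.7],
})

# Scale features so no single field dominates the distance metric.
X = StandardScaler().fit_transform(df)

# n_clusters must be an explicit integer; KMeans cannot infer it automatically.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
df["segment"] = kmeans.fit_predict(X)

# Profile each segment for the campaign team.
print(df.groupby("segment").mean())
```

Scaling before clustering matters here because the distance-based grouping would otherwise be dominated by whichever feature has the largest numeric range.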
To validate data accuracy and consistency in SQL, the first thing I look at is the data ingestion: where the data is being ingested from, since it usually comes from multiple sources, and how the extraction, transformation, and loading is done, in our case through the Azure pipeline (Azure Data Factory). Is there any inconsistency or discrepancy in the data types? That is the first check for accuracy.

Next, to ensure consistency, I check for null values. For example, say there are 50,000 records in an employee table with fields such as employee ID, employee name, and joining date. If the joining date is missing for around 100 employees and their employee IDs are clearly old ones (say 101 and 102, while the series has now reached 500 or 600), then after identifying where the nulls are we can impute them with the minimum joining date present in the table.

So the very first step is to check the ingestion model for any discrepancy. Once the data fields are accurate and the extraction and transformation have been done correctly, we alter or update the records accordingly. Sometimes the front end passes duplicate events; we can reconcile those duplicates, collapse them into one using CASE statements, and then update the table.

In short: first, the ETL checks; second, impute null values appropriately; third, collapse duplicate values, for example 'goal' with a lowercase g and 'Goal' with a capital G can be merged into one with a CASE statement. Similarly, for any other field with duplicate values, we check the distinct values of each categorical field individually and update the database after collapsing the duplicates. That is how I would do it.
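As a self-contained illustration of the null-imputation and duplicate-collapsing steps described above, here is a small sketch using an in-memory SQLite database from Python; the employee table and its values are made up, and on Azure SQL / Synapse the T-SQL syntax would differ only slightly.

```python
# Sketch of the null-imputation and duplicate-collapsing steps,
# using an in-memory SQLite database (assumed table/column names).
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE employee (emp_id INT, emp_name TEXT, joining_date TEXT, goal TEXT)")
cur.executemany(
    "INSERT INTO employee VALUES (?, ?, ?, ?)",
    [(101, "Asha",  None,         "goal"),   # old employee, missing joining date
     (102, "Ravi",  None,         "Goal"),   # same category, different casing
     (550, "Meera", "2023-04-01", "GOAL"),
     (551, "John",  "2023-05-10", "target")],
)

# 1) Impute missing joining dates with the earliest date present in the table.
cur.execute("""
    UPDATE employee
    SET joining_date = (SELECT MIN(joining_date) FROM employee
                        WHERE joining_date IS NOT NULL)
    WHERE joining_date IS NULL
""")

# 2) Collapse case-level duplicates in a categorical field with a CASE expression.
cur.execute("""
    UPDATE employee
    SET goal = CASE WHEN LOWER(goal) = 'goal' THEN 'goal' ELSE goal END
""")

print(cur.execute("SELECT * FROM employee").fetchall())
con.commit()
```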
When we analyze data from high-velocity, high-volume sources, that means the table has a vast number of records, running into the millions. The first thing to do is optimize the query; we cannot pull all the data at once. So first I look at the top 100 records to understand the shape of the table. On a live database, say working on the cloud in Azure Synapse Studio, we check those first hundred records, and once we have a good picture of the tables, we select only the columns the business stakeholders actually need and apply filters. Suppose we need data for six months, we apply a date filter; suppose we want a particular field value, say source = 'SEO', we filter on that. We cannot fetch all 50 lakh records in one go, so the very first thing is to optimize the query and pull only what we require.

The next thing is normalization, because with that many records we cannot store everything in one table; we create different IDs and link the tables. Say we have a user ID and we want to store the user's visit pattern with timestamps: we can create a visitor ID per user per day, and against that visitor ID store the timestamps showing how many times the user visited the website that day. That reduces the dataset and makes the data scalable and usable for other teams. So that is how I think about it: optimize the query, use normalization, and connect the different tables on their IDs using inner joins, left joins, or whatever the business purpose requires. That is how to handle and analyze data from high-velocity, high-volume sources.
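Here is a minimal sketch of the "pull only what you need" idea, again with an assumed schema and SQLite from Python for the sake of a runnable example; on Synapse you would write SELECT TOP 100 rather than LIMIT 100, but the column pruning, filtering, and per-user-per-day aggregation are the same.

```python
# Explicit columns, filters, and a row cap instead of SELECT * over the whole
# table (assumed schema and sample data).
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE web_events (user_id INT, source TEXT, event_ts TEXT, page TEXT, revenue REAL)")
cur.executemany("INSERT INTO web_events VALUES (?, ?, ?, ?, ?)", [
    (1, "SEO",  "2024-01-05 09:12:00", "home",     0.0),
    (1, "SEO",  "2024-03-10 18:40:00", "course_a", 49.0),
    (2, "paid", "2024-02-11 11:05:00", "course_b", 0.0),
    (3, "SEO",  "2023-06-01 08:00:00", "home",     0.0),
])

query = """
    SELECT user_id, DATE(event_ts) AS visit_date, COUNT(*) AS visits_in_day
    FROM web_events
    WHERE source = ?                    -- filter to the one channel we need
      AND event_ts >= ?                 -- only the recent window, not all history
    GROUP BY user_id, DATE(event_ts)    -- one row per user per day, not per raw event
    LIMIT 100                           -- peek at a sample first
"""
print(cur.execute(query, ("SEO", "2024-01-01")).fetchall())
```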
Generally, in any BI tool, for example Tableau or Power BI, you can create a dynamic, interactive dashboard from multiple sources. Instead of importing the data, say when I am connecting to my Azure SQL / Synapse source, I can use the DirectQuery method (a live connection, in Tableau terms). With DirectQuery, and a refresh interval set to, say, every 5 or 10 minutes, the report queries the source directly and gets the data on demand rather than importing the whole dataset in one go; the query itself should be optimized first. Whatever analysis we build on that dataset stays interactive: we can apply different kinds of filters, drill-down steps, action buttons, and different visuals built with DAX measures. All of that makes it dynamic and interactive, and for real-time data monitoring we use DirectQuery instead of importing the data. That is how I usually create real-time monitoring dashboards.
That is essentially a migration: moving from one platform to another. For that, we connect Tableau to the source that Power BI is using, so there are multiple linked channels: Power BI is connected to the data warehousing system, and Tableau is then pointed at that same underlying source. If we want to import all of the data into Tableau, we can do that; and if we want it on a real-time basis, we can reuse the same query logic in Tableau Desktop / Server, apply it there, and complete the transition. Generally I have used Power BI a lot; I have used Tableau comparatively less, mainly to create interactive dashboards, and we did very few of these transitions in my client organization. So I know how the process works, and this is how it would need to be explored and made possible.
The main problem here is n_clusters='auto'. If you do not define the number of clusters, the k-means algorithm (the KMeans package) cannot determine on its own how many clusters to form from the data points. Remember the multiple data points we discussed earlier: if we ask for 5 clusters it will partition the data one way, and if we ask for 8 it will partition it another way, so the choice has to come from us. If we do not explicitly define n_clusters as an integer, the code snippet will not run correctly. That is the main thing.
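Since the original snippet is not reproduced here, the sketch below assumes the issue is exactly the one described: KMeans being given n_clusters='auto'. scikit-learn's KMeans requires an explicit integer, and comparing inertia across a few candidate values is one common way to choose it.

```python
# Corrected sketch: KMeans needs an explicit integer n_clusters
# (scikit-learn raises an error for n_clusters='auto').
import numpy as np
from sklearn.cluster import KMeans

# Stand-in for the real feature matrix.
X = np.random.default_rng(0).normal(size=(200, 4))

# One common way to choose k: compare inertia across a few candidate values.
for k in (2, 3, 4, 5):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, round(km.inertia_, 1))

# Then fit with the chosen k.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
```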
Here the query is aggregating scores at the module level but grouping by student ID, and that is the error that should be rectified. Whenever we aggregate a table at the module level, we should group by module_id to get accurate figures. We can still use student_id, but as a filter, for example WHERE student_id IN (a list of student IDs). Since a given student has scores across multiple modules, if we want module-wise average scores we should group the dataset by module_id.
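The original query is not shown, so this reconstruction assumes a simple scores(student_id, module_id, score) table; the point is only that the module-wise average needs GROUP BY module_id, with student_id used, if at all, as a filter.

```python
# Hedged reconstruction with an assumed schema: average score per module.
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE scores (student_id INT, module_id INT, score REAL)")
cur.executemany("INSERT INTO scores VALUES (?, ?, ?)", [
    (1, 10, 80), (1, 11, 60), (2, 10, 90), (2, 11, 70), (3, 10, 85),
])

# Grouping by student_id would give per-student averages, not per-module ones.
# Correct shape: aggregate per module, optionally filtering to a set of students.
module_avg = cur.execute("""
    SELECT module_id, AVG(score) AS avg_score
    FROM scores
    WHERE student_id IN (1, 2, 3)
    GROUP BY module_id
""").fetchall()
print(module_avg)   # [(10, 85.0), (11, 65.0)]
```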
In cloud services, data is generally captured from multiple sources; we do web scraping, and a lot of data comes in from different places. Since I have worked with Azure, I will give the example there: in Azure Data Factory we do the ETL process, with Azure Databricks for the transformation. Now, if it is very large-scale data, the very first thing is to capture the correct data within the transformation step, and then aggregate the data before storing it. If you store every single record with its timestamp, as I mentioned earlier, you can easily end up with around 50 million records in one container, and retrieving that data or writing query scripts against it becomes unscalable.

So, first: aggregate the data and then store it. Second: normalization, that is, store IDs, split the data into different tables, and interlink those tables. Third: create dictionary (lookup) tables and link to them. For example, if every record carries a path name, we can create an ID for each specific path: when a user views a particular piece of content on the website, that content gets an ID, we map the ID to the path name, and store the ID instead. That reduces the size of the dataset. So the very first thing we try is aggregation; if aggregation is not possible, for instance when there are many categorical variables, we try to normalize the dataset. This is how data quality and efficiency are maintained in SQL.
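Below is a small pandas sketch of the dictionary-table and aggregation ideas, with made-up data; in the real pipeline this logic would run in Azure Data Factory / Databricks rather than locally.

```python
# Replace a repeated long path string with a small integer ID, and store daily
# aggregates instead of raw events (assumed column names and sample data).
import pandas as pd

events = pd.DataFrame({
    "user_id": [1, 1, 2, 2, 2],
    "event_date": ["2024-05-01"] * 5,
    "path": ["/courses/data-analytics/intro",
             "/courses/data-analytics/intro",
             "/courses/sql/joins",
             "/courses/sql/joins",
             "/courses/data-analytics/intro"],
})

# Dictionary (lookup) table: one row per distinct path, keyed by a compact ID.
dim_path = (events[["path"]].drop_duplicates()
            .reset_index(drop=True)
            .rename_axis("path_id").reset_index())

# Fact table stores only the path_id plus an aggregate per user/day/path.
fact = (events.merge(dim_path, on="path")
        .groupby(["user_id", "event_date", "path_id"])
        .size().reset_index(name="visits"))

print(dim_path)
print(fact)
```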
There are two ways to do this. First, say I have imported the data and already built the dashboard, made it interactive and dynamic according to my stakeholders' requirements using action filters, the required graphs, DAX measures, and so on. Once I publish it to the service (Power BI Service or Tableau Server), I can set up a scheduled refresh at a short interval, so the dashboard refreshes every 4 or 5 minutes, say. That is the most straightforward way of keeping the dashboard refreshed.

The second option is live querying: instead of importing the whole dataset, we use DirectQuery. But consider what happens with DirectQuery against Synapse, where there are two kinds of pools, a dedicated pool and a serverless pool. If the dashboard refreshes every 5 minutes on a serverless pool, and in the organization there are multiple teams working on SQL, say 50 to 100 people running scripts against that same serverless pool, then refreshing your dashboard can take a long time for you and for the team as well. So instead we can use ADX (Azure Data Explorer) with KQL, the Kusto Query Language, and link the data to the live dashboards that way. The dashboard can then be refreshed without interruption, because ADX is designed to handle large-scale datasets and can serve direct queries without importing the data.
This can be answered in different ways. First, what an advanced SQL window function and a CTE are: with the WITH clause we can create a common table expression, effectively a temporary table, and make it callable wherever we need it in the script. Say I want an aggregation, but not over the entire table. Suppose there are 10 different categories in column A, and each category has 5 subcategories in column B (Roman numerals I, II, III and so on), so 50 combinations in total, each with a marks value. I do not want the whole thing aggregated; I only want the top performer in each category. So we can use the RANK or DENSE_RANK window function to rank the rows within each category, wrap that in a WITH clause as a subquery that keeps only rank 1 (the top performers), then call that CTE and aggregate on top of it. This is how we can analyze complex data aggregations: CTEs are temporary tables we can reference as many times as we need in the script, so instead of doing one huge aggregation in a single go, we can split it into ten smaller steps, merge them into one, and present that in our report to the stakeholders.
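Here is a minimal runnable sketch of the CTE-plus-window-function pattern described above, using SQLite from Python with assumed table and column names (window functions need SQLite 3.25 or newer); the same WITH / DENSE_RANK structure carries over to T-SQL.

```python
# CTE + window function: rank rows within each category and keep only the
# top performer per category (assumed table/column names).
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE results (category TEXT, subcategory TEXT, marks INT)")
cur.executemany("INSERT INTO results VALUES (?, ?, ?)", [
    ("A", "I", 78), ("A", "II", 91), ("A", "III", 64),
    ("B", "I", 85), ("B", "II", 80),
])

top_per_category = cur.execute("""
    WITH ranked AS (                  -- CTE: temporary, reusable result set
        SELECT category, subcategory, marks,
               DENSE_RANK() OVER (PARTITION BY category ORDER BY marks DESC) AS rnk
        FROM results
    )
    SELECT category, subcategory, marks
    FROM ranked
    WHERE rnk = 1                     -- top performer in each category
""").fetchall()
print(top_per_category)   # e.g. [('A', 'II', 91), ('B', 'I', 85)]
```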