Vetted Talent

Madhumitha Kolkar

Seasoned Machine Learning Engineer with 3.3 years of professional experience, specializing in Natural Language Processing, Computer Vision, Deep Learning, and Generative AI.

  • Role

    ML Engineer

  • Years of Experience

    4 years

  • Professional Portfolio

    View here

Skillsets

  • Algorithms
  • Programming Languages
  • Python - 5 Years
  • MLOps - 4 Years
  • SQL - 4 Years
  • PyTorch - 4 Years
  • TensorFlow - 4 Years
  • LLMs - 4 Years
  • Transformers - 4 Years
  • Scikit-learn - 4 Years
  • AI Agents
  • Hugging Face - 4 Years

Vetted For

10 Skills
  • Role: Machine Learning Scientist II (Places) - Remote (AI Screening)
  • Result: 82%
  • Skills assessed: Large POI Database, Text Embeddings Generation, ETL pipeline, LLM, Machine Learning Model, NLP, Problem Solving Attitude, Python, R, SQL
  • Score: 74/90

Professional Summary

4 Years
  • Jan 2024 - Present (1 yr 8 months)

    Machine Learning Engineer

    SNOWKAP | POWERWEAVE
  • Jan 2021 - Jan 2024 (3 yrs)

    Machine Learning Engineer

    MERCEDES BENZ RESEARCH AND DEVELOPMENT
  • Jan 2020 - Jan 2021 (1 yr)

    Data Scientist

    DELOITTE

Applications & Tools Known

  • OpenCV
  • NumPy
  • Dialogflow
  • Mediapipe
  • Streamlit
  • Pandas
  • PyTorch
  • scikit-learn
  • Android SDK
  • MySQL
  • MongoDB
  • Git
  • Flask
  • FastAPI
  • PostgreSQL
  • FAISS
  • Pinecone
  • AWS Lambda
  • AWS S3
  • AWS EC2
  • Docker
  • Hugging Face
  • Airflow
  • AWS

Work History

4 Years

Machine Learning Engineer

SNOWKAP | POWERWEAVE
Jan 2024 - Present (1 yr 8 months)
    Speech Emotion Detection, AI-Generated Shots, Integrated LlamaParse, Prompt engineering

Machine Learning Engineer

MERCEDES BENZ RESEARCH AND DEVELOPMENT
Jan 2021 - Jan 2024 (3 yrs)
    Conversational AI, Predictive Maintenance, YOLOv5 Powered System, BERT Custom Data

Data Scientist

DELOITTE
Jan 2020 - Jan 2021 (1 yr)
    Enhanced Legal Transcription, Speaker Diarization

Achievements

  • Exemplary Performance: Recognized as a "Star Performer" for consistently exceeding established company benchmarks by 40%.
  • Mentorship and Talent Acquisition: Trained and mentored over 15 individuals, and actively participated in hiring for senior positions (T7/T8/T9) and freshers, helping attract top talent for company growth.
  • Open-Source Advocate: Made 4 notable contributions to popular Machine Learning libraries such as Keras, TensorFlow, and OpenAI Whisper, actively promoting collaborative development within the Machine Learning community.
  • Speaker for Google: "Conscious Algorithms", a talk on AI Safety.
  • Star Performer award
  • Google Dev Conference speaker
  • Mentorship of 15+ individuals
  • AI Safety presentation

Major Projects

11 Projects

QuantumAR

    Developed an Augmented Reality application with OpenCV and Python for real-time feature matching and dynamic object replacement.

Noah

    Designed a smart chat/recommendation bot using FastAPI, Python, Dialogflow, MySQL, and NLP for a custom shopping site.

AirFlow

    Developed an interactive gesture-based Air Canvas using ML, Mediapipe, and OpenCV for tracking hand movements and recognizing gestures.

paperScribe

    Architected and implemented PaperScribe, a RAG AI system powered by GPT-3

Moodmap

    Engineered a novel Multimodal AI system for real-time emotion analysis

SigSafe

    Implemented a Siamese network architecture integrated with a CNN model for signature verification

SayWhatNow

    Developed a deep learning-based custom next-word predictor utilizing LSTMs

Notii-fy

Feb, 2024 - Mar, 2024 1 month

    Engineered Notiify, an ad-free, local music player application inspired by Spotify. Using Python and speech recognition, it enables hands-free music control by letting users voice-activate song playback, with an average voice command response time of 0.88 seconds.

OpinionSense

Jan, 2024 - Feb, 2024 1 month

    Coded and implemented OpinionSense, a sentiment analysis system for reviews using a Recurrent Neural Network (RNN) from scratch. Achieved an F1-score of 0.85.

Cropcure_AI

Jan, 2024 - Feb, 2024 1 month

    Created CropCure AI, a Flask application leveraging Deep Learning for real-time blight disease detection in potato leaves, achieving an F1-score of 88%.

MK_LLM

Jan, 2024 - Feb, 2024 1 month
    • Devised a bigram character-based architecture, achieving a competitive F1-score of 87.2 on a held-out validation set.
    • Trained MK-LLM on a subset of OpenWebText, a massive text dataset comparable to GPT-2 training data.

Education

  • Bachelor Of Engineering - Computer Science

    SDMCET Dharwad, KA (2020)

Certifications

  • Machine Learning

    DeepLearning.AI - Stanford (Apr 2024)

Interests

  • Filmmaking
  • Travel
  • Art
  • Photography

AI-Interview Questions & Answers

    Hi, my name is Madhumitha Kolkar, and I am a machine learning engineer with 3.3 years of experience working on natural language processing, computer vision, generative AI, and speech recognition. I have worked at two companies before. My previous organization was Mercedes-Benz, where I worked as a machine learning engineer on projects mainly related to natural language processing, computer vision, and generative AI. Before that I was at Deloitte, where I worked on speech recognition: a speech-to-text product for a legal firm that converted recordings of legal hearings into text, because manual transcription kept introducing human errors. At Mercedes, one project was a customer-service chatbot for booking online appointments. It was valuable because it cut the overhead incurred when people missed appointments and we still had to pay the technical staff. We built an in-house, encoder-decoder-based LSTM capable of three things: classifying intent so an issue could be routed to the right team, predicting whether an issue was self-diagnosable or needed a service-center appointment, and, for self-diagnosable issues, suggesting quick fixes learned from a curated subset of diagnosis data. That work got us the funding to scale up; we moved to Google's Vertex AI and used BERT with transfer learning and example-based learning. So I have experience with large language models and with productionizing and deploying them. On the computer vision side, we used YOLO for object detection: the goal was to automate UI testing, which engineers had been doing manually because the car's security restrictions meant we could not script the system directly. YOLO handled detection, ResNet-50 handled fine-grained classification of icons, and Python scripting automated the flow once classification was reliable. Beyond that, I did generative AI research and built applications to integrate into our systems, including work on RAG, retrieval-augmented generation, to find a strong application use case. That is my experience.

    Suggest a method to automate data verification in an ETL pipeline for consistency. An ETL pipeline is an extract-transform-load pipeline, so to achieve consistency I would automate verification in stages. Start by profiling the data: understand the structure, content, and quality of what you actually have, using tools such as Informatica Data Quality or Apache Griffin to gather statistics and identify patterns. Then validate the data: define fixed rules so the data adheres to the transformation rules, which I would implement with SQL scripts, constraints, and integrity checks on mandatory fields, and apply them before, during, and after ETL so the data is verified at every stage. Next, automate those checks in a testing framework so they run on their own instead of someone triggering them manually; on AWS there is Deequ, where you declare expectations, integrate the tests into the pipeline, and run them at set intervals. Then monitoring: the pipeline should alert you when there are inconsistencies, which can be done with Grafana or AWS CloudWatch, tracking ETL process status and data quality against metrics we have defined so we can decide when something needs fixing. On top of that I would add auditing with a tool like Apache Atlas or Informatica, keeping track of the transformations applied to the data and maintaining logs. Finally, version control and CI/CD, Git with Jenkins for example, so we stay up to date, can roll back when needed, and can automatically test and deploy the ETL.
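
    The following is a minimal sketch of the rule-based validation step described above, written with pandas rather than a managed tool such as Deequ or Informatica; the table layout, column names, and thresholds are hypothetical, chosen only to illustrate the idea.

        import pandas as pd

        def validate_batch(df: pd.DataFrame) -> list:
            """Return a list of human-readable rule violations for one batch."""
            violations = []

            # Mandatory-field constraint: no NULLs in key columns.
            for col in ("poi_id", "name", "category"):
                nulls = int(df[col].isna().sum())
                if nulls:
                    violations.append(f"{col}: {nulls} NULL values")

            # Integrity constraint: the primary key must be unique.
            dupes = int(df["poi_id"].duplicated().sum())
            if dupes:
                violations.append(f"poi_id: {dupes} duplicate rows")

            # Domain constraint on a transformed field.
            if not df["rating"].between(0, 5).all():
                violations.append("rating: values outside the 0-5 range")

            return violations

        # Run the checks after the transform stage; wire the alert into
        # CloudWatch, Grafana, or a simple notification in practice.
        batch = pd.DataFrame({
            "poi_id": [1, 2, 2],
            "name": ["Grand Hotel", "Central Cafe", None],
            "category": ["hotel", "restaurant", "restaurant"],
            "rating": [4.5, 5.0, 7.0],
        })
        problems = validate_batch(batch)
        if problems:
            print("Data quality alert:", problems)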

    How can an LLM be utilized to enhance an existing NLP-based system? I can relate this to my own experience: initially we used our in-house, encoder-decoder-based LSTM trained on the data we had collected, and once we scaled up we moved to BERT. Large language models can significantly enhance what a local model is capable of because the amount of data they are trained on is enormous; it is hard to build a local NLP model with that breadth of knowledge. Text generation and summarization improve, since the embeddings are far more sophisticated and the model has seen a wide variety of language, so using something like GPT-3.5 or GPT-4 to produce high-quality, coherent summaries works significantly better than a purely local model. Transfer learning reduces how much training we have to do ourselves, because so much has already been learned, and it is useful for automating things like content creation. LLMs also have a strong understanding of language, so leveraging them for sentiment analysis gives higher accuracy: sentiment often carries nuance that is hard to capture when training from scratch, but a pre-trained model handles it well. The same applies to analyzing customer feedback, named entity recognition, and machine translation, where the system becomes more fluent. For conversational AI and chatbots, integrating an LLM improves natural language understanding and context-aware responses, and LLMs can also be used for classification, topic modeling, and information retrieval, for example retrieval-augmented generation for Q&A. Essentially, the sheer amount of data they are trained on gives you an edge over your base model.
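
    As a small illustration of the point about swapping a from-scratch model for a pre-trained one, here is a hedged sketch using the Hugging Face pipeline API; the model names are common public checkpoints used as examples, not a description of the original system.

        from transformers import pipeline

        # Sentiment analysis with a pre-trained checkpoint instead of an
        # LSTM trained from scratch.
        sentiment = pipeline(
            "sentiment-analysis",
            model="distilbert-base-uncased-finetuned-sst-2-english",
        )
        print(sentiment("The service appointment was rescheduled twice, not great."))

        # Zero-shot intent routing: no task-specific training data needed.
        router = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")
        print(router(
            "My brake warning light keeps flashing",
            candidate_labels=["self-diagnosable", "service appointment", "billing"],
        ))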

    Devise a strategy for implementing an SQL-based solution for real-time POI metadata enrichment. This is a multi-step approach: data acquisition, real-time processing, enriching the data, which is essentially augmenting it, and maintaining the database. First, data acquisition: collect POI data from internal and external sources so there is a good base to enrich. Primary data would include basic details such as names, addresses, and coordinates; external data could add demographic information, weather, or social media check-ins. Second, integration: design an architecture that can handle real-time processing, using ETL tooling, message queues, and SQL databases to manage the flow, for example Kafka for streaming and MySQL for storage. Third, real-time ingestion: bring in the POI data and the external sources through stream processing, again something like Kafka. Fourth, enrichment: use SQL and stored procedures to enrich records as they are ingested, possibly in micro-batches. A major step is the database schema: tables for the POIs, the external data, and the enriched data, indexed so query performance stays acceptable. Then the real-time processing logic itself: stored procedures that implement the enrichment logic, and triggers on inserts and updates. Finally, monitoring and alerts with tools like Grafana or Prometheus, watching the pipeline and alerting on poor performance or bottlenecks. That is the flow I would follow.
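
    A compact sketch of the ingestion-plus-enrichment step under the assumptions above, with SQLite standing in for the production SQL database (which, as discussed, would more likely be MySQL or Postgres with stored procedures and triggers); the schema and column names are invented for illustration.

        import sqlite3

        conn = sqlite3.connect(":memory:")
        conn.executescript("""
            CREATE TABLE poi (id INTEGER PRIMARY KEY, name TEXT, lat REAL, lon REAL);
            CREATE TABLE external_demographics (poi_id INTEGER, population INTEGER);
            CREATE TABLE poi_enriched (
                poi_id INTEGER PRIMARY KEY, name TEXT, population INTEGER
            );
        """)

        def ingest_and_enrich(record, demographics):
            """Simulate one real-time ingestion event followed by enrichment."""
            conn.execute("INSERT INTO poi (id, name, lat, lon) VALUES (?, ?, ?, ?)", record)
            conn.execute("INSERT INTO external_demographics VALUES (?, ?)", demographics)
            # In Postgres/MySQL this join would live in a stored procedure
            # fired by a trigger on INSERT; here it is an explicit upsert.
            conn.execute("""
                INSERT OR REPLACE INTO poi_enriched (poi_id, name, population)
                SELECT p.id, p.name, d.population
                FROM poi p JOIN external_demographics d ON d.poi_id = p.id
                WHERE p.id = ?
            """, (record[0],))

        ingest_and_enrich((1, "Central Cafe", 12.97, 77.59), (1, 850000))
        print(conn.execute("SELECT * FROM poi_enriched").fetchall())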

    Propose a technique for incorporating vector database technology in a POI matching algorithm. The point of a vector database here is that we want to group things that are close to each other, finding similar and relevant POIs based on features such as descriptions, geography, or other attributes. The first step is data preparation: collect the POI data, names, addresses, coordinates, and so on, and extract the relevant features, because those are what the embeddings will be built from. Features could be textual, such as descriptions and categories, demographic, or other attributes such as ratings. Next, create the embeddings: convert the POI features into vectors, ideally with pre-trained models such as BERT, a GPT-style encoder, fastText, or a custom-trained model for the textual data; geographic features like latitude and longitude can be encoded as spatial vectors, and both parts combined into a single vector representation. Then set up the vector database to store and manage these embeddings. I have personally used FAISS, Facebook's similarity-search library, and there are managed options like Pinecone, which I have not used; either way, you design a schema and store the vectors along with their metadata. After that comes data ingestion, batch or real time, and then the similarity search itself: Euclidean distance, cosine similarity, or a k-nearest-neighbors search to measure proximity. Finally the application logic: user queries have to be translated into vector search operations, and the results and their metadata handled and displayed properly. That is how I would use a vector database with embeddings for POI matching.
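
    Below is a minimal sketch of the combined text-plus-geography matching idea using FAISS; the embed_text function is a placeholder standing in for a real BERT or fastText encoder, and the coordinate normalization is deliberately crude.

        import numpy as np
        import faiss

        DIM_TEXT = 32

        def embed_text(text: str) -> np.ndarray:
            # Placeholder: a hash-seeded random vector instead of a real encoder.
            rng = np.random.default_rng(abs(hash(text)) % (2**32))
            return rng.standard_normal(DIM_TEXT).astype("float32")

        def poi_vector(name: str, lat: float, lon: float) -> np.ndarray:
            geo = np.array([lat / 90.0, lon / 180.0], dtype="float32")  # crude normalization
            return np.concatenate([embed_text(name), geo])

        pois = [("Central Cafe", 12.97, 77.59), ("Grand Hotel", 12.98, 77.60),
                ("City Museum", 13.00, 77.55)]
        index = faiss.IndexFlatL2(DIM_TEXT + 2)  # exact L2 search over combined vectors
        index.add(np.stack([poi_vector(*p) for p in pois]))

        query = poi_vector("Central Cafe", 12.971, 77.591)
        distances, ids = index.search(query.reshape(1, -1), 2)
        print([pois[i][0] for i in ids[0]], distances[0])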

    Describe a system designed to automate the recognition and flagging of outdated POI listings. This involves a system that continuously monitors, evaluates, and updates the status of each point of interest, based on whatever data we can get: interactions, detailed descriptions, and so on. It starts with the data sources: the primary source is the original listings in our database, and external sources could include social media check-ins, websites, government records, and geographic data. Then the system architecture: a data ingestion layer that collects from those sources, a processing layer, storage, and an action layer that notifies and flags outdated listings or fires a trigger when something looks wrong. For ingestion, data is collected and refreshed continuously, with an ETL pipeline built on Airflow or another Apache tool, real-time streaming with Kafka, and ingestion APIs for the social media platforms. Once the data is in, feature extraction and processing: NLP to analyze text and reviews for signals, image processing to pick up visual cues from photos, geospatial analysis for location data, and activity monitoring for businesses. Then the core question, what counts as outdated, needs explicit rules: rule-based detection such as checking when the timestamp was last updated, or machine learning, a classification model trained to recognize outdated listings, or anomaly detection to catch outliers. After that, flagging and notification: database flags plus a notification system with automated triggers. And around the whole thing, monitoring: performance metrics and a feedback mechanism, so the system keeps collecting data, re-evaluating, and updating the listings.
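
    A toy sketch of the rule-based detection layer described above; the fields, thresholds, and flagging rule are invented purely to show the shape of such a check.

        from dataclasses import dataclass
        from datetime import datetime, timedelta

        @dataclass
        class Listing:
            poi_id: int
            last_updated: datetime
            checkins_last_90d: int
            website_reachable: bool

        def is_outdated(listing: Listing, now: datetime) -> bool:
            # Rule 1: not updated for over a year.
            stale = now - listing.last_updated > timedelta(days=365)
            # Rule 2: no recent activity and the website is unreachable.
            inactive = listing.checkins_last_90d == 0
            return stale or (inactive and not listing.website_reachable)

        now = datetime(2025, 1, 1)
        flagged = [
            l.poi_id
            for l in [
                Listing(1, datetime(2022, 1, 5), 0, False),
                Listing(2, datetime(2024, 6, 1), 42, True),
            ]
            if is_outdated(l, now)
        ]
        print("Flag for review:", flagged)  # downstream: notify or trigger re-verification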

    SELECT name, location, category FROM POIs WHERE category IN ('hotel', 'restaurant') AND location IS NOT NULL ORDER BY name DESC LIMIT 10 — what is this query trying to accomplish, and are there any issues with it? The query retrieves a list of POIs that belong to the hotel or restaurant category and whose location is not null, sorts the results by the POI name in descending order, and limits the output to the top 10 entries. As for issues: the query assumes the column names and category values match the schema exactly, and if they do not, including case differences in the category values, it will error out or silently return nothing. The ORDER BY clause sorts by name in descending order, which is probably not the most useful ordering for this use case, although that is a design decision. And depending on the size of the POIs table, the filtering and sorting could hurt performance, so appropriate indexes on the category, location, or name columns would help the query. To sum up, the query is designed to fetch a filtered, sorted list of POIs, restricted to hotels and restaurants with a known location, ordered by name descending, and limited to the top ten.
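
    For reference, here is the query under discussion together with the composite index suggested in the answer, run against an in-memory SQLite table; the real schema and data may of course differ.

        import sqlite3

        conn = sqlite3.connect(":memory:")
        conn.execute("CREATE TABLE POIs (name TEXT, location TEXT, category TEXT)")
        conn.executemany("INSERT INTO POIs VALUES (?, ?, ?)", [
            ("Grand Hotel", "12.98,77.60", "hotel"),
            ("City Museum", "13.00,77.55", "museum"),
            ("Central Cafe", None, "restaurant"),
        ])

        # Composite index on the filter columns, as recommended in the answer.
        conn.execute("CREATE INDEX idx_pois_cat_loc ON POIs (category, location)")

        rows = conn.execute("""
            SELECT name, location, category
            FROM POIs
            WHERE category IN ('hotel', 'restaurant') AND location IS NOT NULL
            ORDER BY name DESC
            LIMIT 10
        """).fetchall()
        print(rows)  # only the hotel row qualifies in this toy dataset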

    The next question gives a Python function meant for matching POI names — it returns True if poi_a.lower().strip() equals poi_b.lower().strip() and both len(poi_a) > 5 and len(poi_b) > 5, otherwise False — and asks whether there are any logical errors that might cause incorrect matching. So the function lowercases and strips both names, compares them for equality, and additionally requires each name to be longer than five characters. The main issue is that checking the length of both names individually is redundant: once the stripped, lowercased names have been found equal, they necessarily have the same length, so a single length check on the cleaned value is enough and the condition can be simplified. A related subtlety is that the length checks run on the raw strings before stripping, so whitespace padding can let a short name slip past the length guard even though the comparison uses the stripped value. There is also no error handling: nothing catches a None or non-string input, so a try/except or an explicit type check would make it more robust. But the core logical point is that the duplicated length check adds nothing given the equality condition.
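
    Here is a reconstruction of the function as read aloud in the question, followed by the simplified version the answer argues for; the padded-name example shows why measuring length on the stripped value is the safer choice.

        def match_pois(poi_a: str, poi_b: str) -> bool:
            # Reconstruction of the function as dictated in the question.
            if (poi_a.lower().strip() == poi_b.lower().strip()
                    and len(poi_a) > 5 and len(poi_b) > 5):
                return True
            return False

        def match_pois_simplified(poi_a: str, poi_b: str) -> bool:
            # Equal cleaned strings already have equal length, so one length
            # check on the stripped value is enough; it also stops whitespace
            # padding from defeating the length guard.
            a, b = poi_a.strip().lower(), poi_b.strip().lower()
            return a == b and len(a) > 5

        print(match_pois("  ab  ", "   ab   "))             # True: padding slips past the raw-length check
        print(match_pois_simplified("  ab  ", "   ab   "))  # False: the cleaned name is too short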

    Formulate an approach for creating a high-accuracy machine learning model in R to predict POI popularity based on various factors. I am not especially strong in R, I am more of a Python person, but I can attempt the question, and the approach is the same: data collection, preprocessing, model selection, training, and evaluation. Start with data collection: gather data from the various sources mentioned before, social media check-ins, geography, and so on, and integrate the datasets into a cohesive format suitable for analysis. Then preprocess: handle missing values and do feature engineering to obtain features that are informative rather than redundant, for example total number of reviews or average rating, and normalize the features, say to a 0-1 scale, which helps later when computing similarities or minimizing a loss. Next, exploratory data analysis: visualizations, plots, and correlation analysis to see how the features relate to each other and to the target variable. After that, model selection: candidates include linear regression, decision trees, random forests, gradient boosting, SVMs, and neural networks, and I would compare them with grid search and a cross-validation set, tuning carefully so there is no overfitting. Once a model is chosen, train it: split the data, train, tune hyperparameters as it becomes clear which model performs well, validate against the cross-validation data, and only then evaluate on the test data. For evaluation, choose the right metric for the nature of the data and task, mean absolute error, mean squared error, or RMSE, and pick whatever configuration minimizes the loss. Finally, deploy the trained model to predict POI popularity on new incoming data, monitor its performance continuously, and retrain periodically with fresh data to maintain accuracy.
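
    The answer's workflow, sketched compactly in Python with scikit-learn rather than R to match its framing; the data is synthetic and the feature names (review count, average rating, check-ins) are assumptions for illustration.

        import numpy as np
        from sklearn.ensemble import RandomForestRegressor
        from sklearn.metrics import mean_absolute_error
        from sklearn.model_selection import GridSearchCV, train_test_split
        from sklearn.pipeline import Pipeline
        from sklearn.preprocessing import MinMaxScaler

        rng = np.random.default_rng(0)
        X = rng.random((500, 3))                                   # review_count, avg_rating, checkins
        y = 3 * X[:, 0] + 2 * X[:, 1] + rng.normal(0, 0.1, 500)    # synthetic "popularity" target

        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

        pipe = Pipeline([("scale", MinMaxScaler()),                # 0-1 normalization
                         ("model", RandomForestRegressor(random_state=0))])
        search = GridSearchCV(pipe,
                              {"model__n_estimators": [100, 300],
                               "model__max_depth": [None, 10]},
                              cv=5, scoring="neg_mean_absolute_error")
        search.fit(X_train, y_train)

        preds = search.predict(X_test)
        print("best params:", search.best_params_)
        print("MAE:", mean_absolute_error(y_test, preds))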

    Devise a plan to optimize the SQL queries used in an ETL process for greater efficiency without sacrificing data integrity. Since queries are already in place, I would first analyze and understand them: profile them with EXPLAIN or ANALYZE to see the current execution plans, and identify bottlenecks, queries that run slowly, redundant frequent scans, anything causing high I/O or wasted work. Then index optimization: make sure the columns used in WHERE, JOIN, and ORDER BY clauses are properly indexed, and keep those indexes maintained. Then optimize the queries themselves: replace subqueries with common table expressions where that improves readability and performance, avoid SELECT * and read only the necessary columns to reduce the amount of data processed and transferred, make sure inner, outer, left, and right joins are appropriate and optimized, and minimize redundant calculations. For large datasets, partition the data so queries work on smaller blocks more efficiently. Then optimize storage: normalize tables to reduce redundancy, archive historical data to separate storage, and cache frequently accessed data so query performance on the active data improves. Batch processing and parallel processing can help as well, along with tuning the hardware and database configuration. Finally, monitor everything: regular monitoring, automated alerts and triggers, and ongoing performance tuning.
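
    Two of the steps above, inspecting a query plan and indexing a filter column, illustrated with SQLite's EXPLAIN QUERY PLAN as a stand-in for EXPLAIN/ANALYZE on the production database; the table and column names are invented.

        import sqlite3

        conn = sqlite3.connect(":memory:")
        conn.execute("CREATE TABLE staging_orders (id INTEGER, region TEXT, amount REAL)")
        conn.executemany("INSERT INTO staging_orders VALUES (?, ?, ?)",
                         [(i, "south" if i % 2 else "north", i * 1.5) for i in range(1000)])

        query = "SELECT id, amount FROM staging_orders WHERE region = 'south'"  # no SELECT *

        print("before:", conn.execute("EXPLAIN QUERY PLAN " + query).fetchall())
        conn.execute("CREATE INDEX idx_orders_region ON staging_orders (region)")
        print("after: ", conn.execute("EXPLAIN QUERY PLAN " + query).fetchall())
        # The plan should switch from a full table scan to an index search.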

    What approach would you take to integrate an LLM into an existing Python-based ETL pipeline for enhanced NLP processing? First, define the use case: the specific tasks where an LLM would actually enhance natural language understanding, text extraction, sentiment analysis, named entity recognition, text generation, machine translation, and so on. Then choose an LLM suited to that purpose, there is a wide range to pick from, GPT-3 among others, based on the requirements. With the use case and model decided, the integration steps are: set up the environment and install the dependencies, for example the OpenAI client and any supporting libraries; handle authentication, verifying the API keys and tokens for the LLM service; preprocess the data, cleaning and encoding the text so it is ready for the model; integrate the API calls into the ETL pipeline, sending requests with the text data and handling the responses; and implement robust error handling so API failures do not break the pipeline. Scaling matters too: make sure the pipeline can handle larger volumes of text efficiently given API rate limits and performance requirements. Then validation and testing, unit tests, integration tests, and performance tests, to validate the end-to-end functionality of the pipeline. Once it is deployed, the final stage is monitoring and maintenance: logging, regular maintenance, watching for updates or changes to the LLM's API and its dependencies, and a feedback loop from users or stakeholders that drives continuous improvement of the pipeline and the integration. That is my answer.
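
    A hedged sketch of the transform step described above: each text record is sent to an LLM for sentiment tagging with basic error handling around the call. It assumes the openai>=1.0 Python client and an OPENAI_API_KEY in the environment; the model name is only an example.

        from openai import OpenAI

        client = OpenAI()  # reads OPENAI_API_KEY from the environment

        def enrich_record(record: dict) -> dict:
            """Transform step: attach an LLM-derived sentiment label to a record."""
            try:
                resp = client.chat.completions.create(
                    model="gpt-4o-mini",
                    messages=[
                        {"role": "system",
                         "content": "Classify the sentiment of the text as positive, "
                                    "negative, or neutral. Reply with one word."},
                        {"role": "user", "content": record["text"]},
                    ],
                )
                record["sentiment"] = resp.choices[0].message.content.strip().lower()
            except Exception as exc:  # keep the pipeline alive on API failures
                record["sentiment"] = "unknown"
                record["error"] = str(exc)
            return record

        # Usage inside the existing pipeline's transform stage:
        batch = [{"id": 1, "text": "The appointment booking was quick and painless."}]
        print([enrich_record(r) for r in batch])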