Big Data Developer
Blue Dart

Big Data Developer
Cars 24

Developed Spark applications for data validation, cleansing, transformation, and custom aggregation.
Created EC2 instances and EMR clusters for development and testing.
Experienced in using Sqoop to import and export data to and from cloud-based data storage services such as Amazon S3.
Proficient in creating and managing Hive tables, including managed, external, and partitioned tables.
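A minimal Scala sketch of the kind of Hive DDL this covers, issued through a Hive-enabled SparkSession; table names, columns, and the S3 location are hypothetical:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().enableHiveSupport().getOrCreate()

    // Managed table: Hive owns both the metadata and the data files.
    spark.sql("""
      CREATE TABLE IF NOT EXISTS sales_managed (order_id BIGINT, amount DOUBLE)
      STORED AS PARQUET
    """)

    // External table: dropping the table leaves the files at LOCATION intact.
    spark.sql("""
      CREATE EXTERNAL TABLE IF NOT EXISTS sales_external (order_id BIGINT, amount DOUBLE)
      STORED AS PARQUET
      LOCATION 's3a://example-bucket/sales/'
    """)

    // Partitioned table: one directory per order_date value, enabling partition pruning.
    spark.sql("""
      CREATE TABLE IF NOT EXISTS sales_by_day (order_id BIGINT, amount DOUBLE)
      PARTITIONED BY (order_date STRING)
      STORED AS PARQUET
    """)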
Familiarity with Hive query optimization techniques, such as subquery unnesting, predicate pushdown, and vectorization, and their impact on query performance and resource utilization.
Loaded and transformed large sets of semi-structured data like XML, JSON, Avro, and Parquet.
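For illustration, a sketch of loading these formats with Spark in Scala; paths are placeholders, and the Avro and XML readers assume the spark-avro and spark-xml packages are on the classpath:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().getOrCreate()

    // JSON and Parquet readers ship with Spark.
    val jsonDf    = spark.read.json("s3a://example-bucket/events/*.json")
    val parquetDf = spark.read.parquet("s3a://example-bucket/events_parquet/")

    // Avro via the spark-avro package.
    val avroDf = spark.read.format("avro").load("s3a://example-bucket/events_avro/")

    // XML via the spark-xml package; rowTag names the repeating element.
    val xmlDf = spark.read.format("xml")
      .option("rowTag", "event")
      .load("s3a://example-bucket/events_xml/")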
Proficient in developing and implementing Spark RDD-based and DataFrame-based data processing workflows using Scala, Java, or Python programming languages.
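As a sketch, the same per-customer aggregation expressed both ways in Scala (input path and layout are hypothetical):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().getOrCreate()

    // RDD style: low-level transformations over raw text lines.
    val totalsRdd = spark.sparkContext
      .textFile("s3a://example-bucket/orders.csv")
      .map(_.split(","))
      .filter(_.length == 2)
      .map(cols => (cols(0), cols(1).toDouble))
      .reduceByKey(_ + _)

    // DataFrame style: the same totals, declaratively.
    val totalsDf = spark.read.csv("s3a://example-bucket/orders.csv")
      .selectExpr("_c0 AS customer", "CAST(_c1 AS DOUBLE) AS amount")
      .groupBy("customer")
      .sum("amount")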
Handled Hadoop MapReduce jobs to process large data sets.
Processed web URL data using Scala and converted it to DataFrames for further transformations.
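A minimal sketch of that pattern, assuming one raw URL per input line (path and column names are hypothetical):

    import java.net.URI
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().getOrCreate()
    import spark.implicits._

    // Parse each URL into host and path, silently dropping unparseable lines.
    val urlDf = spark.sparkContext
      .textFile("s3a://example-bucket/urls.txt")
      .flatMap { raw =>
        scala.util.Try {
          val u = new URI(raw.trim)
          (u.getHost, u.getPath)
        }.toOption
      }
      .toDF("host", "path")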
Generated complex JSON data after all the transformations for easy storage and access as per client requirements.
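A sketch of producing nested JSON from a flat result in Scala; the schema and output path are illustrative only:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.{collect_list, struct, to_json}

    val spark = SparkSession.builder().getOrCreate()
    import spark.implicits._

    // Hypothetical flattened output of earlier transformations.
    val flat = Seq(
      ("c1", "2024-01-01", 120.0),
      ("c1", "2024-01-02", 80.0)
    ).toDF("customer", "order_date", "amount")

    // Nest each customer's orders, then serialize the row as one JSON document.
    val jsonOut = flat
      .groupBy("customer")
      .agg(collect_list(struct($"order_date", $"amount")).as("orders"))
      .select(to_json(struct($"customer", $"orders")).as("payload"))

    jsonOut.write.mode("overwrite").text("s3a://example-bucket/customer_json/")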
Optimized Spark jobs, RDDs, and DataFrames for scalability, performance, and cost efficiency by tuning memory allocation, serialization, caching, partitioning, and compression.
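A sketch of those tuning levers; all values are illustrative and depend on cluster size and data volume:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.storage.StorageLevel

    val spark = SparkSession.builder()
      .config("spark.executor.memory", "8g")                                     // memory allocation
      .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") // serialization
      .config("spark.sql.shuffle.partitions", "400")                            // shuffle parallelism
      .getOrCreate()

    val events = spark.read.parquet("s3a://example-bucket/events/") // hypothetical input

    // Cache a reused intermediate result, spilling to disk when memory is tight.
    val ok = events.filter("status = 'OK'").persist(StorageLevel.MEMORY_AND_DISK)

    // Repartition on the join key, then write compressed, partitioned output.
    ok.repartition(200, ok("customer_id"))
      .write.mode("overwrite")
      .option("compression", "snappy")
      .partitionBy("event_date")
      .parquet("s3a://example-bucket/events_curated/")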
Developed reusable transformations to load data from flat files and other data sources to the data warehouse.
Developed Hive SQL queries, mappings, tables, and external tables for analysis across different banners, and worked on partitioning, optimization, compilation, and execution.
Responsible for the design and development of analytic models, applications, and supporting tools that enable developers to create algorithms in a big data ecosystem.
Implemented Spark using Scala and Spark SQL for faster testing and processing of data.
Proficient in designing Avro schema for Hive tables and managing schema evolution to accommodate changes in data structure and format.
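A sketch of an Avro-backed Hive table; adding the nullable source field with a default is one way to evolve the schema while older files stay readable. Record and table names are hypothetical:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().enableHiveSupport().getOrCreate()

    spark.sql("""
      CREATE TABLE IF NOT EXISTS events_avro
      STORED AS AVRO
      TBLPROPERTIES ('avro.schema.literal' = '{
        "type": "record", "name": "Event",
        "fields": [
          {"name": "id",     "type": "long"},
          {"name": "status", "type": "string"},
          {"name": "source", "type": ["null", "string"], "default": null}
        ]
      }')
    """)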
Strong understanding of Hive performance optimization techniques for serialized data, such as columnar storage, data partitioning, and indexing, and their trade-offs in terms of query performance and resource utilization.
Line-managed team members and supported their professional development.
Experienced in using Spark RDD, DataFrame and SQL transformations and actions to process large-scale structured and semi-structured data sets, including filtering, mapping, reducing, grouping, and aggregating data.
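For illustration, a small Scala sketch of lazy transformations followed by actions (data is inlined and hypothetical):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.{avg, count, upper}

    val spark = SparkSession.builder().getOrCreate()
    import spark.implicits._

    val orders = Seq(
      ("books", "uk", 12.0), ("books", "us", 30.0), ("toys", "uk", 7.5)
    ).toDF("category", "region", "amount")

    // Transformations are lazy: filter, column mapping, grouping, aggregation.
    val summary = orders
      .filter($"amount" > 10)
      .withColumn("region", upper($"region"))
      .groupBy("category", "region")
      .agg(count("*").as("orders"), avg("amount").as("avg_amount"))

    // Actions trigger execution.
    summary.show()
    println(summary.count())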
Created Hive schemas using performance techniques like partitioning and bucketing.
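A sketch of combining both techniques in one DDL statement (names and bucket count are illustrative):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().enableHiveSupport().getOrCreate()

    // Partitioning prunes directories at query time; bucketing fixes the file
    // layout within each partition, which helps joins and sampling.
    spark.sql("""
      CREATE TABLE IF NOT EXISTS clicks (user_id BIGINT, url STRING)
      PARTITIONED BY (click_date STRING)
      CLUSTERED BY (user_id) INTO 32 BUCKETS
      STORED AS ORC
    """)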
Responsible for continuous monitoring and managing the Elastic MapReduce (EMR) cluster through the AWS console.
Used Hadoop to accelerate the extraction, transformation, and loading of massive structured and unstructured data.
Deployed application JAR files to AWS instances.
Adept in scheduling and automating Sqoop jobs for incremental runs.
Designed and developed batch processing data pipelines on Amazon EMR using Apache Spark, Python, and Scala to process terabytes of data in a cost-effective and scalable manner.
Developed Spark scripts to import large files from Amazon S3 buckets.
Developed MapReduce programs to filter out unstructured data, and built multiple MapReduce jobs for data cleaning and pre-processing.
Strong experience in configuring Sqoop to handle complex data structures such as nested and hierarchical data.
Knowledge of Spark RDD optimization techniques, such as data partitioning, shuffle tuning, and pipelining, and their impact on query performance and resource utilization.
Involved in writing incremental data to Snowflake.
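A sketch of an append-only Snowflake write using the spark-snowflake connector; every option value below is a placeholder:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().getOrCreate()

    // Connection options for the spark-snowflake connector (placeholders).
    val sfOptions = Map(
      "sfURL"       -> "account.snowflakecomputing.com",
      "sfUser"      -> "etl_user",
      "sfPassword"  -> "******",
      "sfDatabase"  -> "ANALYTICS",
      "sfSchema"    -> "PUBLIC",
      "sfWarehouse" -> "ETL_WH"
    )

    // Hypothetical incremental slice produced earlier in the pipeline.
    val increment = spark.read.parquet("s3a://example-bucket/orders_increment/")

    increment.write
      .format("net.snowflake.spark.snowflake")
      .options(sfOptions)
      .option("dbtable", "ORDERS")
      .mode("append") // append only the new records
      .save()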
Developed code, peer-reviewed assigned tasks, and fixed bugs.
Skilled in working with binary and textual data formats in Spark, such as CSV, JSON, and XML, and their serialization and deserialization using Spark DataFrames and RDDs.
Executed change and incident management processes.
Ability to troubleshoot common issues with Hive performance, such as out-of-memory errors, query hangs, and slow query execution times.
Involved in requirement gathering, design, and deployment of the application using Scrum (Agile) as the development methodology.
Designed and developed Spark applications to implement complex data transformations and aggregations for batch processing jobs, leveraging Spark SQL and DataFrames.
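As one example of such an aggregation, a window-function sketch in Scala (data and column names are hypothetical):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.{row_number, sum}

    val spark = SparkSession.builder().getOrCreate()
    import spark.implicits._

    val sales = Seq(
      ("c1", "2024-01-01", 120.0), ("c1", "2024-01-02", 80.0), ("c2", "2024-01-01", 50.0)
    ).toDF("customer", "order_date", "amount")

    // Running total per customer, plus each order's rank by amount.
    val byDate   = Window.partitionBy("customer").orderBy("order_date")
    val byAmount = Window.partitionBy("customer").orderBy($"amount".desc)

    val enriched = sales
      .withColumn("running_total", sum("amount").over(byDate))
      .withColumn("amount_rank", row_number().over(byAmount))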
Used JIRA for bug tracking and Bitbucket to check in and check out code changes.
Exported data from HDFS to an RDBMS via Sqoop for business intelligence, visualization, and user report generation.