Discover the power of our data conversion lead for Engineering

Empowering businesses with a cost-effective solution, airSlate SignNow's data conversion lead for Engineering stands out for its great ROI and ease of use

airSlate SignNow regularly wins awards for ease of use and setup

See airSlate SignNow eSignatures in action

Create secure and intuitive e-signature workflows on any device, track the status of documents right in your account, build online fillable forms – all within a single solution.

  • Collect signatures 24x faster
  • Reduce costs by $30 per document
  • Save up to 40h per employee per month

Our user reviews speak for themselves

Kodi-Marie Evans
Director of NetSuite Operations at Xerox
airSlate SignNow provides us with the flexibility needed to get the right signatures on the right documents, in the right formats, based on our integration with NetSuite.
Samantha Jo
Enterprise Client Partner at Yelp
airSlate SignNow has made life easier for me. It has been huge to have the ability to sign contracts on-the-go! It is now less stressful to get things done efficiently and promptly.
Megan Bond
Digital marketing management at Electrolux
This software has added to our business value. I have got rid of the repetitive tasks. I am capable of creating the mobile native web forms. Now I can easily make payment contracts through a fair channel and their management is very easy.
Walmart, ExxonMobil, Apple, Comcast, Facebook, FedEx

Why choose airSlate SignNow

  • Free 7-day trial. Choose the plan you need and try it risk-free.
  • Honest pricing for full-featured plans. airSlate SignNow offers subscription plans with no overages or hidden fees at renewal.
  • Enterprise-grade security. airSlate SignNow helps you comply with global security standards.

Data conversion lead for engineering

Are you searching for a reliable solution to streamline your document signing process? Look no further than airSlate SignNow by airSlate! With airSlate SignNow, businesses can easily send and eSign documents, making it a cost-effective and user-friendly option for all your document management needs. Whether you're an engineering team looking for a data conversion lead or work in any other industry, airSlate SignNow has you covered.

Data Conversion Lead for Engineering - How-To Guide

Streamline your document workflow with airSlate SignNow and experience the benefits of easy document management. Try airSlate SignNow today and see how it can transform the way you handle documents in your engineering projects.

Sign up for a free trial of airSlate SignNow now!

airSlate SignNow features that users love

Speed up your paper-based processes with an easy-to-use eSignature solution.

  • Edit PDFs online: Generate templates of your most used documents for signing and completion.
  • Create a signing link: Share a document via a link without the need to add recipient emails.
  • Assign roles to signers: Organize complex signing workflows by adding multiple signers and assigning roles.
  • Create a document template: Create teams to collaborate on documents and templates in real time.
  • Add signature fields: Get accurate signatures exactly where you need them using signature fields.
  • Archive documents in bulk: Save time by archiving multiple documents at once.

Get legally-binding signatures now!

FAQs online signature

Here is a list of the most common customer questions. If you can’t find an answer to your question, please don’t hesitate to reach out to us.

Need help? Contact support

Trusted e-signature solution — what our customers are saying

Explore how the airSlate SignNow e-signature platform helps businesses succeed. Hear from real users and what they like most about electronic signing.

Made Hiring so Much Easier
5
Anna S

What do you like best?

Made our onboarding so much easier. New hires are able to send information and get in faster! It is so much easier to be able to send this to a new hire. Now we are able to send this to them and we can see who is coming in before and prepare for our day. Spend your time on training instead of filling W2 all day. Also cleared up so much room in our filing cabinets.

Read full review
I love the ease & convenience of airSlate SignNow
5
Bruce E

What do you like best?

I love the ease & convenience of airSlate SignNow. It is user-friendly — and just as easy to use on my phone as it is on my desktop!

Read full review
airSlate SignNow is so helpful for any type of biz
5
Agency

What do you like best?

It’s so easy to use! We upload our agreements, contracts, accounting paperwork, waivers, etc. then add a few quick fill in or signature spots and send it off to clients or vendors for signature. Easy peasy. And we love that we always have a record of signed docs showing when they were signed for our records. And the reminder send is great for forgetful or busy signers.

Read full review

A career talk on data engineering (video transcript)

Thanks, Brian, for this opportunity and for that great intro. Hello everyone. Brian gave you some of my background, so briefly: I currently work as a data engineering lead for a company called CAFU in Dubai. Previously, Brian and I worked together at a company called Property Finder; Brian was in the geospatial space and I was in data engineering, and we worked on a couple of projects together setting up the geo infrastructure and anything related to data, which is where we built our professional relationship. Prior to that I was in the United States, where I did my master's in computer science and, like many of you who are doing your master's with parallel internships in the geospatial area, I did a couple of internships. I interned at Qualcomm on an integrated analytics team and performed a lot of network data analytics related to the connectivity we use today, 3G, 4G and 5G. I was working on that type of data back in 2017, which means the work on 5G was already going on then, and now it is slowly being rolled out. After the internship I got a full-time role at the same company and grew my data skills, which landed me a senior data engineering job in Dubai, and I have since transitioned to a data engineering lead position.

Today I will walk you through what the data engineering space looks like: how you might transition your career, what tools and technologies you should be working on, the core skills required, what is happening in the industry, how we consume data, where we normally store it, how the processing happens, how ETL and ELT pipelines are set up, and how the results are presented, because management and key stakeholders, usually the C-level people in your company, are mainly interested in visualizations and insights they can use to take key business decisions. That is how the data industry is moving today. At the end we will open up for questions, and anything you want to explore in more depth we can discuss during the Q&A.

A note on structure: I don't have any PowerPoint slides, because I want to give you a real-world perspective rather than a bunch of text on screen, which can get boring. I will simply talk about what is actually happening in the industry.

So, my story as a data engineer at CAFU. When I was hired I was given a very big task: the company had no data platform at all, and I was the first data engineering hire. Analytics at the time looked like this: we had Power BI, a visualization tool (similar to Tableau or QuickSight, if you haven't heard of it) used for building dashboards. A couple of analysts in the BI team connected Power BI directly to the read replicas, in other words directly to the production sources. The downside of such a setup is that you can't do any kind of complex analysis. Since it was the early phase of the company, which was only about two and a half years old, it went well for a while: you write some queries inside the BI tool, the queries run against the actual sources, the data comes back, and you build visualizations. But as soon as the data grows you start facing problems, and the same happens once your data sources grow.
Your data sources don't only live in an RDBMS; they also come from event systems and from IoT. By event systems I mean something like CleverTap, which integrates with your mobile application and records whatever events happen inside it, for example when you view something or click something. All of that data can be used for user behavior analysis: how people use the application, what should change in it, and where we should improve. As the number of data sources increases, a setup where the BI tool connects directly to production simply won't work anymore, and you will be limited in what you can do.

That was the challenge: to build an architecture where things can scale, where you can store and process more data, and where you give data scientists and the analytics team the power to do data exploration, join across various sources, structured and semi-structured, with good performance, so they can work on the variety of use cases they have in mind.

The architecture I came up with, which has become quite popular in the data industry, was a data lake. What does the term data lake mean? Many of you already work with data in the geospatial area, and if you have taken any database courses you might have come across the terms data lake and data warehouse; there is also a newer term, the data lakehouse, which we will come to. For those who don't know: a data lake is a central repository. If you have ten sources in your company generating data, you get the data from all ten sources and store it in a central repository in its raw format. You don't do any processing over it; you store the data exactly as it lies in the actual sources. That central repository is what is termed a data lake.

To simplify it, imagine a library. A library has a librarian, and it has books on many areas: politics, technology, arts, history. Those different streams are like the different data sources in your company, all stored in one central location, which is the library.
To manage and administer the library, to take care of the books, to order new books (ordering new books is like onboarding new data sources and new datasets), you need someone to perform those activities, and in a library that is the librarian. The librarian is the data engineer. A data engineer is responsible for making sure the data is secure wherever it lies, for building the processes that bring data from each new source into the central location, for administering the data, for setting up access controls over it, and for the necessary things around encryption and information security. Data engineers are often called the plumbers of data science, because they do the hard, messy groundwork: data can become very messy, all of that has to be handled by the data engineer, and it can get quite complex, so you need that level of skill to solve those issues.

Once the data lies in a data lake, that is where all the processing usually happens in today's industry. What you have done is shift the processing from the sources to the data lake. The advantage is that you have data from all sources in one central location, so viewing the data across sources and joining across different sources becomes much simpler. That is exactly the power the BI people and data scientists in our company needed, and with it they were able to do a lot of complex analysis. For example, because we are in the fuel delivery space, we had a problem around the routing engine. We have trucks going around in the field delivering orders, and predicting ETAs and deciding routes for the trucks can only be optimized if you have enough data. After collecting data from the event systems, from third-party systems such as the fleet management platform where all the truck information is stored, and from IoT (we have a lot of sensors on board each truck: fuel level sensors, the fuel dispenser, and many other truck parameters), all of it comes through the data pipelines and blends together. The data scientists and analysts can do their data crunching, write SQL, and come up with solutions for the use cases they are working on.
That is how we were able to optimize our routing engine and give nearly accurate ETAs to the customers ordering fuel. That is the power of data, and to have that power you have to organize the data well. That is why the data lake is now considered the single source from which all the analytics is driven.

Once data lies in the data lake, the BI team or your data scientists do data exploration and derive some initial insights. Then, depending on the business use case, the data engineer works with them to create ETLs. An ETL can be daily, hourly, or real time; different use cases call for different designs or design patterns, and all of these ETLs have to be scheduled. What usually happens in an ETL is cleaning and transforming. The data in the data lake is raw; there can be noise in it, and it is not perfect, which is why you need ETL processes to clean it. One example is timestamps. Data coming from a MySQL source may store timestamps in UTC, while another source, say Postgres, stores them in some local time zone, maybe Dubai time or an American time zone. To give users a unified, standardized view, you have to convert those timestamps into a standard format, and you also have to standardize the date format itself (year-month-day versus month-day and so on). Which time zone and which format you stick to is business logic that goes into the ETL, and that is also where you decide how to fill any missing values. That is the cleaning and transforming part.

Once that is done, the data goes into a target, and that target is usually a data warehouse. Since the industry now mostly works on the cloud, there are a lot of cloud-based data warehouses; one of them is Redshift, built and managed by AWS. If your company already runs on AWS, it is a matter of a few clicks to instantiate a cluster and start using it. We use Redshift because it is a managed solution and because it can handle petabyte scale: even if the data grows to petabytes, Redshift is designed so you can scale and handle the load. As with any database there is some maintenance, but the majority of it is handled by AWS; you take care of a few things here and there, such as the distribution and partitioning schemes that decide how your data is spread across the nodes, and some metadata management activities.
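As an illustration of the timestamp standardization described above, here is a minimal sketch in Python using pandas. The column names and time zones are hypothetical; the idea is simply to attach each source's known zone to its naive timestamps and convert everything to UTC before loading.

```python
import pandas as pd

def standardize_timestamps(df: pd.DataFrame, column: str, source_tz: str) -> pd.DataFrame:
    """Parse a timestamp column, attach the source time zone, and convert to UTC.

    `source_tz` is whatever zone the upstream system wrote its timestamps in,
    e.g. "UTC" for a MySQL source or "Asia/Dubai" for a local-time source.
    """
    ts = pd.to_datetime(df[column], errors="coerce")   # parse strings into datetimes
    if ts.dt.tz is None:                               # naive timestamps: attach the source zone
        ts = ts.dt.tz_localize(source_tz)
    df[column] = ts.dt.tz_convert("UTC")               # unify everything on UTC
    return df

# Hypothetical usage: two extracts that stored timestamps differently.
mysql_orders = pd.DataFrame({"created_at": ["2024-01-05 10:15:00"]})
pg_orders = pd.DataFrame({"created_at": ["2024-01-05 14:15:00"]})

mysql_orders = standardize_timestamps(mysql_orders, "created_at", "UTC")
pg_orders = standardize_timestamps(pg_orders, "created_at", "Asia/Dubai")
```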
Once the clean, transformed dataset is in Redshift, the next part is presentation. You have your datasets in a target for the use case you are working on; now you need to build visualizations and dashboards and present those results to your management and key stakeholders. So we connected Power BI to Redshift. Earlier, Power BI connected directly to the actual sources; now that path is cut off, and the single source of truth for Power BI is the Redshift data warehouse. All our business use cases, the ETLs related to them, and the clean, transformed datasets live in that warehouse, and all our dashboards are built from the data in Redshift. Power BI itself has mechanisms for when, how, and at what interval to refresh, so you always have up-to-date visualizations.

That is the overall journey. There is the actual source of data: in our case the application data store behind a mobile application, with several backends, some data in MySQL, some in Postgres, some coming from event systems like CleverTap as semi-structured data, and some from IoT, also semi-structured, in JSON format. The second step is bringing that data in, which is not an ETL but an EL job, extract and load: we extract from the source system and load it into the data lake. Once the data is in the data lake, you write your ETLs, which extract from the data lake, transform, and load into the data warehouse, the actual target. After that comes the visualization layer.

You might be wondering how to construct this data lake and which components to use for storage. That depends on your company's decisions. We operate purely on AWS, so we use AWS components for architecting the entire data platform, but you might work on Azure or Google Cloud, and they have similar components: their own data storage and big data processing systems. The architecture doesn't change much; you just use the equivalent components available in the other clouds. Your company might also work on premise, for example for data security reasons, and you can set things up on premise too using open source tools like Hadoop: HDFS, the Hadoop distributed file system, can act as your data lake, your central data store. If you are working on AWS, as in our case, your data lake can be S3. S3 is an object store, the Simple Storage Service; it is quite cheap to store information in, and everything in it is treated as an object.
There is a difference between an actual file system and object-based stores: the difference lies in how the information is stored and retrieved, but it is abstracted away from you. As a user it looks like a normal file system, but under the hood the architecture and technology differ between, say, a Linux file system, NTFS, HDFS, S3, and Azure Blob. In any case, if you are working on AWS, S3 can act as your data lake, holding the data from all your sources.

The next question is what format to store the data in. That is a key decision, and it is actually the decider of whether your data lake environment will succeed or fail. If you are storing terabytes of data, you can't just store CSVs. The core reason is that CSVs are row-based: when you query a CSV, even if your SELECT mentions only one column of one table, the engine scans the entire dataset, which is not efficient at terabyte scale. That is why you have to choose the file storage format carefully. For querying over big data we usually prefer storing the data in Parquet, a columnar file format; ORC is a similar column-based format. There are two big reasons such formats are used: they provide huge performance improvements over CSV and JSON, and they offer a great compression rate. Roughly speaking, a 1 GB CSV file might come down to around 150 to 200 MB stored as Parquet, on the order of 6x to 7x compression; the default compression scheme when writing Parquet is usually Snappy. As for performance: suppose there is an employee table and you need the id and name of employees, so you write SELECT id, name FROM employee. Because Parquet is columnar, only the data for the id and name columns is scanned; even if there are ten other columns, they are skipped and the query is projected only onto the columns you asked for, so much less data is scanned and results come back faster. If the same data were in CSV, even though you mentioned only id and name, the whole dataset would be scanned under the hood. That is why in big data scenarios you go with Parquet or ORC, for the compression and the query performance they give.
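A minimal sketch of the Parquet points above, using pandas with pyarrow (the file name and columns are hypothetical): writing with Snappy compression, then reading back only the columns a query needs.

```python
import pandas as pd

# Hypothetical employee extract; in practice this would come from a source system.
employees = pd.DataFrame({
    "id": [1, 2, 3],
    "name": ["Asha", "Omar", "Lena"],
    "department": ["ops", "finance", "ops"],
    "salary": [4200, 5100, 4700],
})

# Write a columnar, Snappy-compressed file (pandas uses pyarrow under the hood).
employees.to_parquet("employee.parquet", compression="snappy", index=False)

# Read back only the columns the query needs; the other columns are never scanned.
ids_and_names = pd.read_parquet("employee.parquet", columns=["id", "name"])
print(ids_and_names)
```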
So there is another layer: once the data is stored in S3 in Parquet format, what tool do you use to query it? There are a lot of open source tools available, and one of them is Presto, a distributed SQL query engine. It is not a database, it is a query engine, and it works on the concept of querying data where it lives. Your data lives in S3 files; when you run a query, Presto projects it over the data at whatever path you have defined for the table, computes the results, and presents them to you. What usually happens in other database systems, for example MySQL, is that when you run a query the data gets loaded into the RAM of that server; that is why MySQL and similar databases are not used for big data analysis: you can't have unlimited RAM, there is always some limitation, so you have to use systems built for that kind of analysis. Presto (P-R-E-S-T-O) is one such system. You can set up your own Presto cluster, five or six nodes or more depending on your data size, or, if you are on AWS, you can do what I chose to do, because we had a resource crunch, and use the AWS variant, which is called Athena. Athena is the AWS flavor of Presto, fully managed by AWS, so you don't need to worry about how it scales or what resources to allocate per query; the query management is taken care of for you, and it scales by itself in the background. You don't need to instantiate any cluster on your own. If you run a Presto cluster yourself, in house, you have to define the scaling policies; with a managed product all of that overhead shifts to the cloud itself, which brings ease of use and removes a lot of maintenance work. If your team is very small, using managed services becomes more beneficial than self-managed components. All the usual Presto queries work in Athena.

You also need a data catalog. In AWS that is the AWS Glue Data Catalog, where all the metadata lives: everything related to your tables. For example, for an employee table, which columns exist (id, name, department, salary and so on), their data types, and where the files lie on S3. You define this as part of the CREATE statement, and the metadata gets stored in the Glue Data Catalog. Athena interacts with the catalog and, using that information, projects your query over your data on S3 and retrieves the results. That is roughly how it works under the hood; if you want to go deeper, the AWS documentation has more.
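To make the Athena and Glue Data Catalog flow concrete, here is a hedged sketch using boto3: it registers an external table over Parquet files on S3 (the region, bucket, database, and paths are hypothetical) and then runs the column-pruned query from the earlier example. Treat it as an outline rather than a drop-in script; in a real pipeline you would also poll the query status before issuing the next statement.

```python
import boto3

athena = boto3.client("athena", region_name="eu-west-1")  # hypothetical region

def run_query(sql: str) -> str:
    """Submit a query to Athena; results land in the configured S3 output location."""
    response = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": "datalake"},                     # hypothetical Glue database
        ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},  # hypothetical bucket
    )
    return response["QueryExecutionId"]

# Register the table metadata (columns, types, location, format) in the Glue Data Catalog.
create_table = """
CREATE EXTERNAL TABLE IF NOT EXISTS employee (
    id BIGINT,
    name STRING,
    department STRING,
    salary DOUBLE
)
STORED AS PARQUET
LOCATION 's3://my-data-lake/employee/'
"""

run_query(create_table)
# (Omitted: waiting for the DDL to finish via get_query_execution before querying.)
run_query("SELECT id, name FROM employee")  # Athena projects this over the Parquet files on S3
```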
AWS also provides a query editor, the Athena query editor, so you don't have to worry about tooling: everyone in the data team, data scientists, analysts, even our growth team, uses it to query the data in our data lake. For an end user, say a data scientist who was hired for mathematical modeling and statistical knowledge and doesn't know much about software engineering principles or how a big data environment is organized, Athena or Presto feels like a database. It gives that resemblance, but it is not a database: the data lies in plain S3 files, and the beauty is that the Athena/Presto system is designed to project the query over the data where it lives. It is quite an interesting concept, and if you are in the data space and want to build your career there (there is now even a title called geospatial data engineer), you need to know about distributed computing, and going through such architectures is great learning. I recommend reading up on the Presto architecture; it might go over your head at first, but after some reading or a few YouTube videos it will make sense.

The next question is what tool, or what language, you should use to connect to the data sources, crunch the data, and bring it into the data lake. One very good open source tool is Airflow, a workflow management tool used for writing data pipelines: the EL pipelines we talked about (extract from source, load into a data lake) as well as the ETL pipelines (extract from the data lake, transform, and load into the data warehouse). For any of these design patterns you can write your pipelines with Airflow. It was originally developed by Airbnb, is now open source under the Apache Foundation (Apache Airflow), and is entirely Python-based. Everything in the tool is DAG-based: DAG stands for directed acyclic graph, and you define your graph in Python. It has a lot of third-party integrations with all the major databases, file systems, AWS, Azure and GCP, with ready-made wrappers available, and even if a wrapper doesn't exist you have the flexibility to write your own. Similar newer tools have also come up, such as Dagster and Prefect.
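A minimal sketch of what such an Airflow EL pipeline might look like, assuming Airflow 2.x; the extract and load helpers and any connection details are hypothetical placeholders, not something prescribed in the talk.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract_orders(**context):
    """Pull yesterday's rows from the source system (hypothetical helper logic)."""
    # e.g. read from MySQL with pandas.read_sql and stage the frame somewhere temporary
    ...


def load_to_s3(**context):
    """Write the extracted rows to the raw zone of the data lake (hypothetical)."""
    # e.g. df.to_parquet("s3://my-data-lake/raw/orders/2024-01-05.parquet")
    ...


with DAG(
    dag_id="orders_el_daily",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",   # one run per day, i.e. T-1 batch processing
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="extract_orders", python_callable=extract_orders)
    load = PythonOperator(task_id="load_to_s3", python_callable=load_to_s3)

    extract >> load  # the DAG: extract runs first, then the load into the lake
```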
Airflow also offers a visual interface. There is a UI where you can see which pipelines have succeeded and which have failed, and a logging mechanism where you can see why a pipeline failed. You can set up alerts, for example on Slack: if my pipeline that loads data from MySQL into the data lake fails, it triggers an alert to Slack telling me which pipeline failed and pointing me to where to investigate. (Yes, KNIME is also a good tool, but KNIME is mostly used for machine learning activities; it has a lot of built-in mechanisms for ML and it is great for that. If you are purely doing machine learning, regression or classification models, KNIME is great, but Airflow takes care of the entire cycle, not only ML: you are not restricted, you can develop your own wrappers, write your own pipelines, and define your end-to-end data pipeline architecture as a DAG, with monitoring, logging and alerting integrations. It makes your life as a data engineer a lot easier.) So this kind of tool can be used for EL and ETL purposes, for getting data into the data lake and from the data lake into Redshift.

What we have covered so far is mostly batch data processing; tools like Airflow mostly cater to batch. If you have real-time use cases you have to shift gears a bit, because real time is different and more complex, and other components get involved. For example, Kafka can be one of the components for building real-time pipelines, or, if you want an AWS managed service, there is Amazon Kinesis, a variant of Kafka developed by AWS. These systems work on a publisher/subscriber model: when a data change happens in MySQL, for instance, you can publish it to a particular stream in Kinesis, and downstream consumers read from Kinesis, perform processing, and store the result into a target. For the processing itself you have technologies like Spark; AWS Glue ETL jobs give you the capability to run Spark code, and that can be used for real-time pipelines.
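As a small illustration of the publisher side of that Kinesis flow, here is a hedged sketch with boto3; the region, stream name, and event payload are hypothetical.

```python
import json

import boto3

kinesis = boto3.client("kinesis", region_name="eu-west-1")  # hypothetical region

def publish_order_event(order_id: str, status: str) -> None:
    """Publish a change event (e.g. picked up from the MySQL source) to a Kinesis stream."""
    event = {"order_id": order_id, "status": status}
    kinesis.put_record(
        StreamName="order-events",              # hypothetical stream read by downstream consumers
        Data=json.dumps(event).encode("utf-8"),
        PartitionKey=order_id,                  # keeps events for one order on the same shard
    )

publish_order_event("A-1042", "delayed")
```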
That said, in every company I have observed, the number of real-time use cases is relatively small; most of it is batch processing, and most analytics in major companies happens at T-1 (the previous day). You also have to decide whether the use case you are working on really needs to be real time, because you only want to build real-time pipelines if they serve a purpose. For example, in our case we want to know in real time which of our orders are getting delayed; that information is critical at that instant, and finding out four or five hours down the line is of no use. For such use cases, yes, you build the real-time pipeline. But for reporting, how many active users are in my system, what my order count was over the last three months, whether it is going up or down, you don't have to do that in real time; it wouldn't serve any purpose. That is mostly reporting, it can be done in batch, and that kind of analysis can be T-1. So you have to weigh the use case and ask yourself what the end result is: will real time actually solve my problem, or is batch processing enough? Batch processing costs less, and real time makes your cost shoot up exponentially. That is why you dive into your use case, do some analysis and some experiments, and then decide on any production architecture.

Another thing coming up in these architectures nowadays is data validation. Once the architecture is set up, the next place where big data projects, or any data projects, fail is not doing any data validation. When you develop application software you have a software testing team, a quality assurance team, and it is there for a purpose: whatever features you ship need to be of quality and need to be trusted. The same applies to data. You can't just say your data pipelines are flowing; how do you validate that what you have built is right? That is where the industry is moving. Right now there are not many tools for data validation; some are third-party and cost a lot of money, but open source options are coming up. I specifically started using a library called Great Expectations, which is used for data testing and data validation. Once the data is stored in the data lake, or before it goes into the data warehouse, you can run your validation script as a task, a node in your graph, and that node can be called data validation.
What kind of validation? Take the employee table example again. It has a primary key called id, and we know a primary key can't be duplicated, so with Great Expectations you can set up a test on that column: id should be unique, and if a duplicate id is encountered, throw an alert and don't load the data. The same can be done for name: a full name usually doesn't contain any numerals, so you can add a check for whether any numeric characters are creeping into the name column. The id column should also never be null, so you set up another check on the new data arriving day by day: does it have any null values? If a null is encountered, throw an alert, for example to a Slack channel. Around this data validation you can then build dashboards for monitoring it.
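A hedged sketch of those checks using Great Expectations' classic pandas API (the exact API differs between versions, and the sample table and the Slack/alerting step are only placeholders):

```python
import great_expectations as ge
import pandas as pd

# Hypothetical batch of the employee table, freshly landed in the data lake.
batch = pd.DataFrame({
    "id": [1, 2, 3],
    "name": ["Asha", "Omar", "Lena"],
})

df = ge.from_pandas(batch)

# The checks described above: unique, non-null ids and no digits in names.
df.expect_column_values_to_be_unique("id")
df.expect_column_values_to_not_be_null("id")
df.expect_column_values_to_not_match_regex("name", r"\d")

results = df.validate()
if not results.success:
    # Placeholder: in a real pipeline this would alert Slack and skip the load.
    print("Validation failed - do not load this batch")
```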
Monitoring is another important aspect. We specifically use New Relic; AWS itself also gives you quite a lot of monitoring tooling. By monitoring I mean this: you end up with a lot of infrastructure. In my case I used AWS ECS (Elastic Container Service) to set up the Airflow cluster; there is the Redshift component, there is S3, there is RDS for storing data, and there are EC2 instances where processing happens. You have to make sure all of it is up and running. AWS will tell you its services are available 99.99 percent of the time, but there have been cases where things went down; a few months back there was an outage and a major portion of the internet faced intermittent issues. Such things happen, so you need monitoring so that when a disaster happens you know immediately and can take corrective action. AWS provides some tools for this, but they are not that friendly, which is why we went with New Relic for infrastructure monitoring: all your infrastructure components are registered with it, your SLAs can be defined there, and you can set up alerts, for example if 90 percent of the disk space on an EC2 instance is used, send me an alert; if CPU utilization is greater than 80 percent, send me an alert; if memory usage is above 90 percent, send me an alert. With that kind of monitoring in place you make sure your data platform is running as expected, there are no outages, and end users don't face issues.

All of these aspects fall under data engineering. The work of a data engineer is not limited to writing code, ETLs, or data pipelines; it is much more than that. We saw the data validation piece and the infrastructure monitoring piece, and we haven't even touched on information security, another area where you need some level of expertise, because data breaches keep happening, even at very big companies; Amazon and Facebook have had breaches in the past, so you can imagine how important it is to have information security practices in place. There are specific tools for it, and if you are on the cloud the environment gives you a lot: basic things like access control, so that, for example, the marketing department only sees marketing-related data and is restricted from viewing anything else. That builds on the concept of data marts, which are specific to a business unit and restrict users from viewing data outside their unit. The data engineer is responsible for that principle too. In short, the data engineer gets the data, cleans and transforms it, makes sure the right data is in place, catches and repairs any anomalies, and does a lot of housekeeping on top.

As for the presentation layer, mathematical modeling or machine learning only accounts for about fifteen to twenty percent of the work. Eighty percent goes into setting up the infrastructure, monitoring it, writing your ETL processes, monitoring and fixing them, and cleaning the data; the remaining twenty percent is where you develop your machine learning models and visualizations. The big chunk is engineering, and if it is not set up properly, for example if data validation is not in place, then garbage comes in, and it is garbage in, garbage out: the visualizations and machine learning models you build won't perform properly because you are not feeding them quality data, the decisions taken from them won't be accurate, and that will hurt your business. That is why data engineering is so important. I have only touched on a few of the topics, and since we are running out of time I will stop my session here. I would like to hear from you: did you follow, and do you have any questions we can pick up?
[Brian] That was actually great and clear. I hope it's now clearer what a data engineering role entails. Just to summarize: you get data from various sources into a data lake, which is extraction and loading; you don't do any transformation there, you store the data as-is, and it's better not to keep it in CSV format because querying CSVs is compute-intensive, so you use better formats. After that you do the ETL, extracting from the lake and loading into a data warehouse, and that is where the dashboards and analytics happen, with many tools in between. So data engineering is about getting the data from the source, processing and storing it, and making it usable for data scientists and machine learning engineers.

[Speaker] Yes, basically that's it. People usually gravitate toward data science and forget about the engineering behind it, but the demand is there. The industry has started realizing that a data scientist cannot perform all the activities: all the data quality mechanisms and all the setup cannot be done by one person. To take those responsibilities off the data scientists, so they stay focused on mathematical modeling and the business use case, you need data engineers to power them, and so the demand for data engineers keeps increasing.

[Brian] We've noticed it too: we never used to see roles titled geospatial data engineer, but since last year they have been popping up, so it's a trend, and it pays to be an early bird and pick it up. Does anybody have a question, or was anything unclear?

[Attendee] Hello, yes. I sent a bunch of questions in the chat while listening. Most of them have already been answered, but one that wasn't: through the first three quarters of the talk I was wondering where I should plug in my machine learning pipelines. Should I learn from the data lake, or from the streamed, near-real-time data?

[Speaker] Let me repeat it as I understood it: you want to know at what stage you should plug in your ML pipelines. Usually that stage is once you have the data in a central store like the data lake. At that point the data acquisition phase is done and you can start your data science activity: decide what data you need from the lake, do some data exploration, load that particular dataset, and then work through the rest of the process, data cleaning, dropping the features you don't need, model training and testing, and then deployment. A good starting point is that once the data is in the data lake, you can kick off your ML pipelines.
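As a small, hedged illustration of that answer, the sketch below reads a feature table from the lake and fits a simple model; the Parquet path, columns, and the choice of scikit-learn are hypothetical, not something prescribed in the talk.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical curated dataset sitting in the data lake (reading an s3:// path
# additionally requires the s3fs package; a local path works the same way).
orders = pd.read_parquet("orders_features.parquet")

# Basic cleaning: drop rows with missing values and columns the model doesn't need.
orders = orders.dropna().drop(columns=["order_id"])

X = orders.drop(columns=["is_delayed"])   # hypothetical label: was the order delayed?
y = orders["is_delayed"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("holdout accuracy:", model.score(X_test, y_test))
```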
[Attendee] Cool. There are other questions, but if I had to pick one in the interest of time: how do I scale this very interesting architecture down to a small company? I have a small team and small data; how do I get the benefit of this advanced data engineering? (To be fair, my small company is imaginary, not a real one.)

[Speaker] Sure. Either way it works, whether the company is on the cloud or in-house. If you can work on the cloud, that's great: once you are in the cloud it doesn't really matter whether you are small or big. For example, a data lake works fine even if your data is only gigabytes; if it grows to terabytes you might have to tweak some things here and there, file formats and some query performance mechanisms, but it shouldn't matter much. And if you don't want to work on the cloud and prefer in-house, you can use open source tools. As I mentioned, HDFS, the Hadoop distributed file system, can be your data lake, and you can simulate it right on your laptop; you don't even need a cloud environment. If you have a reasonably good machine you can set up HDFS locally, and you can set up Airflow too and define the related pipelines, because Airflow is also free: you can build a Docker image and spin it up on your local machine. The open source variants are there, and you can actually simulate this entire architecture on your laptop.

[Attendee] Okay, cool. I'll give the floor back for other questions. Thank you.

[Speaker] Thanks for your insight and comments; I hope you enjoyed it and got to know new things from the talk.

[Brian] I think I can answer one of your other questions: how different is a geospatial data engineer from a general data engineer? Based on what I have been seeing, and I can even share a role here, the main difference is that you focus on the geo component. A general data engineer may not have dealt with things like reference systems or processing tons of satellite imagery; that geospatial domain knowledge is what often keeps general data engineers from picking these roles up, and most of the geospatial data engineering roles I see tend to involve some satellite data processing. Let me see if I can share my screen to show the one I mentioned earlier. And nowadays, with satellite data and the tools you were talking about, Presto and Athena, things are maturing enough that they have included
a lot of geospatial functions in their vocabulary: anything related to points and coordinates, the way PostGIS works, the functions you use with PostGIS, is now available in Presto and Athena.

[Speaker] Yes, and you can do a lot of that in Redshift too now.

[Brian] Right, Amazon is implementing the spatial functions on Redshift, and Google BigQuery on GCP as well (the only one lagging a bit might be Azure). It's really exciting to see these platforms catch up and implement geospatial data engineering capabilities. So the difference really comes down to things like reference systems and the domain knowledge for geo work. Anybody else, or do we call it the end? Is everyone satisfied? Give a thumbs up so we're sure we are closing at the right time. Okay, perfect.

For people who are really interested in geospatial data engineering: you need to know Python and SQL, obviously, then pick up a cloud platform, maybe AWS, play around with it and understand it, and learn how to do data warehousing and data lakes. That's a good start for these roles.

[Speaker] Also, Brian, based on feedback, if there is any specific topic in data engineering where people want more insight, for example how to write a data pipeline in Airflow using Python, how Python plays a role within the data pipeline itself, or how to do the data crunching and munging, we can have another session. This time we didn't have any slides or anything hands-on, but next time, depending on the interest of the audience, we could do a hands-on session and build something live; maybe that will make what we spoke about today even more relatable.

[Brian] Yes, sure. So, everyone, if there is a specific area of data engineering you want us to start with, let us know and we can have him over again to show us how to do it. It was good having all of you; I hope it's now clearer what the role entails. See you all next time.

[Speaker] Thanks, Brian. Thanks a lot, everyone. Bye.


Get legally-binding signatures now!
