A Simple Guide to Data Science & Engineering
May 10, 2021

It’s no wonder that many tech companies have come to depend on data science. There has never been a better opportunity to make a powerful impact with your data. The terms “digital” and “data” have become the talk of the town, and data science career options have exploded: according to LinkedIn, hiring for these roles has increased by 46% since 2019.

Data Science is revolutionizing businesses across industries. But what do Data Science and Engineering stand for, and what value do they bring to businesses? What is the business importance of data? This article provides answers.

In this article, we will:

  1. Cover the basics of Big Data and Data Science 
  2. Cover the basics of Data Engineering 
  3. Point out HTEC’s role in creating powerful data science solutions

What Is Big Data?

Big data is a common term for any collection of data sets that are too large or complex to be processed using traditional data management techniques. Data science involves using methods for analyzing massive amounts of data and extracting the knowledge it contains. To help you understand this concept more clearly, try thinking of the relationship between big data and data science as the relationship between crude oil and an oil refinery.

The Characteristics of Data 

The characteristics of big data are often referred to as the three Vs:

  • Volume – How much data is there?
  • Variety – How diverse are different types of data?
  • Velocity – At what speed is new data generated?

Often these characteristics are complemented with a fourth V, Veracity: How accurate is the data? 

These four properties make big data different from the data found in traditional data management tools. Consequently, the challenges they bring can be felt in almost every aspect: data capture, curation, storage, search, sharing, transfer, and visualization. In addition, big data calls for specialized techniques to extract the insights.

The Types of Data 

There are three types of data:

  • Structured
  • Semi-structured
  • Unstructured

To use data and create reliable information from it, we first need to recognise its structure. Structured data is clearly defined and searchable, semi-structured data (such as JSON or XML) carries tags or keys but no rigid schema, and unstructured data is usually stored in its native format. Structured data tends to be quantitative, while unstructured data tends to be qualitative.
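
To make the distinction concrete, here is a minimal sketch that handles one small example of each type. The sample values and field names are made up for illustration.

```python
import csv
import io
import json

# Structured: fixed columns, every row follows the same schema.
structured = io.StringIO("customer_id,amount\n1,20.5\n2,13.0\n")
rows = list(csv.DictReader(structured))
total = sum(float(row["amount"]) for row in rows)

# Semi-structured: self-describing keys, but fields can vary per record.
semi_structured = json.loads(
    '{"customer_id": 1, "tags": ["vip"], "address": {"city": "Belgrade"}}'
)
city = semi_structured["address"]["city"]

# Unstructured: free text kept in its native form, with no schema to query.
unstructured = "Customer 1 called and said the invoice for 20.5 EUR was wrong."
mentions_invoice = "invoice" in unstructured.lower()

print(total, city, mentions_invoice)
```

The structured and semi-structured records can be queried by field name, while the free text can only be searched or passed to text-processing techniques.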

What Is Data Science?

The data science process typically consists of six steps:

  1. Setting the research goal
  2. Retrieving data
  3. Data preparation
  4. Data exploration
  5. Data modeling or model building
  6. Presentation and automation

Setting the Research Goal

Data science is mostly applied in the context of an organization. In this first step, we discuss what we’re going to research, how the company benefits from it, and what data and resources we need. This stage also includes creating a timetable and agreeing on deliverables.

Retrieving Data

The second step is to collect data. We should know which data we need and where we can find it. This step includes checking that the data exists, assessing its quality, and confirming that we can access it. Data can also be delivered by third-party companies and can take many forms, ranging from Excel spreadsheets to different types of databases.

We want to have data available for analysis, so it is crucial to find suitable data and get access to it from the data owner. The result is data in its raw form, which probably needs polishing and transformation before it becomes usable.

Data Preparation

Data collection is an error-prone process. In this phase we enhance the quality of the data and prepare it for use in subsequent steps. This includes transforming the data from a raw form into the data that’s directly usable for the clients. You can learn more about this below, in a separate chapter called Data Preparation.

Data Exploration

Data exploration helps you understand data more deeply. It provides you with an insight into how variables interact with each other, the distribution of the data, and whether there are outliers. To achieve this we mainly use descriptive statistics, visual techniques, and simple modeling. 

During exploratory data analysis, we take a deep dive into the data. To make data and the interactions between variables easier to understand, we mainly use graphical techniques.  
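
As a small illustration of descriptive statistics and outlier detection, the sketch below uses Python’s standard `statistics` module and the common 1.5 × IQR rule. The sample values are made up, and the IQR rule is just one possible convention for flagging outliers.

```python
import statistics

# Made-up sample with one suspicious value.
values = [12.0, 13.5, 12.8, 14.1, 13.2, 12.9, 41.0, 13.7]

mean = statistics.mean(values)
median = statistics.median(values)

# Quartiles, then the usual 1.5 * IQR fences.
q1, _, q3 = statistics.quantiles(values, n=4)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [v for v in values if v < lower or v > upper]

print(f"mean={mean:.2f} median={median:.2f} outliers={outliers}")
```

A mean that sits well above the median is already a hint that a few large values are pulling the distribution, which the fence check then confirms.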

Data Modeling or Model Building

At this stage, models, domain knowledge, and data insights we gained in the previous steps are used to answer the research question. First we select techniques from the fields of statistics, machine learning, operations research, and so on. 

Building a model is an iterative process that involves selecting the variables for the model, executing the model, and model diagnostics. The goal is to create an adequate model which will predict the future behavior of the data. Keep in mind that you need to be careful when choosing algorithms and techniques for prediction, because one algorithm does not suit every problem.
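
The loop of selecting variables, executing the model, and running diagnostics can be sketched with a very small example. This is only an illustration: it assumes a single input variable, uses ordinary least squares as the technique, and checks the model on held-out points with mean absolute error. The data is made up.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def mean_abs_error(model, xs, ys):
    """Diagnostic: average absolute prediction error on the given points."""
    slope, intercept = model
    return sum(abs(slope * x + intercept - y) for x, y in zip(xs, ys)) / len(xs)


# Made-up data: the model is fitted on one part and checked on the rest.
train_x, train_y = [1, 2, 3, 4, 5], [2.1, 3.9, 6.2, 7.8, 10.1]
test_x, test_y = [6, 7], [12.0, 13.9]

model = fit_line(train_x, train_y)
error = mean_abs_error(model, test_x, test_y)
print(model, error)
```

If the held-out error is too high, we go back a step: choose different variables or a different technique and run the diagnostics again.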

Presentation and Automation

Finally, we present the results to the business. These results can take many forms, ranging from presentations to research reports. In the world of Big Data, Data Visualization tools and technologies are essential to analyze massive amounts of information and make data-driven decisions.

One goal of a project is to change a process and/or make better decisions. The importance of this step is more apparent on a strategic and tactical level. Certain projects require us to perform the business process over and over again, so automating the project will save time.

What Is Data Engineering?

Data Engineering is a relatively new term in an old field. It originated from Data Science, with overlapping goals but a different focus. The main goal of data engineering is to design, build, and structure data pipelines, ETL flows, and analytical platforms. Data engineering takes care of data collection, data cleansing, data modelling, and data serving.

How to Store Data and Process It?

Traditional data warehouse systems use a model called schema on write, which means that we need to know the schema before we store the data. Collecting data in a traditional system was slow, because a lot of time was spent analyzing the data, understanding business cases, and so on. With the Data Lake revolution (a Data Lake is a centralized Big Data repository that allows us to store structured or unstructured data), the industry switched to a schema on read model.

This model enables us to collect data even if we do not yet know its purpose. Only 12% of the data stored in the world is being used or brings value to businesses. This suggests that companies are taking advantage of the schema on read model, collecting massive amounts of data without having real knowledge of it.

As our focus here is only to store and collect the data, we can store a significant amount of data in a short period of time with minimum effort. 
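
The difference between the two models can be sketched in a few lines. This is a simplified illustration, not how a warehouse or a data lake is actually implemented: the `SCHEMA` dictionary, the function names, and the in-memory lists standing in for storage are all assumptions.

```python
import json

SCHEMA = {"user_id": int, "amount": float}  # hypothetical schema


def write_with_schema(store, record):
    """Schema on write: reject records that do not match before storing."""
    for field, expected in SCHEMA.items():
        if not isinstance(record.get(field), expected):
            raise ValueError(f"bad or missing field: {field}")
    store.append(record)


def write_raw(store, payload):
    """Schema on read: store the payload as-is."""
    store.append(payload)


def read_with_schema(store):
    """Apply the schema only at read time and skip what does not fit."""
    for payload in store:
        try:
            doc = json.loads(payload)
            yield {"user_id": int(doc["user_id"]), "amount": float(doc["amount"])}
        except (ValueError, KeyError, TypeError):
            continue


warehouse, lake = [], []
write_with_schema(warehouse, {"user_id": 1, "amount": 9.5})
write_raw(lake, '{"user_id": "2", "amount": "4.25", "extra": true}')
write_raw(lake, "not even json")
parsed = list(read_with_schema(lake))
print(parsed)
```

Writing to the lake never fails, which is why ingestion is fast. The cost is paid later, when each reader has to decide how to interpret or discard what was stored.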

This will require a good and structured organization of storage, especially for later stages and analytical workloads. A common pattern is to divide the data lake into zones/layers:

  • Raw 
  • Cleanse
  • Application

It’s really important to partition your data wisely, as that can reduce your analytical processing time and query costs. 

Depending on the granularity and frequency of ingestion/extraction, data can be partitioned in different ways. 

For example, daily partitioning:  

  • /Users
    • /2021
      • /04
        • /25
          • rawFile.xml
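
A daily partition path like the one above can be built from the ingestion date. The sketch below is a minimal illustration: the function name is hypothetical, and prefixing the path with the zone name (for example `Raw`) is an assumption about how the zones map to folders.

```python
from datetime import date
from pathlib import PurePosixPath


def daily_partition(zone, entity, day, filename):
    """Build /<zone>/<entity>/<yyyy>/<mm>/<dd>/<filename>."""
    return PurePosixPath(
        "/", zone, entity, f"{day:%Y}", f"{day:%m}", f"{day:%d}", filename
    )


path = daily_partition("Raw", "Users", date(2021, 4, 25), "rawFile.xml")
print(path)  # /Raw/Users/2021/04/25/rawFile.xml
```

Zero-padded month and day folders keep the partitions sorted lexicographically, so a query for one day or one month only needs to touch the matching folders.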

Once the data is stored in the Raw folder, you can perform cleansing using analytical tools which can communicate with the data lake. In this example, we partitioned our data daily, meaning we will process it once per day in a batch. Data is processed either in batches, or through streaming. Both processing techniques are quite important, and, depending on the nature of data, we should choose one of them. 

Streaming is mostly real-time or near real-time processing, used when we need to capture information from events, IoT devices, sensors, and similar sources as it arrives.

Batch processing handles large amounts of data at once.

A data processing architecture that takes advantage of both batch and stream processing is called the Lambda architecture.
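
A toy version of the idea, counting events per user, might look like the sketch below. The class and function names are hypothetical, and in-memory counters stand in for the real batch and streaming engines.

```python
from collections import Counter


def batch_view(events):
    """Batch layer: recompute totals over the full history in one pass."""
    return Counter(event["user"] for event in events)


class SpeedLayer:
    """Speed layer: update counts incrementally as each event arrives."""

    def __init__(self):
        self.counts = Counter()

    def on_event(self, event):
        self.counts[event["user"]] += 1


def serve(batch, speed):
    """Serving layer: merge the batch view with the recent streaming counts."""
    return batch + speed.counts


history = [{"user": "a"}, {"user": "b"}, {"user": "a"}]
speed = SpeedLayer()
for event in [{"user": "a"}, {"user": "c"}]:  # events since the last batch run
    speed.on_event(event)

merged = serve(batch_view(history), speed)
print(merged)
```

The batch view is accurate but only as fresh as the last batch run, while the speed layer covers the gap until the next run.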

Data Preparation

To transform the raw form of data into the form directly usable for the clients, we’ll detect and correct different kinds of errors in the data, combine data from different data sources, and transform it. If we have successfully completed this step, we can progress to data visualization and modeling. This phase consists of three subphases: 

  • Data Cleansing – Removes false values and inconsistencies from data sources. This step eliminates errors introduced at input: unrealistic values, missing values, blank lines, and deviations from expected values. Data entry is error-prone. Ideally the data would be entered correctly at the source, but this is often not possible, so the corrections are made in code. Learn more about this step below, in a separate chapter called “Cleansing of the Data”.

  • Data Integration – Enriches data sources by combining information from multiple data sources into a single, unified view. Integration begins with the ingestion process and includes steps such as cleansing, ETL mapping, and transformation. It ultimately enables analytics tools to produce effective, actionable business intelligence. Even if a company is receiving all the data it needs, that data often resides in a number of separate data sources. For example, for a single Finance report, we may have to combine details from two important but separate sources, one for incomes and one for costs. Whether these systems work independently or are updated periodically via interfaces, data can fall between the cracks. If different systems can’t share their data in real time to create one unified picture of the business, management can’t gain all the insights it needs to make strategic decisions. Data integration therefore allows for better use of resources, better inventory management, and cost and time savings. Moreover, automatic updates keep the systems in sync and reduce the risk of inconsistent or inaccurate data, which in turn supports quicker revenue recognition.

  • Data Transformation – In this phase, we face different data formats, blank values, unexpected characters, etc. Data transformation ensures that the data is in a suitable format for future analysis. Properly formatted and validated data improves data quality and protects applications from potential landmines such as null values, unexpected duplicates, incorrect indexing, and incompatible formats. Domain knowledge is also very important here: data analysts without appropriate subject matter expertise are less likely to notice typos or incorrect data because they are less familiar with the range of accurate and permissible values.
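
The sketch below illustrates transformation and integration on the Finance example: two sources report incomes and costs per month, but in different formats. The field names, date formats, and sample values are all made up for illustration.

```python
from datetime import datetime

# Two hypothetical sources describing the same months in different formats.
incomes = [
    {"month": "2021-04", "income": "1,200.50"},
    {"month": "2021-05", "income": "980"},
]
costs = [
    {"period": "04/2021", "cost": 700.0},
    {"period": "05/2021", "cost": None},
]


def to_month(value, fmt):
    """Transformation: normalise a date string to YYYY-MM."""
    return datetime.strptime(value, fmt).strftime("%Y-%m")


def to_number(value):
    """Transformation: strip thousands separators, keep missing values as None."""
    if value is None:
        return None
    return float(str(value).replace(",", ""))


# Integration: build one unified view keyed by month.
unified = {}
for row in incomes:
    month = to_month(row["month"], "%Y-%m")
    unified.setdefault(month, {})["income"] = to_number(row["income"])
for row in costs:
    month = to_month(row["period"], "%m/%Y")
    unified.setdefault(month, {})["cost"] = to_number(row["cost"])

print(unified)
```

The missing cost for May stays visible as `None` in the unified view instead of being silently treated as zero, so the cleansing step can decide what to do with it.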

Cleansing of the Data

The extraction process includes retrieving data from different sources and placing it in appropriate storage, such as Azure Data Lake Storage. To make the data readable and allow users to access it directly from files or through reports, it is necessary to clean it first. So, in special “Cleansing” folders, we store data that is ready for use in the “Reporting” part of the project.

Cleansing data is not a straightforward process. It involves fixing or removing incorrect, corrupted, incorrectly formatted, duplicate, or incomplete data within a dataset.

When combining multiple data sources, data can be easily duplicated or mislabeled. If data is incorrect, outcomes and algorithms are unreliable, even though they may look correct. Also, there is no one single way to prescribe the exact steps in the data cleaning process because the processes will vary from dataset to dataset. But it is crucial to establish a template for our data cleaning process so that we know we are doing it the right way every time. 

In practice, we create dedicated cleansing procedures that refine the data every day or every month, depending on the needs and logic of each project. Data cleansing removes data that does not belong in your dataset, while data transformation converts data from one format or structure to another. In most cases these two processes do not exclude each other and are executed together.
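
A reusable cleansing template could look something like the sketch below. It is only an illustration of the idea: the function name, the three checks, and the sample rows are assumptions, and a real project would add its own rules.

```python
def cleanse(rows, required, field, low, high):
    """Split rows into clean and rejected, recording why each row was rejected."""
    seen, clean, rejected = set(), [], []
    for row in rows:
        key = tuple(sorted(row.items()))
        if any(row.get(name) in (None, "") for name in required):
            rejected.append((row, "incomplete"))
        elif key in seen:
            rejected.append((row, "duplicate"))
        elif not low <= row[field] <= high:
            rejected.append((row, "unrealistic"))
        else:
            seen.add(key)
            clean.append(row)
    return clean, rejected


raw = [
    {"id": 1, "age": 34},
    {"id": 1, "age": 34},    # exact duplicate
    {"id": 2, "age": None},  # missing value
    {"id": 3, "age": 240},   # unrealistic value
    {"id": 4, "age": 51},
]
clean, rejected = cleanse(raw, required=("id", "age"), field="age", low=0, high=120)
print(clean)
print([reason for _, reason in rejected])
```

Keeping the rejected rows together with a reason, instead of dropping them silently, makes it easier to review the rules and to report problems back to the data owner.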

One of the biggest challenges when preparing data is that the process has always been very time-consuming. It has been widely publicized that up to 80% of the overall analysis process is spent cleaning or preparing data. As data has increased in size and complexity in recent years, data preparation has only become more demanding.

Model Planning

Data modeling is the process of creating a representation of a whole information system or parts of it. The system is represented via schema and it shows organized attributes with types of data and relationships between them. Model planning is a process that precedes data modeling and it is mostly built around business needs.

There are different types of data models. In our system, models are built on a star schema, a dimensional data model. Star schema is organized into facts (measurable items) and dimensions (reference information). Fact information represents observations or events stored in the table. Dimension information represents a description of business entities stored in the table.
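
A minimal star schema can be sketched with Python’s built-in `sqlite3` module. The table names, columns, and values below are hypothetical and only illustrate the shape: one fact table in the middle, with dimension tables around it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Dimensions: reference information.
    CREATE TABLE dim_customer (customer_key INTEGER PRIMARY KEY, name TEXT, segment TEXT);
    CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY, year INTEGER, month INTEGER);

    -- Fact: measurable events, pointing at the dimensions.
    CREATE TABLE fact_sales (
        customer_key INTEGER REFERENCES dim_customer(customer_key),
        date_key INTEGER REFERENCES dim_date(date_key),
        amount REAL
    );

    INSERT INTO dim_customer VALUES (1, 'Acme', 'Enterprise'), (2, 'Bolt', 'SMB');
    INSERT INTO dim_date VALUES (20210401, 2021, 4), (20210501, 2021, 5);
    INSERT INTO fact_sales VALUES
        (1, 20210401, 100.0), (1, 20210501, 50.0), (2, 20210401, 30.0);
""")

sales_by_segment = conn.execute("""
    SELECT c.segment, SUM(f.amount)
    FROM fact_sales f JOIN dim_customer c USING (customer_key)
    GROUP BY c.segment
    ORDER BY c.segment
""").fetchall()
print(sales_by_segment)  # [('Enterprise', 150.0), ('SMB', 30.0)]
```

Reports aggregate the measures in the fact table and slice them by attributes from the dimensions, which is why every analytical query here is a join from the fact table outwards.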

Because schemas are designed to address unique business needs (and we are talking about big data and complex reports), other schema types also occur in practice, such as the snowflake schema and the galaxy schema.

A snowflake schema is a logical arrangement of tables in a snowflake shape, where dimension tables are normalized and split into additional tables.

A galaxy schema looks like a galaxy, a collection of stars: it contains multiple fact tables that share their dimensions.
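
Both ideas can be shown in one small sketch using Python’s built-in `sqlite3` module. All table and column names are hypothetical: the customer dimension is normalized into a separate segment table (the snowflake part), and two fact tables share the same customer dimension (the galaxy part).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Snowflake: the segment attribute moves out of dim_customer into its own table.
    CREATE TABLE dim_segment (segment_key INTEGER PRIMARY KEY, segment TEXT);
    CREATE TABLE dim_customer (
        customer_key INTEGER PRIMARY KEY,
        name TEXT,
        segment_key INTEGER REFERENCES dim_segment(segment_key)
    );

    -- Galaxy: two fact tables share the customer dimension.
    CREATE TABLE fact_revenue (customer_key INTEGER, amount REAL);
    CREATE TABLE fact_cost (customer_key INTEGER, amount REAL);

    INSERT INTO dim_segment VALUES (1, 'Enterprise'), (2, 'SMB');
    INSERT INTO dim_customer VALUES (10, 'Acme', 1), (20, 'Bolt', 2);
    INSERT INTO fact_revenue VALUES (10, 100.0), (20, 40.0);
    INSERT INTO fact_cost VALUES (10, 60.0), (20, 25.0);
""")

rows = conn.execute("""
    SELECT c.name, s.segment,
           (SELECT SUM(amount) FROM fact_revenue r WHERE r.customer_key = c.customer_key),
           (SELECT SUM(amount) FROM fact_cost k WHERE k.customer_key = c.customer_key)
    FROM dim_customer c JOIN dim_segment s USING (segment_key)
    ORDER BY c.name
""").fetchall()
print(rows)
```

The normalized dimension costs one extra join per query, and each fact table is aggregated separately before the results are combined, which avoids double counting when the fact tables have different numbers of rows per customer.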

Model Planning Steps

  • Identify business rules 

The first step is to collect and understand all business rules from business stakeholders and end-users. Defining all business cases and requirements will let us understand how to structure a data model in the best possible way. 

It is important to see the big picture of how the data will be used across the organization.

As business needs may change over time, new models that reflect the changed needs may also have to be developed.

  • Understand domain

To understand business needs, gain the most important insights, build an easily changed model, and predict additional requirements, understanding the domain is key. Domain knowledge can affect how well the steps of identifying measurable items and their attributes will be implemented.

  • Identify the entities (facts)

Identify the key entities around which the model is built. These can include events, concepts, or things represented in the dataset.

  • Identify the dimensions

Identify the attributes for each entity type. Attributes describe each entity type, and they should be stored in the dimension tables.

  • Assign keys in the tables

The next step is to assign keys: typically a primary key in each dimension table and matching foreign keys in the fact table.

  • Identify relationships

Relationships should be built on the key columns in the fact and dimension tables. Tables interact with each other based on their relations. Relations between the tables also depend on the needs of the end product or report: how the report will be used and how the data will interact both affect the model and the relationships between tables. Building relationships between tables should give you a dataset that can answer the required business questions.

  • Validation of the data model

The final step is to validate the data model: check that all data is stored in the correct way and that entities and dimensions are in the right place, so that the model can serve reports and provide key insights to the business.
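
One simple validation, sketched below, checks that dimension keys are unique and that every fact row points at an existing dimension row. The function name and the sample tables are hypothetical, and a real validation would cover more rules than these two.

```python
def validate_model(fact_rows, dim_rows, key):
    """Report duplicate dimension keys and fact rows with no matching dimension."""
    dim_keys = [row[key] for row in dim_rows]
    duplicates = sorted({k for k in dim_keys if dim_keys.count(k) > 1})
    known = set(dim_keys)
    orphans = [row for row in fact_rows if row[key] not in known]
    return {"duplicate_keys": duplicates, "orphan_facts": orphans}


dim_customer = [{"customer_key": 1}, {"customer_key": 2}, {"customer_key": 2}]
fact_sales = [
    {"customer_key": 1, "amount": 10.0},
    {"customer_key": 9, "amount": 5.0},  # no such customer in the dimension
]

report = validate_model(fact_sales, dim_customer, "customer_key")
print(report)
```

An orphan fact row would silently disappear from any report that joins facts to dimensions, and a duplicate dimension key would double count, so both are worth catching before the model is used.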

How HTEC Expertise Accelerates Data Science & Engineering Projects

Data visualization helps people understand the data and see similarities and differences which are not visible in raw source data.

HTEC Group is working on several projects where we help our clients make better business decisions based on a variety of reports.

Data Science Projects We Work on 

Project #1

We combine the cost and revenue models (Data Integration) to gain more information, which is extremely important for analysis. The information gained from the analysis helps owners and managers identify ways to reduce costs and drive additional revenue. In this way, we are able to calculate margin per customer, per service, etc., which are indicators of a company’s financial health, management’s skill, and growth potential.
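
As an illustration of the kind of calculation involved, the sketch below computes margin per customer from combined revenue and cost figures. It is not the project’s actual code: the function name, the simple definition of margin as revenue minus cost, and the sample numbers are assumptions.

```python
def margin_per_customer(revenues, costs):
    """Return {customer: (margin, margin_pct)}; margin_pct is None without revenue."""
    result = {}
    for customer in sorted(set(revenues) | set(costs)):
        revenue = revenues.get(customer, 0.0)
        cost = costs.get(customer, 0.0)
        margin = revenue - cost
        pct = margin / revenue * 100 if revenue else None
        result[customer] = (margin, pct)
    return result


margins = margin_per_customer(
    {"Acme": 200.0, "Bolt": 80.0},
    {"Acme": 150.0, "Bolt": 100.0, "Core": 10.0},
)
for customer, (margin, pct) in margins.items():
    print(customer, margin, pct)
```

Customers that appear in only one of the two sources are kept on purpose, since a customer with costs but no revenue is exactly the kind of case the integrated view is meant to reveal.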

What’s more, based on the insights into debt, billing and payment information, clients can see different numbers which may indicate a change in business. This approach is especially noticeable if we apply the appropriate charts, visuals, diagrams, etc.

A good diagram is an indicator of a good business. When a line in a chart or an amount in a table deviates from the normal trend, clients can react quickly and easily find what they need. It is also very important to choose the appropriate visual, because no single visual suits every problem. For example, some data is better represented in tables, while other data is best perceived in graphs.

The Tools We Use to Build Solutions

The tools we use allow clients to dynamically manage and change diagrams. This lets them see live changes in the data and its behaviour. On the other hand, to create a good diagram, it is necessary to understand the data and follow all the previous steps: collect, clean, prepare, and integrate the data, create the model and, only then, present it as the final result.

Project #2

Our team also took part in a migration project in which we moved an on-premises data warehouse to the cloud, leveraging Azure Data Platform services such as Azure Data Lake Store, Azure Data Factory, Azure SQL, Azure Service Bus, Cosmos DB, Azure Databricks, and Azure Functions.

We were required to migrate existing data cubes, data models, views, and stored procedures that contain business logic. Besides that, there were also APIs that served data to consumers. The migration was not done one-to-one. It was done in phases, in which we redesigned and improved existing solutions while requiring minimal effort from the business departments.

The migration process was done in 3 phases:

  1. Collecting data from all sources, ingesting and cleaning it
  2. Migrating the existing business logic that was used in on-prem procedures to transform data
  3. Migrating the existing models and reports, and exposing data to consumers

The entire migration was carried out by a team of five. It took a year to complete, keeping in mind that the team was simultaneously working on other aspects of the client’s project. The current total size of the storage, considering all data, is around 40 TB.

Get Started with Data Science with the Right Partner 

HTEC Group was the backbone of the above-mentioned projects and many other projects across industries. Want to start your own data science project? Reach out to us and check out what we can do for you.
