Outline

Part 1: Data in business
- 1.1 Introduction
- 1.2 Data in business: why is it so important
  - 1.2.1 Forecasting: we need to know the future
  - 1.2.2 Decision making: reaction to every action
Part 2: Looking for correlation
Part 3: How to build a model
Part 4: Data scientist

Part 1: Data in business

Author: Dasapta Erwin Irawan

1.1 Introduction

In this online era, we are surrounded by data. Data used to be associated only with laboratory work, school projects, and the like. Now data is all around us. It was once only something we measured; now it is something we trade as goods. People are interested in anything that can be converted into an observation or a measurement. Some say data is the by-product of digital existence. People tend to analyse anything. They are even interested in the rise and fall of “Jennifer” as a girl’s name over time. Data is now everyday talk in coffee shops, or perhaps in the wet market, when people talk about the rise and fall of cabbage prices.

1.2 Data in business: why is it so important

1.2.1 Forecasting: we need to know the future

What’s all the excitement about data analysis? Forecasting is one reason. People always need to know what will happen in the future, given existing conditions as a baseline, with some assumptions and chances of disruption along the way. Forecasting is one of the main parts of business. It is so important that a business proposal is likely to be thrown away if it does not contain any data-based forecasting.

A time series is a collection of measurements of a well-defined item obtained through repeated measurement over time. For example, the value of retail sales measured each month of the year would comprise a time series, because sales revenue is well defined and consistently measured at equally spaced intervals. Data collected irregularly or only once are not time series. A time series can be decomposed into three components: a trend component (long-term direction), a seasonal component (systematic, calendar-related movements) and an irregular component (unsystematic, short-term fluctuations). The decomposition is important in data analysis because what we see in the raw chart doesn’t necessarily reflect what is actually happening. For instance, consider a chart of turkey sales. Sales would likely be high around Thanksgiving, so our first guess would be that it’s a cyclic phenomenon. But what if the peak shifts in a certain year? We wouldn’t expect Thanksgiving itself to have moved. What if there are more than two peaks in turkey sales in a year, and so on?
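
To make the three components concrete, here is a minimal sketch of a classical additive decomposition in plain Python. It is only one of several ways to decompose a series, and the sales figures are made up for illustration: a steady upward trend plus a jump in the eleventh month of every year, loosely imitating the turkey example. The function name decompose_additive and the variable names are assumptions of this sketch, not part of any library.

```python
from statistics import mean


def decompose_additive(series, period):
    """Split a series into trend, seasonal and irregular parts.

    Assumes an additive model (value = trend + seasonal + irregular)
    and at least two full cycles of data.
    """
    n = len(series)
    half = period // 2

    # Trend: centred moving average over one full cycle.
    trend = [None] * n
    for t in range(half, n - half):
        window = series[t - half : t + half + 1]
        total = sum(window)
        if period % 2 == 0:
            # Even period: the window holds period + 1 values, so the
            # two end points each count for half.
            total -= 0.5 * (window[0] + window[-1])
        trend[t] = total / period

    # Seasonal: average detrended value at each position in the cycle,
    # shifted so that the pattern sums to zero over one cycle.
    by_position = [[] for _ in range(period)]
    for t in range(n):
        if trend[t] is not None:
            by_position[t % period].append(series[t] - trend[t])
    raw = [mean(values) for values in by_position]
    centre = mean(raw)
    seasonal = [raw[t % period] - centre for t in range(n)]

    # Irregular: whatever the trend and the seasonal pattern do not explain.
    irregular = [
        None if trend[t] is None else series[t] - trend[t] - seasonal[t]
        for t in range(n)
    ]
    return trend, seasonal, irregular


# Three years of invented monthly sales: rising trend, peak in month 11.
sales = [100 + 2 * t + (50 if t % 12 == 10 else 0) for t in range(36)]
trend, seasonal, irregular = decompose_additive(sales, period=12)

peak_month = max(range(12), key=lambda m: seasonal[m]) + 1
print("Seasonal peak falls in month", peak_month)
```

On this clean, invented series the irregular part is practically zero. With real sales data it would hold whatever the trend and the seasonal pattern fail to explain, such as a peak that arrives in an unexpected month.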

Let’s think of data as something that has its own behaviour. It may behave rhythmically by nature, or erratically. It may be stiff and dull, insensitive to external influence, or it may be flexible and very sensitive to outside parameters. Or it may simply behave erratically, without any major controlling parameter. As the following figure shows, a forecast is part of a loop. It analyses performance and transforms it into decision-making inputs. This loop drives the evolution of a business model or organisation.

Fig 1 A business cycle involving forecast in the loop (from: University of Baltimore web page)

1.2.2 Decision making: reaction to every action

There is a reaction to every action

I’m not into physics, as most of you probably aren’t either, but like any organisation, a business is always changing. Businesses evolve through the test of time. The Apple we see now is not the same as it was back in the 70s. Or, as opposed to the aforementioned proprietary vendor, consider open source software. Take Linux, an open source operating system. First built as a personal project by a Finnish computer science student named Linus Torvalds, Linux is now a multi-billion dollar business. A technology that started as free for all is being transformed into a profit-oriented product. The surprising news is that the free-for-all Linux is still making its way alongside its commercial side. This product can change the way people see free stuff. The operating system itself is still free today; it is the service of building and maintaining Linux systems that is highly commercial. See, that’s evolution, and it involves data.

In the next article, we are going to talk about how we can see correlation between parameters in a data set and how we build a model. Then, in the last article, we will discuss a new profession called data scientist.

Part 2: Looking for correlation

We have discussed how data can change the form of a business or an organisation. All the changes that might happen are based on data forecasts. Now we’re going to talk about the second reason why data is important in business. Instead of only looking at a time series chart, people also need to know what correlations can be drawn from it. Let’s use an example.

A simple correlation case was presented by Bryant and Smith in Practical Data Analysis: Case Studies in Business Statistics, in 1995. They showed a data set containing measurements taken on dining parties in a restaurant by a single waiter. The variables include total bill ($), tip ($), gender of the bill payer, day of the week, and the tip as a percentage of the total bill. They wanted to see which variable or variables have the strongest influence on the total tip in a week. They also compared the tips from male and female customers.

We can see in the chart that total bill size and tip are positively associated (upper left scatter plot), but not as strongly as one might expect, because the variability in tip increases as the bill increases. Both tip and total bill have skewed distributions (upper left histograms), which might lead the analyst to consider log-transforming these variables.

Males spend more on average than females and bills are higher on the weekend (shown in the side-by-side box-plots). The 70% tip on a very small bill by a male on a Sunday may be an outlier. Much can be learned about tipping behaviour by studying this chart.

Fig 2 An example of correlation chart of multiple variables (from: National Library of Australia)
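
The strength of an association such as the one between bill and tip is usually summarised with a correlation coefficient. The sketch below computes the Pearson coefficient by hand in Python, once on the raw values and once after a log transform. The bills and tips are invented numbers, not the Bryant and Smith data, and the helper name pearson is an assumption of this example.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Made-up bills and tips in dollars, for illustration only.
bills = [8.0, 10.5, 12.0, 15.0, 18.5, 21.0, 25.0, 30.0, 42.0, 50.0]
tips = [1.5, 1.5, 2.0, 3.0, 2.5, 3.5, 5.0, 3.0, 8.0, 5.0]

r_raw = pearson(bills, tips)
r_log = pearson([math.log(b) for b in bills], [math.log(t) for t in tips])

print(f"r on raw values: {r_raw:.2f}")
print(f"r on log values: {r_log:.2f}")
```

A coefficient near +1 or -1 means the points lie close to a straight line, while a value near 0 means there is no linear association. It says nothing yet about why the two variables move together.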

But we must take into account that correlation doesn’t always mean causation. In the case above, there is indeed a correlation between the customer’s gender and tipping behaviour. But what drives that behaviour has not been discussed yet. It generally lies beneath the numbers, and we have to dig it out.

Many studies are actually designed to test a correlation, but not a causation. In general, it is extremely difficult to establish causality between two correlated observations, but on the other hand, there are many statistical tools to establish a statistically significant correlation.

You would be surprised how often common-sense conclusions about cause and effect turn out to be wrong. That is because a correlation can arise simply from two things that frequently occur together. A correlation may also be observed when there is real causality behind it; for example, it is well known that cigarette smoking not only correlates with lung cancer but actually causes it. The hardest part is that, in order to establish cause, we would have to rule out the possibility that smokers are more likely to live in urban areas, where there is more pollution, or any other possible explanation for the observed correlation.

So we can say that establishing causality can start from a series of correlations. But we have to add some controlled variables to the analysis, as shown in the smoking example. We have to set the assumptions and narrow down the potential governing variables. We call the result a model.
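
Here is a toy illustration, in Python, of why a controlled variable matters. The records are invented: each has a flag for where the person lives plus an exposure score and an outcome score, with no real smoking or health data behind them. The names pearson, records and within are assumptions of this sketch.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Invented records: (urban flag, exposure score, outcome score).
records = [
    (0, 1, 1), (0, 1, 2), (0, 2, 1), (0, 2, 2),
    (1, 4, 4), (1, 4, 5), (1, 5, 4), (1, 5, 5),
]

# Correlation when everybody is pooled together.
pooled = pearson([r[1] for r in records], [r[2] for r in records])

# Correlation inside each group, i.e. with the urban flag held fixed.
within = {}
for group in (0, 1):
    rows = [r for r in records if r[0] == group]
    within[group] = pearson([r[1] for r in rows], [r[2] for r in rows])

print(f"pooled: {pooled:.2f}")
for group, r in within.items():
    print(f"urban={group}: {r:.2f}")
```

In this made-up data the pooled correlation is about 0.9, yet inside each group it is zero. That is the pattern we would expect if the place of living, not the exposure, were driving the outcome.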

In the 3rd and 4th parts of this “Data Talks” article, we are going to talk about “Model” and “Data Scientist”.

Part 3: How to build a model

The model is the most basic element of the scientific method, and business is just as close to science as physics is. Probably without noticing, we’ve already talked about models in the previous “Forecast” and “Decision Making” parts. Both rest on mathematical models in the form of equations. You probably know about linear regression (see the following figure) or remember learning it in algebra. It is just one model among many that does the actual forecasting for us.

Fig 3 An example of linear regression model (from: National Library of Australia)
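
For readers who want to see the equation behind the picture, here is a minimal least-squares fit of a straight line in Python. The monthly sales figures are made up, and fit_line is a name chosen for this sketch; packages such as R do the same job with many more safeguards.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for y = intercept + slope * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Invented sales for months 1 to 6.
months = [1, 2, 3, 4, 5, 6]
sales = [12.0, 15.0, 17.0, 20.0, 24.0, 26.0]

slope, intercept = fit_line(months, sales)

# Use the fitted line to forecast month 7.
forecast = intercept + slope * 7
print(f"sales = {intercept:.2f} + {slope:.2f} * month")
print(f"forecast for month 7: {forecast:.2f}")
```

The slope tells us how much sales change per month on average, and the forecast is simply the line extended one step beyond the data.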

We also talked about a model when we saw the business loop diagram in the previous article. We deal with models when we buy our children Hot Wheels or Barbie. Even a recipe is a model. So we could say a model is a simplification of what we are actually studying or trying to predict.

This is how we build a model:

1. Data gathering. We talked about it in the forecast article. It can be a long time series, such as a rainfall data set, or it can come from a questionnaire.
2. Setting the assumptions. Most models only work in a controlled environment. Therefore we have to set the boundaries. The more boundaries, the narrower our model will be. How many boundaries should we have? Answering this could be an iterative process with steps 3 and 4.
3. Model fitting. This is the fun part. We can use major proprietary software like Stata, SPSS, and SAS, or choose a free one, like R. These packages contain many equation models that we can pick and test later on.
4. Model calibration. This part is also done automatically by software. Basically, we apply our chosen equation to new data. If the result behaves the same way as the data we modelled in step 3, then we can say our model is actually working. If not, we have to go back to step 3 or even step 2.
5. Model application. This is the phase we like the most. But over time we have to re-evaluate our model against the current situation.
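
The five steps can be sketched end to end in a few lines of Python. This is an assumption-laden toy, not a recipe: the data are invented, the model is a straight line, the calibration check is a simple mean absolute error on data held back from the fit, and the tolerance of 1.0 is an arbitrary, hypothetical threshold.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for y = intercept + slope * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def mean_abs_error(xs, ys, slope, intercept):
    """Average size of the gap between observed and predicted values."""
    gaps = [abs(y - (intercept + slope * x)) for x, y in zip(xs, ys)]
    return sum(gaps) / len(gaps)


# Step 1: data gathering (twelve invented monthly observations).
xs = list(range(1, 13))
ys = [10.2, 12.1, 13.8, 16.3, 18.0, 19.9,
      22.1, 24.2, 25.8, 28.1, 30.0, 31.9]

# Step 2: assumptions. We assume the relationship is a straight line
# and that it holds over the whole range of months.
TOLERANCE = 1.0  # hypothetical limit on acceptable error

# Step 3: model fitting, using the first eight months only.
fit_x, fit_y = xs[:8], ys[:8]
slope, intercept = fit_line(fit_x, fit_y)

# Step 4: model calibration, applying the equation to new data.
new_x, new_y = xs[8:], ys[8:]
fit_error = mean_abs_error(fit_x, fit_y, slope, intercept)
new_error = mean_abs_error(new_x, new_y, slope, intercept)
model_ok = new_error <= TOLERANCE

# Step 5: model application, only if the calibration check passed.
if model_ok:
    print(f"forecast for month 13: {intercept + slope * 13:.2f}")
else:
    print("Model rejected: go back to step 3 or step 2.")
```

Holding back the last four months means the check in step 4 is made on data the fitted line has never seen, which is the point of applying the equation to new data.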

Another thing we have to bear in mind is the Law of Simplicity. The simplest model has a higher chance of being accepted in a business environment. Top executives would probably care less for a model with 11 variables. A model with two or three variables is frequently chosen by a data scientist, previously known as a data analyst. In the 4th part of this “Data Talks” article, we are going to talk about the “Data Scientist”, a new blossoming career for mathematicians, statisticians or computer scientists.

Part 4: Data scientist

It was only about five years ago that people invented a new kind of profession, called “data scientist”. A data scientist represents an evolution from the business or data analyst role. The role requires a solid foundation, typically in computer science and applications, modelling, statistics, analytics and math. We are talking about a powerful career that can predict the future, talk about it, and persuade others. Good data scientists will not just address business problems; they will pick the right problems, the ones that have the most value to the organisation.

Fig 4 A profile of a data scientist (from: EMC² web site)

The work of a data scientist would more or less cover the following aspects (extracted from a Coursera forum):

Formulate context-relevant questions and hypotheses to drive data scientific research
Identify, obtain, and transform a data set to make it suitable for the production of statistical evidence communicated in written form
Build models based on new data types, experimental design, and statistical inference

Aside from proficiency in computer science, math and statistics, a good data scientist must have curiosity, creativity, focus and attention to detail.

Data scientists are needed wherever data is involved in an operation. Companies that hire data scientists include:

Construction companies
Utility companies
Oil, gas and mining companies
Hospitals and health care organisations
Colleges and universities
Federal, provincial/state and municipal government departments
Transportation companies
Telecommunications companies
Insurance, finance and banking organisations
Management consulting companies
Manufacturing companies

As we conclude our talk on data, it’s clear that

Numbers are not just numbers

They can speak

And it’s up to us to listen

dasaptaerwin.net

Several more things about data analysis