Tag Archives: data analysis

Ch 7: Analysis (Scientific writing is fun)

This blog post skips ahead a little from the previous one. This time we jump straight to the Analysis chapter of my next book project (WTF: (scientific) Writing is Totally Fun).

As earth scientists, our main output is usually a map. But as spatial analysis, (geo)statistics, and related techniques have developed, presenting data in tables and charts has become just as important, especially for lay readers (those with no earth science background). They too will scrutinise the numbers shown in the tables and charts.

Here is a concise and straightforward reference on presenting data from the UK Office for National Statistics (ONS). Several of its notes deserve close attention, for example:

  • where to place the variables and the measurement locations (or samples),
  • how to order and compare data in a table,
  • the importance of annotation lines on a chart, for example to show a threshold.
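
As a rough illustration of the last two points, here is a minimal R sketch using the built-in mtcars data set. The 20 mpg threshold is a made-up value chosen purely for illustration; it does not come from the ONS guide.

```r
# Sort the table by the variable readers are meant to compare
cars_sorted <- mtcars[order(mtcars$mpg, decreasing = TRUE), c("mpg", "cyl", "wt")]
head(cars_sorted, 5)

# Bar chart in the same order, with an annotated threshold line
threshold <- 20  # hypothetical threshold, for illustration only
barplot(cars_sorted$mpg, names.arg = rownames(cars_sorted),
        las = 2, cex.names = 0.6, ylab = "Miles per (US) gallon")
abline(h = threshold, lty = 2)
legend("topright", legend = paste("threshold:", threshold, "mpg"),
       lty = 2, bty = "n")
```

Sorting first means the reader can see at a glance which cars sit above or below the line, without hunting through the table.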

I will try to summarise it in the next blog post, drawing on other sources and on my own experience (and opinion).

Screen Shot 2015-07-10 at 05.40.40

(borrowed from: http://style.ons.gov.uk/category/data-visualisation/)

1st Circular: Indonesia R Meet Up

  

Since it turns out that many people have now been “revealed” as R users (from beginner to advanced level), it is time to plan an R meet up. Here is an example: http://r-users-group.meetup.com/.

The Indonesia R User community will hold the first Indonesia R Meet Up, with the theme R 4 All.

The terms of reference (TOR) are as follows:

  • Who may attend: anyone who is interested and joins the R user group.
  • Who may submit an abstract: R users (no restriction on competence) who have joined the R user group.
  • What may be presented: any topic, as long as it uses R.

For now, topics are divided into just two groups:

  1. Natural sciences (including medicine and health)
  2. Social sciences (including economics)

  • Abstract format: 200 words, with a maximum of 5 keywords, covering background, methods, results, conclusions, and recommendations. R code is to be submitted as an appendix.
  • Where to send it: post it on the R User Group Wall.

Please write any suggestions about organising the event in the comments section.

Thank you.

Data is the new soil

Data is not the new oil, but it’s the new soil (David McCandless, TED Talks)
Have you watched David McCandless’s video on YouTube? If not, do watch it. Also take a look at the videos of his teacher, Hans Rosling.
Data can also reveal things hidden behind a phenomenon we are facing. But data in table form is slow to show what is going on, and that is why we need visualisation.
For geologists, a geological map is just one form of visualisation. The map transforms tables of strike and dip, rock types, and so on into rock zonation, cross-sections of rock layers, fault lines, folds, and so on, all on a single sheet of paper.
Below is another example of a visualisation that can reveal something previously unseen. I use R and the mtcars data set, one of the data sets bundled with the R distribution. The data were extracted from the 1974 Motor Trend US magazine. They cover 32 car models built in 1973–1974, with 11 variables: fuel consumption plus 10 aspects of design and performance. The variables are:
mpg    Miles/(US) gallon          in Indonesia, usually expressed as km per litre of fuel
cyl    Number of cylinders
disp   Displacement (cu. in.)     known in Indonesia as cc
hp     Gross horsepower
drat   Rear axle ratio            known in Indonesia as the gear ratio
wt     Weight (1000 lb)           vehicle weight
qsec   1/4 mile time              time needed to cover 0.25 mile from a standstill
vs     V/S                        engine shape (0 = V-shaped, 1 = straight)
am     Transmission               automatic or manual (0 = automatic, 1 = manual)
gear   Number of forward gears
carb   Number of carburetors
These data were first analysed by Henderson and Velleman (1981) in their paper Building multiple regression models interactively. Biometrics, 37, 391–411.
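
A quick way to check the variable list is to inspect the data set directly in R. The sketch below uses only base R. The km per litre conversion is my own addition for readers used to metric units and is not part of the data set.

```r
# mtcars ships with base R (package "datasets"), so no download is needed
dim(mtcars)          # number of rows (cars) and columns (variables)
str(mtcars)          # type and first values of every variable
summary(mtcars$mpg)

# Convert mpg to km per litre: 1 mile = 1.609344 km, 1 US gallon = 3.785411784 L
kpl <- mtcars$mpg * 1.609344 / 3.785411784
round(range(kpl), 1)
```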
Still using MS Excel? You may want to think again.
I will use R to build several chart visualisations, as follows.
With the command pairs(mtcars, main = "mtcars data") you already get a scatterplot matrix, a visual take on the correlation matrix, like the one below. Take a look. Many people say a car’s fuel economy (mpg) is determined only by its engine displacement (cc). That is why, in Indonesia, the price of used cars with large engines goes into “free fall” compared with small-engined ones.
Let’s look at just part of Figure 1 below. Draw the diagonal through the panels labelled “mpg”, “cyl”, and so on, and pick one half only, either the upper or the lower triangle. They’re all the same. Let’s just choose the lower part.
You will see data points that form a straight or nearly straight line (running diagonally), and others that are scattered at random. The first pattern indicates a correlation between the two parameters; the second indicates a very small correlation or none at all.
The patterns that line up most clearly along a diagonal with mpg are:
  • mpg ~ wt -> fuel economy against vehicle weight
  • mpg ~ disp (cc) -> fuel economy against engine displacement
  • mpg ~ hp -> fuel economy against horsepower
A relatively weaker correlation is visible for:
  • mpg ~ drat -> fuel economy against gear ratio
A weaker still (but present) correlation is:
  • mpg ~ qsec -> fuel economy against the time needed to cover 0.25 mile from a standstill
From this we can see that your car’s fuel economy is not determined by engine displacement alone. Weight and horsepower matter too, and so, to a lesser degree, does how quickly the car accelerates (qsec), which I take loosely as a stand-in for driving style.
image
Figure 1 Correlation matrix, black and white
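
To put numbers behind what the eye picks out of the scatterplot matrix, the sketch below ranks every variable by the strength of its Pearson correlation with mpg. It uses only base R. Pearson correlation captures straight-line association only, so curved patterns may be understated.

```r
# Scatterplot matrix of all 11 variables (the command quoted in the post)
pairs(mtcars, main = "mtcars data")

# Pearson correlation of every variable with mpg, strongest first
r_mpg <- cor(mtcars)["mpg", ]
r_mpg <- r_mpg[names(r_mpg) != "mpg"]
round(r_mpg[order(abs(r_mpg), decreasing = TRUE)], 2)
```

A negative sign means fuel economy falls as the variable rises; the absolute value tells you how tight the pattern is.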
If you like colour, the one-line command corrgram(mtcars), from the corrgram package, gives you a chart like the one below.
mtcarscorrgram2
Figure 2 Correlation matrix in colour
The chart in Figure 3 below was also made with a single command, heatmap(as.matrix(mtcars)). The lines (dendrograms) along its edges show the hierarchy of the grouping. The interesting points are:
  • how the Japanese makes cluster with the European makes, while the American makes form a group of their own (except the Dodge Challenger, AMC Javelin, Hornet, and Valiant). I will discuss this another time.
  • the Corolla and Civic are grouped with the Ferrari Dino and Fiat, the Mazda shares a room with the Merc 280, and the Corona and Datsun are in the same room as the Porsche.
Fascinating, isn’t it? This kind of analysis can readily be applied in geology too.
IMG_0020
Figure 3 Heatmap and PCA
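
The grouping behind a heatmap can also be examined on its own. The sketch below is one possible variant, not necessarily what produced Figure 3: it standardises the variables first, because as far as I know heatmap() builds its dendrograms from the matrix as passed in and its scale argument only changes the colours. The choice of three groups is arbitrary.

```r
# Standardise each variable so that large-valued ones (disp, hp) do not dominate
z <- scale(mtcars)

# Hierarchical clustering on Euclidean distances (complete linkage by default)
hc <- hclust(dist(z))
plot(hc, cex = 0.7, main = "mtcars: clustering on standardised variables")

# Cut the tree into 3 groups; k = 3 is an arbitrary choice for illustration
groups <- cutree(hc, k = 3)
split(names(groups), groups)

# Heatmap drawn from the same standardised matrix
heatmap(z, scale = "none")
```

Because the inputs differ, the groups printed here may not match the rooms described above exactly.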
Even more interesting is the chart in Figure 4 below. Can anyone explain it? I will discuss it another time, or perhaps I should just switch careers and become a reporter for the tabloid “Otomotif”.
Still want to use Excel?
🙂
image
Figure 4 Analysis of mpg ~ disp (cc) ~ cyl
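
I cannot reproduce Figure 4 exactly, but a chart of the same three variables can be sketched in base R as follows. The colours, symbols, and the straight-line fit are my own assumptions rather than the settings used for the original figure.

```r
# Treat the number of cylinders as a category
cyl_f <- factor(mtcars$cyl)
grp   <- as.integer(cyl_f)   # 1, 2, 3 for 4, 6, 8 cylinders

# mpg against displacement, one colour and symbol per cylinder count
plot(mtcars$disp, mtcars$mpg, col = grp, pch = grp,
     xlab = "Displacement (cu. in.)", ylab = "Miles per (US) gallon")
legend("topright", legend = paste(levels(cyl_f), "cylinders"),
       col = seq_along(levels(cyl_f)), pch = seq_along(levels(cyl_f)))

# Average fuel economy per cylinder group
mean_mpg <- tapply(mtcars$mpg, cyl_f, mean)
round(mean_mpg, 1)

# Straight-line fit of mpg on displacement, added to the plot
fit <- lm(mpg ~ disp, data = mtcars)
abline(fit, lty = 2)
coef(fit)
```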

 

Links to learn data analysis in Excel for (geology) students

datacloud

(image from: homepages.spa.umn.edu)

Dear friends,

The following post originated from an email to two students under my supervision. They are both working on data analysis. Instead of pushing them to learn new software, R for instance, I planted some seeds for them to dig deeper into software they already know. Some students, or at least my students, have used spreadsheets for years but have not made the most of their potential.

Assalamu’alaikum wr wb
Dear AAA cc BBB
To continue our discussion on developing more statistical analysis for the project, here’s a list of my recommended links:
  1. http://annkemery.com/excel/: a website owned by Ann K Emery, a data consultant. If you have trouble opening it, you can go directly to her YouTube channel (just search for her name on YouTube). Her posts show that, even as a commercial consultant, she is generous in sharing her knowledge with the public.
  2. http://www.youtube.com/watch?v=5MFjwM6K5Sg: a descriptive statistical analysis using the “data analysis toolpak” add-in in Excel. You have to install the add-in first from the MS Office installer.
  3. http://www.youtube.com/watch?v=zEXK6M93lb8: correlation analysis using the “data analysis toolpak”.
Just give it a try, OK? I really hope other students can build their data analysis skills too. What really matters is diving into the analysis, not the software.

Several more things about data analysis

Outline

Part 1: Data in business

Author: Dasapta Erwin Irawan

1.1 Introduction

In this online era, we are surrounded by data. In the past, data was associated only with laboratory work, school projects, and the like. Now it is all around us. Data used to be just something we measured; now it is something we trade as goods. People are interested in anything that can be turned into an observation or a measurement. Some say data is the by-product of digital existence. People tend to analyse anything. They are even interested in the rise and fall of “Jennifer” as a girl’s name over time. Data is now, so to speak, everyday talk in coffee shops, or maybe in the wet market, where people talk about the rise and fall of cabbage prices.

1.2 Data in business: why is it so important

1.2.1 Forecasting: we need to know the future

What is all the excitement about data analysis? Forecasting is one reason. People always want to know what will happen in the future, taking existing conditions as a baseline, along with some assumptions and the chance of disruption along the way. Forecasting is a core part of business, so much so that a business proposal is likely to be thrown out if it contains no data-based forecast.

A time series is a collection of measurements of a well-defined item obtained through repeated measurement over time. For example, the value of retail sales measured each month of the year is a time series, because sales revenue is well defined and consistently measured at equally spaced intervals. Data collected irregularly or only once are not time series. A time series can be decomposed into three components: a trend component (long-term direction), a seasonal component (systematic, calendar-related movements), and an irregular component (unsystematic, short-term fluctuations). The decomposition matters in data analysis because what we see in a chart does not necessarily reflect what happens in real life. Take a chart of turkey sales. Sales are likely to be high around Thanksgiving, so our first guess would be that this is a cyclic phenomenon. But what if the peak shifts in a certain year? We would not expect the date of Thanksgiving itself to have shifted. What if there are more than two peaks in turkey sales in a year, and so on?
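
The three-component decomposition described above can be tried in base R. In the sketch below the built-in AirPassengers series (monthly airline passenger totals, 1949 to 1960) stands in for monthly retail sales, and taking logs is my own choice to keep the seasonal swing roughly constant.

```r
# AirPassengers: monthly airline passenger totals, built into R.
# It stands in here for the monthly retail sales series described in the text.
x <- log(AirPassengers)   # log scale so the seasonal swing is roughly constant

parts <- decompose(x)     # additive model: x = trend + seasonal + random
plot(parts)

# The three components add back up to the series (the ends of the trend are NA)
rebuilt <- parts$trend + parts$seasonal + parts$random
ok <- !is.na(rebuilt)
max(abs(rebuilt[ok] - x[ok]))
```

The "random" panel is the irregular component: anything the trend and the calendar pattern leave unexplained.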

Let’s think of data as something with its own behaviour. It may have an innate rhythm, or it may be erratic. It may be stiff and dull, insensitive to external influence, or flexible and very sensitive to outside parameters. Or it may simply behave erratically with no major controlling parameter at all. As the following picture shows, forecasting is part of a loop. It analyses performance and turns it into input for decision making. This loop drives the evolution of a business model or an organisation.

forecast

Fig 1 A business cycle involving forecast in the loop (from: University of Baltimore web page)

1.2.2 Decision making: reaction to every action

There is reaction to every action

I don’t mean that in the physics sense, and probably neither do most of you. But businesses, like organisations, are always changing. They evolve through the test of time. The Apple we see now is not the same as it was back in the 70s. Or, in contrast to that proprietary vendor, consider open source software. Take Linux, an open source operating system. First built as a personal project by a Finnish computer science student, Linus Torvalds, Linux is now a multi-billion dollar business. A technology that started out free for all has been turned into a profit-oriented product. The surprising news is that the free-for-all Linux is still making its way alongside its commercial side. This product can change the way people see free stuff. The operating system itself is still free, but the service of building and maintaining Linux systems is highly commercial. See, that’s evolution, and it involves data.

In the next article, we will talk about how to see correlation between parameters in a data set and how to build a model. In the last article, we will discuss a new profession called data scientist.

Part 2: Looking for correlation

We have discussed how data can change the shape of a business or an organisation. All the changes that might happen are based on data forecasts. Now we are going to talk about the second reason why data is important in business. Besides looking at a time series chart, people also need to know what correlations can be drawn from the data. Let’s use an example.

A simple correlation case was presented by Bryant and Smith in a 1995 paper entitled Practical Data Analysis: Case Studies in Business. They showed a data set of measurements taken by a single waiter on dining parties in a restaurant. The variables include total bill ($), tip ($), gender of the bill payer, day of the week, and the tip as a percentage of the total bill. They wanted to see which variable or variables had the strongest influence on total tips in a week. They also compared tips from male and female customers.

We can see in the chart, that total bill size and tip are positively associated (upper left scatter plot), but not as strongly as one might expect because there is increasing variability in tip as bill increases. Both tip and total bill have skewed distributions (upper left histograms), which might lead the analyst to consider log-transforming these variables.

Males spend more on average than females and bills are higher on the weekend (shown in the side-by-side box-plots). The 70% tip on a very small bill by a male on a Sunday may be an outlier. Much can be learned about tipping behaviour by studying this chart.

waiter

Fig 2 An example of correlation chart of multiple variables (from: National Library of Australia)

But we must take into account that correlation doesn’t always mean causation. In the case above, there is indeed an association between the customer’s gender and tipping behaviour. But what drives that behaviour has not been discussed. It generally lies beneath the numbers, and we have to dig it out.

Many studies are actually designed to test for correlation, not causation. In general, it is extremely difficult to establish causality between two correlated observations. On the other hand, there are many statistical tools for establishing a statistically significant correlation.

You would be surprised how often common-sense conclusions about cause and effect turn out to be wrong. A correlation can arise simply because two things frequently occur together. A correlation may also be observed when there is real causality behind it: it is well known, for example, that cigarette smoking not only correlates with lung cancer but actually causes it. The hard part is that, to establish cause, we have to rule out other possible explanations for the observed correlation, such as the possibility that smokers are more likely to live in urban areas, where there is more pollution.

So we can say that the search for causality can start from a series of correlations. But we have to add controlled variables to the analysis, as the smoking example shows. We have to set our assumptions and narrow down the potential governing variables. We call the result a model.
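
The effect of adding a controlled variable can be shown with a small simulation. The data below are synthetic and entirely made up for illustration: a hidden factor z drives both x and y, while x has no direct effect on y.

```r
set.seed(42)
n <- 1000

# Simulated data, not real observations: z drives both x and y,
# and x has no direct effect on y at all
z <- rnorm(n)
x <- z + rnorm(n, sd = 0.5)
y <- z + rnorm(n, sd = 0.5)

cor(x, y)                    # strong correlation despite no causal link

# Once z is included as a control, the coefficient on x shrinks towards zero
naive      <- lm(y ~ x)
controlled <- lm(y ~ x + z)
coef(naive)["x"]
coef(controlled)["x"]
```

In the smoking example, z would play the role of a factor such as urban pollution. With real data we rarely know in advance which variables need to be controlled.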

In the 3rd and 4th parts of this “Data Talks” article, we are going to talk about “Model” and “Data Scientist”.

Part 3: How to build a model

The model is the most basic element of the scientific method, and business is just as close to science as physics is. Probably without noticing, we have already talked about models in the previous “Forecast” and “Decision Making” parts. Both rely on mathematical models in the form of equations. You may know linear regression (see the following figure) or remember learning it in algebra. It is just one model among many that does the actual forecasting for us.

linear

Fig 3 An example of linear regression model (from: National Library of Australia)
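
For readers who want to try a linear regression themselves, here is a minimal sketch in R. It uses the built-in cars data set (speed and stopping distance of 50 cars) as a stand-in, since the data behind Fig 3 are not available here.

```r
# cars: speed (mph) and stopping distance (ft) of 50 cars, built into R
fit <- lm(dist ~ speed, data = cars)
coef(fit)                      # intercept and slope of the fitted line

plot(dist ~ speed, data = cars,
     xlab = "Speed (mph)", ylab = "Stopping distance (ft)")
abline(fit)

# Forecast for a speed of 21 mph
predict(fit, newdata = data.frame(speed = 21))
```

The slope is the forecast change in stopping distance for each extra mile per hour, which is exactly the kind of statement a business forecast makes.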

We were also talking about a model when we looked at the business loop diagram in the previous article, and we buy models when we get our children Hot Wheels or Barbie. Even a recipe is a model. So we could say that a model is a simplification of what we are actually studying or trying to predict.

This is how we build a model:

  1. Data gathering. We talked about this in the forecast article. The data can be a long time series, such as a rainfall data set, or come from a questionnaire.
  2. Setting the assumptions. Most models only work in a controlled environment, so we have to set boundaries. The more boundaries we set, the narrower our model becomes. How many boundaries should we have? Answering this can be an iterative process involving steps 3 and 4.
  3. Model fitting. This is the fun part. We can use major proprietary software such as Stata, SPSS, and SAS, or choose a free one such as R. These packages contain many model equations that we can pick from and test later on.
  4. Model calibration. This part is also largely automated by software. Basically, we apply our chosen equation to new data. If the results behave the same way as they did for the data we used in step 3, then we can say our model actually works. If not, we have to go back to step 3 or even step 2.
  5. Model application. This is the phase we like the most. But over time we have to re-evaluate our model against the current situation.
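
The five steps above can be sketched in a few lines of R. This is only an illustration under my own assumptions: mtcars stands in for a real project data set, the model is a straight line of mpg on weight, and the split of 22 rows for fitting and 10 for checking is arbitrary.

```r
set.seed(1)

# Step 1: data. mtcars stands in for a real project data set.
idx   <- sample(nrow(mtcars), size = 22)   # about two thirds for fitting
train <- mtcars[idx, ]
test  <- mtcars[-idx, ]

# Steps 2-3: assume a linear relation and fit it on the training rows
fit <- lm(mpg ~ wt, data = train)

# Step 4: apply the fitted equation to rows the model has not seen
pred <- predict(fit, newdata = test)
rmse <- sqrt(mean((test$mpg - pred)^2))
rmse

# Step 5: if the error is acceptable, use the model for new cases
predict(fit, newdata = data.frame(wt = 3))
```

If the error on the unseen rows is much larger than on the training rows, that is the signal to go back to step 3 or step 2.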

Another thing to bear in mind is the Law of Simplicity. The simplest model has a better chance of being accepted in a business environment. Top executives will probably care less for a model with 11 variables. A model with two or three variables is what a data scientist, previously known as a data analyst, will frequently choose. In the 4th part of this “Data Talks” article, we are going to talk about the “Data Scientist”, a blossoming new career for mathematicians, statisticians, and computer scientists.
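
One way to weigh a simple model against a complex one is to compare measures that penalise extra variables. The sketch below does this on mtcars, with a two-predictor model of my own choosing (weight and horsepower) against a model using every available predictor.

```r
small <- lm(mpg ~ wt + hp, data = mtcars)   # two predictors
full  <- lm(mpg ~ ., data = mtcars)         # all ten predictors

# Adjusted R-squared penalises extra variables; AIC does too (lower is better)
c(small = summary(small)$adj.r.squared, full = summary(full)$adj.r.squared)
AIC(small, full)
```

If the small model scores about as well as the full one, it is usually the easier one to explain in a meeting.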

Part 4: Data scientist

It was not until about five years ago that people came up with a new kind of profession, the “data scientist”. The data scientist represents an evolution of the business or data analyst role, typically with a solid grounding in computer science and applications, modelling, statistics, analytics, and maths. We are talking about a powerful career: one that can predict the future, talk about it, and persuade others. Good data scientists will not just address business problems; they will pick the right problems, those with the most value to the organisation.

datasci

Fig 4 A profile of a data scientist (from: Emc^2 web site)

The work of a data scientist would more or less cover the following aspects (extracted from a Coursera forum):

  • Formulate context-relevant questions and hypotheses to drive data scientific research
  • Identify, obtain, and transform a data set to make it suitable for the production of statistical evidence communicated in written form
  • Build models based on new data types, experimental design, and statistical inference

Aside from proficiency in computer science, maths, and statistics, a good data scientist must have curiosity, creativity, focus, and attention to detail.

Data scientists are needed wherever data is involved in an operation. Companies that hire data scientists include:

  • Construction companies
  • Utility companies
  • Oil, gas and mining companies
  • Hospitals and health care organisations
  • Colleges and universities
  • Federal, provincial/state and municipal government departments
  • Transportation companies
  • Telecommunications companies
  • Insurance, finance and banking organisations
  • Management consulting companies
  • Manufacturing companies

As we conclude our talk on data, it’s clear that

Numbers are not just numbers

They can speak

And it’s up to us to listen

 

 

Blogpost arrangement

UWS lake

[image from: personal collection, a path way in UWS MacArthur Campus]

Dear friends,

It’s another 10-degree morning in Sydney. I know that is fairly warm for most of you, but for an Indonesian it takes some effort to keep my fingers hitting the right keys.

After thinking about the position of my blogs, I will:

  1. gradually dedicate this blog, my oldest, to Linux and other tech and internet topics
  2. focus my posts on R and data analysis on my Blogger site
  3. gradually move my posts about hydrology and research/teaching to my other, new WordPress blog, My Online Water Books. I try to include a PDF version with every post.

I’ll post all of my updates on my Twitter (@dasaptaerwin) and Google Plus (+Dasapta Erwin Irawan).

Thank you. All the best to you all.