Big data is the new trend in computer science; there is even a term for people who do big data – “data scientist”. To find out what this big data is all about, I am now taking Introduction to Data Science, offered by the University of Washington and taught by Bill Howe.
Data science is a new field, and as with any new field, people like me may feel skeptical about it. In the introduction, Bill Howe admits that data science is still young and that even data scientists do not fully understand it. Some people even compare the data scientist to the “webmaster” of the early days when the web was still new: jack of all trades, master of none.
First, there is a growing need for data science because cheaper sensors allow us to collect far more data than before. For example, the SDSS at the Apache Point Observatory collected 80 TB of raw image data over a period of 7 years, while the Large Synoptic Survey Telescope, built this year, collects 40 TB of data per day; that is a huge amount of data. Another example: an Illumina HiSeq 2000 sequencer produces 1 TB of data per day from genome sequencing, and each major lab has 25–100 of these machines. The web is also a big place, with 80+ billion web pages totalling 400+ TB.
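To put those telescope figures side by side, here is a quick back-of-the-envelope calculation of the two daily data rates, using only the numbers quoted above:

```python
# Compare daily data rates of SDSS vs. LSST (figures from the lecture).
sdss_total_tb = 80        # SDSS: 80 TB of raw images collected in total...
sdss_days = 7 * 365       # ...over roughly 7 years
lsst_tb_per_day = 40      # LSST: 40 TB every single day

sdss_tb_per_day = sdss_total_tb / sdss_days
ratio = lsst_tb_per_day / sdss_tb_per_day

print(f"SDSS: {sdss_tb_per_day:.3f} TB/day")   # about 0.031 TB (~31 GB) per day
print(f"LSST: {lsst_tb_per_day} TB/day, roughly {ratio:.0f}x the SDSS rate")
```

In other words, LSST collects in a single day well over a thousand times what SDSS gathered per day, which is exactly the kind of jump that motivates the field.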
There are three challenges in big data, namely volume, velocity, and variety. Volume is the sheer size of the data, velocity is the speed at which data arrives and must be processed, and variety is the range of different sources and formats the data comes in. The lecture also highlights the role of big data in business: companies are hiring data scientists and making judgements based on analysis of data.
A data scientist's work consists of three tasks: preparing the data, running the model, and communicating the results. 80% of the time, the data scientist will be preparing the data, that is gathering, cleaning, integrating, restructuring, transforming, loading, filtering, deleting, combining, merging, verifying, extracting, shaping, and massaging the data before analysing it.
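As a small illustration of what that 80% looks like in practice, here is a minimal data-cleaning sketch in pandas. The dataset and column names (`sensor_id`, `reading`) are hypothetical, invented just to show a few of the steps listed above (deleting duplicates, filtering bad rows, transforming types):

```python
import pandas as pd

# Hypothetical raw sensor readings: duplicate rows, a missing id,
# and a value that is not a number -- typical pre-analysis mess.
raw = pd.DataFrame({
    "sensor_id": ["a1", "a1", "a2", "a3", None],
    "reading":   ["3.2", "3.2", "not_a_number", "7.9", "5.0"],
})

clean = (
    raw.drop_duplicates()                 # merging: collapse repeated rows
       .dropna(subset=["sensor_id"])      # deleting: drop rows with no id
       .assign(reading=lambda df:         # transforming: coerce text to numbers
               pd.to_numeric(df["reading"], errors="coerce"))
       .dropna(subset=["reading"])        # filtering: drop unparseable readings
       .reset_index(drop=True)
)

print(clean)
```

Even this toy example shrinks five raw rows to two usable ones, which gives a feel for why data preparation dominates the job.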