Simply put, a data scientist is someone who makes new discoveries, much as any scientist does. Scientists form a hypothesis and then investigate it; a data scientist does this with data, looking for meaning and insight in the data at hand. A data scientist derives this knowledge primarily by:
- Visualizing the data: Data scientists explore the data, create reports, and look for patterns in it. This sounds very similar to the work of a traditional business analyst or data analyst, but it is an important part of a data scientist's work. It is comparable to a physicist's lab experiments, which can also be performed by lab assistants.
- Using advanced algorithms: What really distinguishes a data scientist from a data analyst is the use of advanced algorithms that run through large data sets to derive meaningful results. Examples include machine learning algorithms, neural networks, and many others. To run these algorithms, data scientists must have a strong foundation in mathematics and statistics and, in some cases, computer science and domain knowledge.
A data scientist's work usually revolves around answering a pressing question based on the available data. For example: how many AT&T customers are going to churn (go to a competitor) in the next 3 months? Similarly, Netflix offered $1 million to anyone who could improve the accuracy of its movie recommendations by 10% (http://en.wikipedia.org/wiki/Netflix_Prize). In short, data scientists answer important questions by applying algorithms to available data.
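To make the churn question concrete, here is a minimal sketch of how such an estimate might be produced. It is not how AT&T or any real company does it. The two features (support calls and years as a customer), the toy history, and the function names `train_logistic` and `churn_probability` are all assumptions made up for illustration. The technique shown is plain logistic regression fitted by gradient descent.

```python
import math


def sigmoid(z):
    """Map any real number to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(rows, labels, lr=0.1, epochs=2000):
    """Fit logistic regression weights (bias first) by batch gradient descent."""
    n_features = len(rows[0])
    weights = [0.0] * (n_features + 1)
    for _ in range(epochs):
        grad = [0.0] * (n_features + 1)
        for x, y in zip(rows, labels):
            p = sigmoid(weights[0] + sum(w * v for w, v in zip(weights[1:], x)))
            err = p - y  # gradient of the log loss with respect to the score
            grad[0] += err
            for j, v in enumerate(x):
                grad[j + 1] += err * v
        weights = [w - lr * g / len(rows) for w, g in zip(weights, grad)]
    return weights


def churn_probability(weights, x):
    """Predicted probability that a customer with features x will churn."""
    return sigmoid(weights[0] + sum(w * v for w, v in zip(weights[1:], x)))


# Made-up toy history (hypothetical, not real customer data):
# (support_calls_last_quarter, years_as_customer) and whether the customer churned.
history = [(5, 1), (4, 1), (6, 2), (0, 5), (1, 4), (0, 6), (3, 1), (1, 5)]
churned = [1, 1, 1, 0, 0, 0, 1, 0]

weights = train_logistic(history, churned)

# Summing the individual probabilities gives the expected number of churners.
current_customers = [(5, 1), (0, 6), (2, 3)]
expected_churners = sum(churn_probability(weights, c) for c in current_customers)
print(f"Expected churners among {len(current_customers)} customers: {expected_churners:.2f}")
```

Summing per-customer probabilities, rather than counting customers above a cutoff, answers the "how many" question directly. A real project would use far more data, more features, and a held-out set to check the model.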
When you have large data sets, you need multiple algorithms to deal with the diversity in the data. Hence, a data scientist must be aware of various algorithms and how they are implemented. There are many myths about who is a data scientist, so let’s first clear up who is not a data scientist:
- A data scientist is not a programmer who knows Hadoop: Many people call themselves data scientists because they have certain technical skills. Beyond basic technical skills, a data scientist needs to understand various algorithms, such as machine learning and neural networks, to derive meaningful results from the data.
- A data scientist is not a business analyst: Business analysts create reports based on what they think is important in the data, drawing on the domain knowledge they have acquired over the years. A data scientist, by contrast, forms a hypothesis about what is important in the data and then runs various algorithms to confirm or reject that hypothesis.
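As one illustration of checking a hypothesis against data rather than relying on intuition, the sketch below uses a permutation test. This is only one of many possible techniques, chosen here because it needs no statistics library. The scenario (monthly spend with and without a discount), the numbers, and the function name `permutation_p_value` are hypothetical.

```python
import random


def mean(xs):
    return sum(xs) / len(xs)


def permutation_p_value(group_a, group_b, n_shuffles=5000, seed=0):
    """Estimate how often a random relabelling of the pooled data produces
    a difference in means at least as large as the observed one."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(pooled)
        a = pooled[:len(group_a)]
        b = pooled[len(group_a):]
        if abs(mean(a) - mean(b)) >= observed:
            hits += 1
    return hits / n_shuffles


# Hypothesis: customers who received a discount spend more per month.
# Made-up monthly spend figures, for illustration only.
with_discount = [52, 55, 58, 60, 61, 63]
without_discount = [40, 42, 44, 45, 47, 49]

p_value = permutation_p_value(with_discount, without_discount)
print(f"Estimated p-value: {p_value:.4f}")
```

A small p-value means random relabelling rarely reproduces the observed gap, which supports the hypothesis; a large one means the data does not support it.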
Hence, a data scientist is someone who combines basic programming skills with business knowledge and a solid grounding in algorithms, mathematics, and statistics. Given such steep requirements, it is not surprising that data scientists are few in number and in great demand.
So, what is the life cycle of a data scientist's work? The first stage is capturing the data: getting all the data you need before you run algorithms on it to derive meaningful results. It is estimated that 70 to 80% of a data scientist's time is consumed in assembling the data, for example by writing SQL statements or text mining. This is largely a waste of a data scientist's time, as these tasks can be performed by a data integration specialist. After data integration, the data scientist runs various algorithms to derive meaningful results from the data.
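The data assembly stage often looks something like the sketch below, which uses Python's built-in `sqlite3` module to join two tables into one row per customer before any algorithm is run. The table names, columns, and rows are hypothetical and exist only to show the shape of the task.

```python
import sqlite3

# An in-memory database standing in for separate source systems.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, plan TEXT);
    CREATE TABLE support_calls (customer_id INTEGER, call_date TEXT);
    INSERT INTO customers VALUES (1, 'prepaid'), (2, 'contract'), (3, 'contract');
    INSERT INTO support_calls VALUES
        (1, '2014-01-05'), (1, '2014-02-11'), (3, '2014-01-20');
""")

# Assemble one row per customer: plan type plus number of support calls.
# LEFT JOIN keeps customers who never called, with a count of 0.
rows = conn.execute("""
    SELECT c.customer_id, c.plan, COUNT(s.customer_id) AS n_calls
    FROM customers AS c
    LEFT JOIN support_calls AS s ON s.customer_id = c.customer_id
    GROUP BY c.customer_id, c.plan
    ORDER BY c.customer_id
""").fetchall()

for customer_id, plan, n_calls in rows:
    print(customer_id, plan, n_calls)

conn.close()
```

The result is a single table with one row per customer, which is the form most algorithms expect as input. In practice the sources are larger, messier, and spread across systems, which is why this stage takes so much time.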