Data science in industry is booming, and as a result, there has been an explosion of available roles with overlapping skill sets. If you are coming across role titles like “data scientist,” “research scientist,” and “machine learning engineer” and are unable to discern the difference, this post will help clear up the confusion. I will go through the most common terms and describe the similarities and key differences to help you in your job search or for general understanding.
Table of contents
- What is a Scientist?
- The Difference Between a Research Scientist and a Data Scientist
- The Difference Between a Data Engineer and a Data Scientist
- The Difference Between a Machine Learning Engineer and a Data Engineer
- Communication is Key for All Roles
- The Size of the Company Impacts the Role Definition
- The Accessibility of Tools Has Shifted the Requirements for Data Science
- Concluding Remarks
- Roles with “Scientist” in the title have a focus on using statistical data analysis and machine learning to extract insight from data relevant to business solutions.
- Data scientists, in particular, use insights to build new products or inform current business solutions.
- Research scientists are focused on the research outcome: defining hypotheses and proving or disproving them through experimentation.
- The outcome of research is not necessarily a business solution.
- Roles with “Engineer” in the title are focused on building the infrastructure and tools to deliver end-to-end solutions to business problems.
- Research scientists are growing in demand.
- Data scientists can be specialized by their analytical, algorithmic, and inference focus.
- There are exciting opportunities for academics who want to transition to industry but continue to do research.
What is a Scientist?
The definition of a scientist is someone who systematically gathers and uses research and evidence to define hypotheses and test them, furthering understanding and knowledge. Further definitions of scientists can be made using three questions:
- How are they going about their process?
- What are they seeking understanding of?
- Where are they applying their science?
Therefore, in the case of a data scientist:
- How: Data scientists use data wrangling, statistical data analysis, and machine learning to gain insight.
- What: Patterns in typically large amounts of data for business needs.
- Where: Data scientists are industry agnostic, as the focus is on the data, not the source.
Data scientists are focused on gathering and organizing typically large amounts of data to optimize processing, deliver automation, and solve strategic problems in business. There is an emphasis on visualization and presentation as most people will not readily understand why the information extracted from data is important until it is translated into a simpler visual format.
Data scientists have some commonalities with data analysts, but they are not the same thing. The main differentiator is the level of technical expertise the two professions require. Data scientists are more specialized and need coding skills, machine learning knowledge, predictive modeling, and analytics. Data analysts find answers to a given set of questions from data but are not focused on creating hypotheses or addressing the “what-ifs?”
The Difference Between a Research Scientist and a Data Scientist
Research scientists have the skillset of data scientists. However, the main differentiator between the two professions is the outcome of their work. The definition of research is the systematic investigation into sources to establish facts and reach new conclusions. Therefore, research scientists are focused on proving or disproving hypotheses through experimentation, without necessarily materializing a business solution. A research scientist will typically go through framing experiments, defining hypotheses, collecting and analyzing data, and interpreting the results.
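To make the idea of proving or disproving a hypothesis through experimentation a little more concrete, here is a minimal sketch of one common approach, a permutation test for a difference in means. The function name, the sample measurements, and the scenario are all hypothetical and only for illustration; a real experiment would need a test chosen to suit its design.

```python
import random
from statistics import mean


def permutation_test(control, treatment, n_permutations=5000, seed=0):
    """Two-sided permutation test for a difference in group means.

    Null hypothesis: the group labels do not matter, so shuffling
    them should produce differences as large as the observed one
    reasonably often.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    observed = abs(mean(treatment) - mean(control))
    pooled = list(control) + list(treatment)
    n_control = len(control)

    at_least_as_extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[n_control:]) - mean(pooled[:n_control]))
        if diff >= observed:
            at_least_as_extreme += 1

    # Add-one correction so the estimated p-value is never exactly zero.
    return (at_least_as_extreme + 1) / (n_permutations + 1)


# Hypothetical measurements from a control and a treatment group.
control = [12.1, 11.8, 12.4, 12.0, 11.9, 12.3]
treatment = [12.9, 13.1, 12.7, 13.0, 12.8, 13.2]

p_value = permutation_test(control, treatment)
print(f"estimated p-value: {p_value:.4f}")
```

A small p-value suggests the observed difference would be unlikely if the labels were interchangeable. Interpreting that result, and deciding what it means for the original hypothesis, is still the scientist's job.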
Research scientists tend to be focused more on innovations in machine learning. Data scientists will have more contact with a business solution and thus may have to build part of or entire products and pipelines. Research scientists will rarely cross over to product design. However, they may be involved in prototyping innovations to hand over to software engineers.
Research scientists are more independent of the business outcome than traditional data scientists. Research scientists materialize results through documentation both internally within the company and in the form of research publications.
Research publications in peer-reviewed journals and proceedings for conferences are important ways of demonstrating contributions to the frontier of the given field, whether that is through introducing a discovery, surveying and comparing existing research or building on existing research by adding a novel perspective or methodology.
Achieving publications is one of the chief motivators of research scientists and serves as a powerful marketing tool for companies that hold a high standard of scientific rigor in the algorithms inside their products.
What is an Applied Scientist?
An applied scientist is a hybrid of a research scientist and a software engineer. Research is integral to the applied scientist role, but the focus is on implementing the knowledge gained in solutions at scale. A research scientist is focused solely on scientific discovery, whereas an applied scientist is focused on making the discovery and applying it. An example of applied science could be enhancing the online customer experience by using a virtual assistant with natural language understanding algorithms.
The Difference Between a Data Engineer and a Data Scientist
A data engineer focuses on the development, construction, testing, and maintenance of architectures for data. A data engineer does not need to provide insight to inform business solutions nor define hypotheses to prove. Instead, a data engineer will take the answers produced by either data scientists or research scientists and construct the end-to-end solution to meet the business requirements. There is an emphasis on building scalable, high-performance infrastructure that handles each stage of the solution pipeline: collecting, managing, analyzing, and visualizing data. Often there is a requirement for batched or real-time analysis to process streaming data.
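The difference between batched and real-time analysis can be sketched in a few lines. This is only an illustration under simple assumptions: the function names and the sensor readings are hypothetical, and production pipelines typically rely on dedicated batch and streaming frameworks rather than plain Python.

```python
from typing import Iterable, Iterator


def batch_mean(readings: list[float]) -> float:
    """Batch analysis: all the data is available before we compute."""
    return sum(readings) / len(readings)


def streaming_mean(readings: Iterable[float]) -> Iterator[float]:
    """Streaming analysis: update the result as each value arrives.

    Only a running count and total are kept, so memory use stays
    constant no matter how long the stream runs.
    """
    count = 0
    total = 0.0
    for value in readings:
        count += 1
        total += value
        yield total / count


# Hypothetical sensor readings.
readings = [2.0, 4.0, 6.0]

print(batch_mean(readings))            # one answer once the batch is complete
print(list(streaming_mean(readings)))  # an updated answer after every reading
```

The batch version is simpler but has to wait for the full dataset, while the streaming version gives an up-to-date answer at every step. That trade-off is the kind of decision a data engineer makes when designing the infrastructure.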
Data scientists are in constant interaction with infrastructure built and maintained by the data engineers but are not necessarily responsible for building and maintaining the infrastructure. Building the best solution requires both data scientists and data engineers to complement each other’s skills. For more information on how companies can effectively build teams to leverage data-driven insights see this article: Data Science is not an island – ByteSumo Limited.
The Difference Between a Machine Learning Engineer and a Data Engineer
Machine learning engineers (ML engineers) are a hybrid of data scientists and software engineers. ML engineers are focused on developing algorithms for machines to make automated decisions. A system that relies on an ML engineer could be a self-driving car or a tailored newsfeed that uses keyword scraping.
ML engineers will need to build the end-to-end solution to leverage machine learning and also have intimate knowledge of the algorithms and underlying concepts of machine learning. They will typically feed data into models defined by data scientists or by themselves, and scale the models to production level to handle data-intensive problems. Even though an ML engineer requires an understanding of machine learning and uses it, they do not define hypotheses or frame experiments like data scientists or research scientists.
Communication is Key for All Roles
A critical aspect that unites all of the roles mentioned above is the necessity for communication. The concepts handled require a significant level of technical expertise; therefore, what may appear obvious to a data scientist might be lost on a product delivery consultant, for example. New ideas must flow from the minds of the researchers to the engineers and then to the business to reap the benefits of innovations in machine learning.
Research scientists face a significant communication hurdle because, within a company, they are often insulated from the engineering teams. Research scientists are often seen as the “Men In Black” division working on classified, highly specialized projects that will never see the light of day. A shroud of mystique often surrounds deep learning applications developed within the science teams. While it can be tempting to celebrate magic black boxes that “solve all problems”, that mystique can evaporate when some unforeseen challenge stops the black box from working as expected.
To get around this issue, the focus of communication across teams should always center on “the why?” If the messaging strays too far from the big picture and gets lost in technical knowledge, non-experts will switch off or become muddled. If technical knowledge is a must in the message, there should be an immediate framing of that information. The “why?” will be centered on the product or service that the other teams are working on, which means it is still a scientist’s task to know the domain of the company, the driving factors that make a product or service useful, and the potential improvements to make.
Scientists should also seek to frame a presentation from the perspective of an engineer. Going back to what an engineer does, they will think in terms of “how can this be programmatically realized?” or “how will this solution be scalable to our needs?” If a scientist can frame at least a part of the communication from that angle, more fruitful dialogue can occur, and ideas can be put into action faster.
The engineer will need to be able to build the solution and demonstrate why the data insights from the scientists are valid within the business solution. There is less of a requirement to explain why it works in technical detail. The engineer will focus on “How does it work?”, “How does it scale?”, and “How well does it work in comparison to other solutions?”
Communicating research to an outside audience is one of the biggest challenges a scientist will face. It is one of the reasons why outreach is actively promoted in universities for STEM fields, to build a conversation between the expert and non-expert and allow ideas to be shared. If you are currently in a technical role and can develop a robust discussion within your company, it will improve your thoughts, bolster your confidence and allow your research to be more integral to the company vision.
The Size of the Company Impacts the Role Definition
With larger companies comes the ability to specialize roles. Companies that are smaller or younger will tend to have positions that involve handling the requirements of both scientists and engineers. These positions arise for several reasons, for example:
- The data pipeline has not been fully established or is still in prototype format.
- There are not enough people to split tasks according to strict role definitions.
Large companies such as Amazon or Google will have dedicated science teams because (a) they have teams of engineers dedicated to each part of their solutions and (b) they are committed to innovation in machine learning and have the resources to explore. Data scientists in large companies will generally be more focused on the analytical side of the product, whereas in smaller companies, data scientists will tend to be more similar to ML engineers.
The Accessibility of Tools Has Shifted the Requirements for Data Science
As data science becomes more integral to business solutions, the number of tools available to simplify and automate the data science process has increased. With platforms like h2o, machine learning and data science have become far more accessible. State-of-the-art models can be accessed and used with a few lines of code in a Jupyter notebook. As a result, there may be some roles that are more focused on the statistical data analysis side of data science and others that are focused on the algorithmic development and solution building side. This diversification of the data scientist role is crucial to bear in mind when looking for a position that caters to your strengths.
Concluding Remarks
Setting out precise definitions of science, research, and engineering makes the types of available data science roles easier to tell apart. There are commonly used definitions for data scientists, research scientists, machine learning engineers, and applied scientists. However, several factors will change the description of the role from company to company.
If you are looking into roles, it is imperative that you read the requirements and tasks. There are instances of companies luring candidates with the promise of data science while actually offering data analyst roles. If you are seeking experimentation using statistics and machine learning, you may not feel satisfied in such positions. Similarly, if you are focused more on research and analytics and not on end-to-end solution building, it is crucial to be able to differentiate between an engineering role and a scientist role.
Bear in mind that the company size can indicate the variability of your role and the amount of overlap you will have between science and engineering. So if you are vehemently opposed to any engineering, aim for larger companies that have well-defined science teams. Company size is not the defining factor in all cases; you can only truly know what the role involves by communicating with the company. You can investigate by reading the role description and speaking to current employees.
Data science is not a monolithic profession and can separate into further specializations: analytical, algorithmic, and inference. Evaluating your strengths will better inform what data scientist role suits you, should you aim for that career path. See Elena Tej Grewal’s post – One Data Science Job Doesn’t Fit All for more details on the types of data scientists that exist.
If you are an academic and want to pursue scientific discovery in industry, the “research scientist” role might be the one for you. Companies increasingly realize the value of research and its application in business, so there are more opportunities than ever before. In future blog posts, I will go into more detail about the research scientist role and tips for finding the best position for you.
Thank you for reading. I hope this post has shed some light on the difference between the various roles that crop up in the world of data science and machine learning. If you enjoyed this post, sign up to the Research Scientist Pod mailing list to stay up to date on the latest blog posts. Be sure to share the post and post your comments below if you have any questions or ideas you want to discuss further.
Suf is a research scientist at Moogsoft, specializing in Natural Language Processing and Complex Networks. Previously he was a Postdoctoral Research Fellow in Data Science working on adaptations of cutting-edge physics analysis techniques to data-intensive problems in industry. In another life, he was an experimental particle physicist working on the ATLAS Experiment of the Large Hadron Collider. His passion is to share his experience as an academic moving into industry while continuing to pursue research.