32 Data Science Terms Every Data Scientist Should Know

This glossary is designed to simplify the terminology of data science.

Below are the data science terms you should be familiar with:

Data Science Terms / Glossary

Data Point: A data point is a single measurement or observation, such as an individual’s height or age.

Machine Learning Algorithm: A machine learning algorithm is a procedure that learns patterns from example data rather than following hand-written rules, and then uses what it has learned to produce an output for new input. One example is facial recognition software, which learns to identify people from features of their faces. A fixed SQL query that retrieves statistics on your website, by contrast, is not machine learning, because it does not learn from data.

Model: A model is a simplified representation of how the variables in a dataset are presumed to relate to one another, often expressed as a formula or graph. In machine learning, the term also refers to the trained result that produces predictions from new data.
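
To make the idea of a model as a formula concrete, here is a minimal sketch that fits a straight line to a handful of points and then uses it to predict. The height and weight figures are made up for illustration, and the names fit_line and predict_weight are hypothetical helpers, not part of any library.

```python
# Hypothetical data: heights in cm and weights in kg for five people.
heights = [150.0, 160.0, 170.0, 180.0, 190.0]
weights = [50.0, 58.0, 66.0, 74.0, 82.0]


def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


# The "model" here is just the formula: weight = slope * height + intercept.
slope, intercept = fit_line(heights, weights)


def predict_weight(height_cm):
    """Use the fitted formula to predict a weight for a new height."""
    return slope * height_cm + intercept


print(predict_weight(175.0))
```

Real projects would normally use a library for this, but the principle is the same: the model is the fitted relationship, and a prediction is that relationship applied to new input.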

Dimension: A dimension is one of the variables used to describe a data point. For example, if a data point were a person, their height and weight would be two dimensions. A dataset with a large number of dimensions is said to occupy a ‘high-dimensional space’ and can require complex models.

High-Dimensional Space: A high-dimensional space is one with many dimensions, typically far more than the three we can visualise. Models often become harder to fit accurately in high-dimensional spaces because the data becomes sparse relative to the size of the space, a problem commonly called the ‘curse of dimensionality’.

Data Scientist: A data scientist is a practitioner who uses statistics and machine learning to analyze large quantities of data, and who is often involved in building machine learning models.

Data Analytics: Data analytics is the process of examining datasets for patterns and trends, usually from a business point of view.

Business Intelligence: Business intelligence (BI) is a discipline that focuses on using data to create business value. This includes tools such as BI dashboards, reporting and data presentation.

Machine Learning: Machine learning is the process of using algorithms to build predictive models from input data in order to achieve a particular goal. Examples include the speech recognition behind Siri and the translation behind Google Translate.

Predictive Analysis: Predictive analysis is the use of a model to predict future results based on historical data.

Data Visualisation: Data visualisation is the process of transforming numbers and numerical data into a visual format that is easier to understand, such as a graph or pie chart.

Artificial Intelligence (AI): AI refers to computer programmes which can complete tasks that normally require human intelligence, such as understanding speech. Apple’s Siri, for example, is an AI-based assistant.

Natural Language Processing (NLP): A branch of artificial intelligence, NLP is concerned with enabling computers to process and understand human language. Examples include converting speech to text, translating between languages (as Google Translate does) and summarizing a number of paragraphs into a few sentences.

Sentiment Analysis: Sentiment analysis is the process of identifying the emotional tone of a piece of text, such as positive, negative or neutral. It is often applied to social media. For example, if someone wrote “I hate my job” on Twitter, sentiment analysis would classify the post as negative.
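
As a rough illustration, the sketch below scores a sentence by counting words from two tiny hand-made word lists. This is only one simple (lexicon-based) approach, the word lists and the function name classify_sentiment are invented for this example, and production systems usually rely on much larger lexicons or trained models.

```python
# Tiny hand-made word lists, invented for this example only.
POSITIVE_WORDS = {"love", "great", "happy", "enjoy"}
NEGATIVE_WORDS = {"hate", "awful", "sad", "terrible"}


def classify_sentiment(text):
    """Label text as positive, negative or neutral by counting known words."""
    # Lower-case each word and strip surrounding punctuation before matching.
    words = [word.strip(".,!?\"'").lower() for word in text.split()]
    positives = sum(word in POSITIVE_WORDS for word in words)
    negatives = sum(word in NEGATIVE_WORDS for word in words)
    score = positives - negatives
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


print(classify_sentiment("I hate my job"))
```

A word-counting approach like this cannot handle sarcasm or negation ("not great"), which is one reason real tools are more sophisticated.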

Unstructured Data: Unstructured data is information that doesn’t fit into a pre-existing data model. It’s often text-heavy and hard to analyze because there is no ‘structure’ to follow. Unstructured data can be found in sources such as social media, web pages or a PDF file a client has sent you.

Structured Data: Structured data, on the other hand, is information that does fit into a pre-existing data model. A spreadsheet would be an example of structured data because each value sits in a defined row and column and follows a set structure.

SQL: SQL stands for Structured Query Language and is a language used to query and manage relational databases. SQL can perform tasks such as finding specific information or updating the database.
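
Here is a small sketch of both tasks, finding information and updating it, using the sqlite3 module from Python’s standard library with a throwaway in-memory database. The customers table and its contents are made up for illustration.

```python
import sqlite3

# An in-memory database that disappears when the connection is closed.
connection = sqlite3.connect(":memory:")
connection.execute(
    "CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, city TEXT)"
)
connection.executemany(
    "INSERT INTO customers (name, city) VALUES (?, ?)",
    [("Asha", "Pune"), ("Ben", "Leeds"), ("Chen", "Pune")],
)

# Finding specific information: which customers live in Pune?
rows = connection.execute(
    "SELECT name FROM customers WHERE city = ? ORDER BY name", ("Pune",)
)
pune_customers = [row[0] for row in rows]

# Updating the database: Ben has moved to York.
connection.execute("UPDATE customers SET city = ? WHERE name = ?", ("York", "Ben"))
ben_city = connection.execute(
    "SELECT city FROM customers WHERE name = ?", ("Ben",)
).fetchone()[0]

connection.close()
print(pune_customers, ben_city)
```

The question marks are placeholders that keep values separate from the SQL text, which is the usual way to avoid injection problems.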

Hadoop: Apache Hadoop is a free and open-source software framework that allows for distributed processing of large datasets across clusters of computers. It’s mostly used in big data applications such as those which analyze social media trends, company financial reports and scientific research papers.

MapReduce: MapReduce is a programming model for processing large datasets in parallel. A ‘map’ step turns each input record into key-value pairs, and a ‘reduce’ step combines all the values that share a key into a summary result. For example, it’s used in large-scale web analytics to summarize how many page views were received from each country.

NoSQL: NoSQL stands for ‘Not Only SQL’ and is a type of database that doesn’t follow the traditional relational model for storing data. NoSQL databases can handle large volumes of structured, semi-structured and unstructured data.
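
The page-view example can be sketched on a single machine as below. This only imitates the shape of the model; a real framework such as Hadoop would spread the map and reduce work across many computers. The log records and the function names map_view, shuffle and reduce_counts are hypothetical.

```python
from collections import defaultdict

# Hypothetical page-view log: one record per page view.
page_views = [
    {"url": "/home", "country": "IN"},
    {"url": "/pricing", "country": "US"},
    {"url": "/home", "country": "IN"},
    {"url": "/blog", "country": "GB"},
    {"url": "/home", "country": "US"},
    {"url": "/blog", "country": "IN"},
]


def map_view(record):
    """Map step: emit a (country, 1) pair for a single page view."""
    return (record["country"], 1)


def shuffle(pairs):
    """Group emitted values by key, as a framework would between the two steps."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(country, counts):
    """Reduce step: combine all values for one key into a single total."""
    return (country, sum(counts))


mapped = [map_view(record) for record in page_views]
grouped = shuffle(mapped)
views_per_country = dict(
    reduce_counts(country, counts) for country, counts in grouped.items()
)
print(views_per_country)
```

Because each map call looks at only one record and each reduce call at only one key, both steps can be run in parallel on separate machines.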

Apache: In web contexts, Apache usually refers to the Apache HTTP Server, a free and open-source web server maintained by the Apache Software Foundation. It is written in C and has long been one of the most widely used web servers in the world, with a focus on stability, reliability and speed.

R: R is an open-source programming language specifically designed for statistical computing. It’s used by statisticians, data miners, scientists and analysts to perform a variety of tasks such as statistical analysis, data visualization and predictive modelling.

MySQL: MySQL is an open-source relational database management system (RDBMS) that uses Structured Query Language (SQL). MySQL is mainly used for business applications, data warehouses and websites with high traffic.

Node.js: Node.js is an open-source, cross-platform JavaScript runtime environment that executes JavaScript code outside of a browser. It’s mainly used for developing scalable network applications and real-time apps with push capabilities.

Programming Language: A programming language is a set of rules and specifications that govern how commands are given to a computer. There are many different types of programming languages, some more common than others. C#, Java, Python and Ruby are all examples of popular programming languages that developers often use in their day-to-day jobs.

JSON: JSON stands for JavaScript Object Notation and is a lightweight, text-based open standard designed for transmitting data. It’s used primarily in web-based programming, but is also widely used in non-web applications because it is generally easier to read and write than XML.
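
For a quick feel of what JSON looks like, this sketch converts a made-up record to JSON text and back again using the json module from Python’s standard library.

```python
import json

# Hypothetical record describing one person.
person = {"name": "Asha", "age": 31, "skills": ["SQL", "R"]}

# Serialize the record to JSON text, ready to be sent or stored.
json_text = json.dumps(person)

# Parse the text back into ordinary Python objects.
restored = json.loads(json_text)

print(json_text)
```

The printed text is plain characters, which is why any language with a JSON parser can read it.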

XML: XML stands for Extensible Markup Language and is a flexible text-based format that uses tags and attributes to define data structures. This makes it well suited to sharing data across different platforms, but does mean it’s not as easy to read and write as JSON (see above).
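
The sketch below shows tags and attributes in practice by parsing a small made-up document with the xml.etree.ElementTree module from Python’s standard library. The people and person elements are invented for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: tags name the data, and "id" is an attribute.
xml_text = """
<people>
  <person id="1"><name>Asha</name><age>31</age></person>
  <person id="2"><name>Ben</name><age>45</age></person>
</people>
""".strip()

root = ET.fromstring(xml_text)

# Walk every <person> element and pull out its attribute and child values.
people = [
    {
        "id": person.get("id"),
        "name": person.findtext("name"),
        "age": int(person.findtext("age")),
    }
    for person in root.findall("person")
]

print(people)
```

Note that XML values arrive as text, so the age has to be converted to a number explicitly.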

Object-Oriented Programming: Object-oriented programming (OOP) is a programming paradigm that organises a program as a set of objects which interact with each other. Each object bundles data (attributes) with the behaviour (methods) that acts on that data, and objects are often modelled on real-life things such as people, spaceships and robots.

Robot: A robot is an automatically operated machine that doesn’t need constant human control to complete its tasks. Robots can be used for many different jobs such as welding, cleaning and even delivering goods.
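
A minimal sketch of two objects interacting, using a person and a spaceship. The class names, the fuel rule and the values are all invented for illustration.

```python
class Spaceship:
    """An object that holds its own data (name, fuel) and behaviour (fly)."""

    def __init__(self, name, fuel):
        self.name = name
        self.fuel = fuel

    def fly(self, distance):
        """Burn one unit of fuel per unit of distance; refuse if fuel is short."""
        if distance > self.fuel:
            raise ValueError(f"{self.name} does not have enough fuel")
        self.fuel -= distance


class Person:
    """A second kind of object that interacts with a Spaceship."""

    def __init__(self, name):
        self.name = name

    def pilot(self, ship, distance):
        """Ask the ship to fly, then report what happened."""
        ship.fly(distance)
        return f"{self.name} flew {ship.name} for {distance} units"


ship = Spaceship("Voyager", fuel=100)
pilot = Person("Asha")
message = pilot.pilot(ship, 30)
print(message)
```

The person never changes the fuel directly. It asks the ship to fly, and the ship updates its own state, which is the kind of interaction OOP is built around.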

Mainframes: A mainframe is a large, high-reliability computer used for storing and processing very large volumes of data, such as bank transactions. Mainframes are far more powerful than a typical home computer and are built for high throughput and availability rather than for general desktop use.

Processing Language: Processing is a programming language made for artists, designers and beginners that runs on your typical home computer or laptop. Thanks to its simple interface, getting started with Processing isn’t difficult at all and there are plenty of guides available online that will help you understand the basics.

Platform: A platform is a system on which other products or services can be built or run, and which provides the tools needed to do so. Facebook, Google and Twitter are all examples of popular platforms.

That concludes our list of terms related to the field of data science.

Thanks for reading!