Data Science Consultant at almaBetter
Welcome to this interview preparation guide for data science positions. We will be exploring a range of topics and questions commonly encountered in data science interviews. Data science is a dynamic field that involves extracting insights and knowledge from data to drive informed decision-making. As a data science candidate, your ability to showcase your technical skills, problem-solving prowess, and domain knowledge is crucial during the interview process.
Throughout this data scientist interview questions guide, we will cover a variety of areas, including data manipulation, statistical analysis, machine learning, data visualization, data analysis interview questions and more. By the end of this article, you'll have a solid understanding of the types of questions you might face and how to approach answering them effectively.
Now, we will delve into a selection of data science coding interview questions specifically designed for candidates who are new to the field and looking to start their careers in data science. Let's get started with the basic data science interview questions!
1. What is the difference between data analytics and data science?
| Aspect | Data Analytics | Data Science |
| --- | --- | --- |
| Purpose | Focuses on analyzing historical data for insights | Aims to extract insights, build models, and innovate |
| Approach | Works with existing data and hypotheses | Utilizes mathematical, statistical, and scientific tools |
| Focus | Present data analysis and interpretation | Addresses futuristic problems and predictive modeling |
| Tools and Methods | Uses fewer statistical and visualization tools | Utilizes a broad range of tools, algorithms, and techniques |
| Decision-making | Supports better decision-making based on history | Drives innovation and answers complex questions |
| Scope | Specific and concentrated problems | Broader and more comprehensive problem-solving |
| Timeframe | Focuses on present and past data | Incorporates historical, current, and future data |
| Insights Generation | Derives meaning from existing historical context | Builds connections and uncovers insights for the future |
2. What are some of the techniques used for sampling? What is the main advantage of sampling?
Data analysis usually cannot be run on the entire volume of data at once, especially when larger datasets are involved. It therefore becomes crucial to draw samples that represent the whole population and to perform the analysis on them. When doing this, the sample must be chosen carefully so that it truly represents the entire dataset.
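Sampling techniques fall into two major categories, depending on whether they rely on statistics: probability sampling techniques (for example simple random, stratified, and clustered sampling) and non-probability sampling techniques (for example convenience, quota, and snowball sampling).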
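As a rough illustration of drawing a representative sample, the sketch below contrasts simple random sampling with stratified sampling using only Python's standard library. The population, the group labels "A" and "B", and the 10% sampling fraction are made-up assumptions for illustration.

```python
import random
from collections import Counter, defaultdict

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical population: 900 records in group "A", 100 in group "B".
population = [("A", i) for i in range(900)] + [("B", i) for i in range(100)]

def simple_random_sample(rows, fraction):
    """Pick rows uniformly at random, ignoring group membership."""
    return random.sample(rows, int(len(rows) * fraction))

def stratified_sample(rows, fraction):
    """Sample the same fraction from every group so proportions are preserved."""
    by_group = defaultdict(list)
    for row in rows:
        by_group[row[0]].append(row)
    sample = []
    for members in by_group.values():
        sample.extend(random.sample(members, int(len(members) * fraction)))
    return sample

srs = simple_random_sample(population, 0.10)
strat = stratified_sample(population, 0.10)

print(Counter(group for group, _ in srs))    # group counts depend on the random draw
print(Counter(group for group, _ in strat))  # exactly 90 "A" and 10 "B"
```

The stratified sample keeps the 9:1 group ratio of the population by construction, whereas the simple random sample only matches it on average.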
3. What does it mean when the p-values are high and low?
A p-value is the probability of obtaining results at least as extreme as the results actually observed, assuming that the null hypothesis is correct. Informally, it indicates how likely a difference this large would be if only chance were at work.
4. Define the terms KPI, lift, model fitting, robustness and DOE.
5. Define and explain selection bias?
Selection bias occurs when the researcher has to decide which participants to study. It is associated with research in which the participant selection is not random, and it is also called the selection effect. Selection bias is caused by the method of sample collection.
Four types of selection bias are explained below:
6. Define bias-variance trade-off?
Answer: Let us first understand the meaning of bias and variance in detail:
Bias: Bias is the error introduced into a machine learning model when an ML algorithm is oversimplified. While it is being trained, the model makes simplifying assumptions so that the target function is easier to learn. Algorithms with low bias include decision trees and SVMs; linear and logistic regression, on the other hand, are high-bias algorithms.
Variance: Variance is also a kind of error. It is introduced into an ML model when an ML algorithm is made highly complex, so that the model also learns the noise in the training data set and then performs badly on the test data set. This may lead to overfitting as well as high sensitivity to the training data.
When the complexity of a model is increased, the error initially falls because the bias of the model decreases. This continues only until we reach a particular point called the optimal point. Beyond this point, if we keep increasing the complexity of the model, it overfits and suffers from high variance.
We can represent the bias-variance trade-off as:
[Figure: Bias-Variance Trade-off]
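The effect of model complexity on training and test error can be sketched with a small experiment. The example below is an illustration under assumptions, not a method from this guide: it uses k-nearest-neighbour regression on synthetic noisy sine data, where a small k gives a flexible model (low bias, high variance) and a large k gives a rigid one (high bias, low variance). The data, the noise level and the values of k are arbitrary choices.

```python
import math
import random

random.seed(1)

def make_data(n):
    """Noisy samples of sin(x) on [0, 2*pi]; the noise level is an arbitrary choice."""
    xs = [random.uniform(0, 2 * math.pi) for _ in range(n)]
    return [(x, math.sin(x) + random.gauss(0, 0.3)) for x in xs]

train, test = make_data(60), make_data(60)

def knn_predict(x, k):
    """Average the targets of the k training points closest to x."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    return sum(y for _, y in nearest) / k

def mse(data, k):
    """Mean squared error of the k-NN predictions on the given data."""
    return sum((y - knn_predict(x, k)) ** 2 for x, y in data) / len(data)

# Small k = complex, flexible model; large k = simple, smooth model.
results = {k: (mse(train, k), mse(test, k)) for k in (1, 5, 60)}
for k, (train_err, test_err) in results.items():
    print(f"k={k:2d}  train MSE={train_err:.3f}  test MSE={test_err:.3f}")
```

With k = 1 the model reproduces every training point, so its training error is zero while its test error is not; with k = 60 it predicts the overall training mean everywhere, which is the high-bias extreme.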
7. What is logistic regression? State an example where you have recently used logistic regression.
Answer: Logistic regression, also known as the logit model, is a technique for predicting a binary outcome from a linear combination of variables (called the predictor variables).
For example, let us say that we want to predict the outcome of an election for a particular political leader, that is, whether this leader is going to win the election or not. The result is binary: win (1) or loss (0). The input, however, is a linear combination of variables such as the money spent on advertising, the past work done by the leader and the party, etc.
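A minimal sketch of the election example, assuming two numeric predictors: the logistic (sigmoid) function turns the linear combination into a probability, which is then thresholded at 0.5. The function name, the predictor names and all coefficient values are hypothetical; in practice the coefficients are estimated from data.

```python
import math

def win_probability(ad_spend_millions, past_work_score):
    """Logistic model: sigmoid of a linear combination of the predictors."""
    # Hypothetical coefficients; a real model would learn these from data.
    intercept, w_spend, w_work = -4.0, 0.5, 0.3
    z = intercept + w_spend * ad_spend_millions + w_work * past_work_score
    return 1.0 / (1.0 + math.exp(-z))

p = win_probability(ad_spend_millions=4.0, past_work_score=8.0)
prediction = 1 if p >= 0.5 else 0  # win (1) or loss (0)
print(round(p, 3), prediction)
```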
8. In a time interval of 15-minutes, the probability that you may see a shooting star or a bunch of them is 0.2. What is the percentage chance of you seeing at least one star shooting from the sky if you are under it for about an hour?
Let Prob be the probability of seeing at least one shooting star in a 15-minute interval.
So, Prob = 0.2
Now, the probability of not seeing any shooting star in a 15-minute interval is 1 - Prob:
1 - 0.2 = 0.8
Assuming the four 15-minute intervals in an hour are independent, the probability of not seeing any shooting star for an hour is:
= 0.8 * 0.8 * 0.8 * 0.8 = (0.8)⁴ = 0.4096
So, the probability of seeing at least one shooting star in the time interval of an hour is 1 - 0.4096 = 0.5904
So, there is approximately a 59% chance of seeing at least one shooting star in the time span of an hour.
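The same arithmetic can be checked in a few lines of Python. The variable names are illustrative, and the calculation assumes the four 15-minute intervals are independent.

```python
p_star_15_min = 0.2            # chance of at least one shooting star in 15 minutes
intervals_per_hour = 60 // 15  # four 15-minute intervals in an hour

p_none_in_hour = (1 - p_star_15_min) ** intervals_per_hour
p_at_least_one = 1 - p_none_in_hour

print(f"{p_at_least_one:.4f}")  # roughly 0.59
```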
9. Define the confusion matrix
A confusion matrix for a binary classifier is a matrix with 2 rows and 2 columns. It holds the 4 possible outcomes of the classifier's predictions and is used to derive measures such as accuracy, error rate, precision, specificity, and sensitivity (also called recall). The test data set should contain both the correct (observed) labels and the predicted labels. If the binary classifier performs perfectly, the predicted labels are identical to the observed labels; in real-world scenarios, they match only part of the observed labels.
True Positive: This means that the positive prediction is correct.
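To make the four cells concrete, here is a small sketch that counts them from observed and predicted labels and derives a few of the measures named above. The labels are made up, and the function name is illustrative.

```python
def confusion_counts(actual, predicted):
    """Return (TP, FP, FN, TN) for binary labels where 1 is the positive class."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)
    return tp, fp, fn, tn

# Made-up labels for illustration.
actual    = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

tp, fp, fn, tn = confusion_counts(actual, predicted)
accuracy    = (tp + tn) / len(actual)
precision   = tp / (tp + fp)
sensitivity = tp / (tp + fn)   # also called recall
specificity = tn / (tn + fp)

print(tp, fp, fn, tn)
print(accuracy, precision, sensitivity, specificity)
```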
10. What is a Gradient and Gradient Descent?
Gradient: A gradient measures how much the output of a function changes in response to a small change in its input. In the context of training, it measures the change in the error with respect to a change in the weights. The gradient can be mathematically represented as the slope of a function.
Gradient Descent: Gradient descent is a minimization algorithm. It can minimize any differentiable function given to it, but in machine learning it is usually applied to the cost (loss) function.
Gradient descent, as the name suggests, means a descent or a decrease in something. The usual analogy is a person climbing down a hill or mountain. The following equation describes what gradient descent does:
$$b = a - \gamma \nabla f(a)$$
So, if a person is climbing down the hill, "a" is the climber's current position and "b" is the next position that the climber has to come to. The minus sign denotes minimization (as gradient descent is a minimization algorithm). The gamma ($\gamma$) is called a weighting factor (the step size, or learning rate). The remaining term, the gradient $\nabla f(a)$, points in the direction of steepest ascent, so stepping against it moves in the direction of steepest descent.
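The update rule can be sketched in a few lines. This is a minimal illustration, assuming a one-dimensional function f(x) = (x - 3)², whose gradient is 2(x - 3); here `a` is the current position, `b` the next position, and `gamma` the weighting factor. The function, the starting point and the step count are arbitrary choices.

```python
def gradient_descent(grad, start, gamma=0.1, steps=100):
    """Repeatedly step against the gradient: b = a - gamma * grad(a)."""
    a = start
    for _ in range(steps):
        b = a - gamma * grad(a)  # next position
        a = b                    # move there and repeat
    return a

# f(x) = (x - 3)**2 has its minimum at x = 3; its gradient is 2 * (x - 3).
minimum = gradient_descent(lambda x: 2 * (x - 3), start=10.0)
print(minimum)
```

A larger `gamma` takes bigger steps and may overshoot the minimum; a smaller one converges more slowly.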
Here, we will delve into a set of interview questions on data science specifically crafted for candidates who possess significant experience in the field. Let's dive into the interview questions for data scientists!
1. How are the time series problems different from other regression problems?
Time series forecasting can be thought of as an extension of linear regression that uses concepts such as autocorrelation and moving averages to summarize the historical data of the y-axis variable in order to predict the future better.
2. So, you have done some projects in machine learning and data science and we see you are a bit experienced in the field. Let’s say your laptop’s RAM is only 4GB and you want to train your model on a 10GB data set. What will you do? Have you experienced such an issue before?
If I were faced with the situation of having a laptop with only 4GB of RAM and needing to train a model on a 10GB dataset, I would need to carefully consider several strategies to overcome this memory limitation. Here are a few approaches I might consider:
One of the simplest approaches is to sample a subset of the data for training. By selecting a representative sample, you can still train your model on a smaller scale while maintaining diversity in the data. However, this approach might lead to some loss of information and potentially less accurate results.
Evaluate the dataset's features and prioritize those that are most relevant to the task at hand. Removing less important or redundant features can help reduce the memory footprint and speed up training. Libraries like scikit-learn offer methods for feature selection.
Transform and engineer new features from the existing ones. Sometimes, creating meaningful derived features can reduce the complexity of the data and improve model performance.
Use incremental learning techniques to train the model in smaller batches. This involves dividing the data into chunks and training the model iteratively on each chunk. This approach is particularly useful for algorithms that support online learning.
Optimize data preprocessing steps to minimize memory usage. For example, consider loading only the required data columns into memory and saving intermediate results to disk instead of keeping them all in memory.
Utilize cloud-based services with larger computing resources, such as Amazon Web Services (AWS) or Google Cloud, to perform the training. Cloud platforms offer scalable resources and memory capacity that can handle larger datasets.
Reducing Model Complexity:
Use simpler models that require fewer parameters, which can reduce memory requirements during training. For example, you might choose a linear model over a complex neural network.
Apply techniques like Principal Component Analysis (PCA) to reduce the dimensionality of the data. This can help retain important information while reducing memory usage.
Train the model in smaller segments and save checkpoints of the model's progress. This way, if the training process is interrupted due to memory limitations, you can resume training from the last checkpoint.
Consider using pre-trained models for certain tasks and fine-tuning them with your specific data. Transfer learning can reduce the amount of training needed on your end.
Ultimately, the approach chosen would depend on the specific characteristics of the dataset, the machine learning task, and the resources available. It's important to carefully analyze the trade-offs between accuracy, memory usage, and computational time when making a decision.
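The chunking idea behind several of these strategies can be sketched with the standard library alone: read the file a few rows at a time, keep only the required column, and update a running result instead of holding everything in memory. The column name `price`, the chunk size and the in-memory stand-in file are assumptions for illustration; libraries offer the same pattern (for example, `pandas.read_csv` accepts a `chunksize` argument).

```python
import csv
import io

def iter_chunks(file_obj, column, chunk_size):
    """Yield lists of floats from one CSV column, chunk_size rows at a time."""
    reader = csv.DictReader(file_obj)
    chunk = []
    for row in reader:
        chunk.append(float(row[column]))  # only the required column is kept
        if len(chunk) == chunk_size:
            yield chunk
            chunk = []
    if chunk:
        yield chunk  # last, possibly shorter, chunk

def streaming_mean(chunks):
    """Update a running mean one chunk at a time."""
    count, total = 0, 0.0
    for chunk in chunks:
        count += len(chunk)
        total += sum(chunk)
    return total / count

# Stand-in for a large file on disk; a real run would pass an open file handle.
fake_file = io.StringIO("id,price\n1,10\n2,20\n3,30\n4,40\n5,50\n")
mean_price = streaming_mean(iter_chunks(fake_file, "price", chunk_size=2))
print(mean_price)
```

Training works the same way when the algorithm supports online learning: each chunk updates the model and is then discarded.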
3. What are Exploding Gradients and Vanishing Gradients?
"Exploding gradients" and "vanishing gradients" are two related issues that can occur during the training of deep neural networks. These problems arise in the context of gradient-based optimization algorithms, such as gradient descent, which are used to update the weights of neural network layers during training. Let's explore both concepts:
[Figure: Vanishing Gradient and Exploding Gradient]
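A toy calculation shows why depth matters. During backpropagation the gradient reaching an early layer is a product of per-layer terms, so it shrinks towards zero if those terms are typically below 1 and grows without bound if they are above 1. The sketch below is a simplification that assumes the same factor at every layer; the factors 0.5 and 1.5 and the depth of 50 are arbitrary.

```python
def backpropagated_gradient(per_layer_factor, num_layers):
    """Multiply a gradient of 1.0 by the same factor once per layer."""
    gradient = 1.0
    for _ in range(num_layers):
        gradient *= per_layer_factor
    return gradient

vanishing = backpropagated_gradient(0.5, 50)  # factors below 1 shrink the gradient
exploding = backpropagated_gradient(1.5, 50)  # factors above 1 blow it up
print(f"{vanishing:.2e}  {exploding:.2e}")
```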
4. Suppose there is a dataset having variables with missing values of more than 30%, how will you deal with such a dataset?
Depending on the size of the dataset, we can proceed in the following ways:
5. How regularly must we update an algorithm in the field of machine learning?
We do not want to change an algorithm on a regular basis: an algorithm is a well-defined, step-by-step procedure for solving a problem, and if its steps keep changing, it can no longer be called well defined. Frequent changes also cause problems for the systems that already implement the algorithm, as continuous and regular changes are difficult to roll out. So, we should update an algorithm only in any of the following cases:
6. Why is data cleaning crucial?
While running an algorithm on any data, to gather proper insights, it is very much necessary to have correct and clean data that contains only relevant information. Dirty data most often results in poor or incorrect insights and predictions which can have damaging effects.
For example, while launching any big campaign to market a product, if our data analysis tells us to target a product that in reality has no demand and if the campaign is launched, it is bound to fail. This results in a loss of the company’s revenue. This is where the importance of having proper and clean data comes into the picture.
7. During analysis, how do you treat the missing values?
To assess the extent of missing values, we first have to identify the variables that contain them. If a pattern is found in the missing values, the analyst should concentrate on it, as it could lead to interesting and meaningful insights. However, if no pattern is identified, we can substitute the missing values with the median or mean values, or we can simply ignore the missing values.
If the variable is categorical, the common strategies for handling missing values include:
It's important to select an appropriate strategy based on the nature of the data and the potential impact on subsequent analysis or modelling.
If 80% of the values are missing for a particular variable, then we would drop the variable instead of treating the missing values.
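These rules can be sketched as a small helper. It drops a variable when 80% or more of its values are missing and otherwise fills the gaps with the median. Filling categorical variables with the most frequent value (the mode) is an assumption of this sketch, as are the function name, the `None` marker and the sample columns.

```python
import statistics

MISSING = None  # marker used for missing entries in this sketch

def treat_missing(column, drop_threshold=0.8):
    """Return the column with gaps filled, or None if the column should be dropped."""
    present = [v for v in column if v is not MISSING]
    missing_share = (len(column) - len(present)) / len(column)
    if missing_share >= drop_threshold:
        return None  # too sparse: drop the variable
    if all(isinstance(v, (int, float)) for v in present):
        fill = statistics.median(present)  # numeric: use the median
    else:
        fill = statistics.mode(present)    # categorical: most frequent value (assumption)
    return [fill if v is MISSING else v for v in column]

ages   = [25, None, 31, 40, None]
cities = ["Pune", "Delhi", None, "Pune", "Pune"]
sparse = [None, None, None, None, 7]

print(treat_missing(ages))
print(treat_missing(cities))
print(treat_missing(sparse))
```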
8. Differentiate between box plot and histogram.
Box plots and histograms are both visualizations used for showing data distributions for efficient communication of information.
Histograms are bar charts that show the frequency of the values of a numerical variable; they are useful for estimating the probability distribution, variation, and outliers.
Box plots communicate different aspects of a data distribution; the shape of the distribution is not visible, but insights can still be gathered. Because they take less space than histograms, they are useful for comparing multiple distributions at the same time.
[Figure: Histogram and Boxplot]
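The difference shows up in what each plot is computed from. The sketch below, on made-up values, builds the bin counts a histogram would draw and the five-number summary a box plot is commonly based on. The bin width of 2 and the "inclusive" quartile method are arbitrary choices for illustration.

```python
import statistics
from collections import Counter

data = [1, 2, 2, 3, 3, 3, 4, 4, 5, 9]  # made-up values

# Histogram: count how many values fall into each bin of width 2.
bin_width = 2
histogram = Counter((value // bin_width) * bin_width for value in data)
for start in sorted(histogram):
    print(f"[{start}, {start + bin_width}): {'#' * histogram[start]}")

# Box plot: minimum, first quartile, median, third quartile, maximum.
q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
five_number_summary = (min(data), q1, median, q3, max(data))
print(five_number_summary)
```

The histogram keeps the shape of the distribution, while the box plot compresses it to five numbers, which is why several box plots fit side by side.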
9. How will you balance/correct imbalanced data?
There are different techniques to correct or balance imbalanced data. The number of samples can be increased for minority classes, or decreased for classes with an extremely high number of data points. Following are some approaches followed to balance data:
For example, consider the below graph that illustrates training data:
Here, if we measure the accuracy of the model in terms of getting "0"s, then the accuracy of the model would be very high -> 99.9%, but the model does not guarantee any valuable information. In such cases, we can apply different evaluation metrics as stated above.
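The accuracy figure above is easy to reproduce. This sketch assumes a hypothetical dataset with 999 negatives and one positive, and a "model" that always predicts the majority class; recall is shown as one example of an alternative metric.

```python
# Hypothetical imbalanced labels: 999 negatives ("0") and a single positive ("1").
actual = [0] * 999 + [1]
predicted = [0] * 1000  # a "model" that always predicts the majority class

accuracy = sum(a == p for a, p in zip(actual, predicted)) / len(actual)
true_positives = sum(a == 1 and p == 1 for a, p in zip(actual, predicted))
recall = true_positives / sum(actual)  # share of actual positives that were found

print(f"accuracy = {accuracy:.1%}, recall = {recall:.1%}")
```

Accuracy comes out at 99.9% even though the model never finds the one positive case, which is why accuracy alone is misleading on imbalanced data.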
10. What are some examples where a false positive has proven more important than a false negative?
Before citing instances, let us understand what false positives and false negatives are.
Some examples where false positives were more important than false negatives are:
11. How do you clean data?
Answer: Cleaning data is an essential step in the data preprocessing process, as it ensures that the data used for analysis or modeling is accurate, consistent, and reliable. Here are the general steps and techniques involved in cleaning data:
3. Correct Inaccurate Values: Identify and correct data points that are clearly inaccurate or erroneous. This might involve cross-referencing with external sources or applying domain knowledge.
4. Standardize Data Formats: Ensure consistent formats for data, such as date formats, units of measurement, and categorical values. Inconsistent formatting can lead to misinterpretation.
5. Remove Outliers: Identify and handle outliers (data points that significantly deviate from the rest of the dataset). Depending on the context, outliers can be removed, transformed, or kept based on domain knowledge.
6. Handle Inconsistent Categorical Data: If categorical data has multiple representations (e.g., "Male," "M," "M "), standardize them to a single representation. Also, consider merging or grouping categories if they're similar.
7. Validate Data Integrity: Check for referential integrity, where relationships between tables or datasets are maintained properly.
8. Check for Typos and Errors: Scrutinize the data for common typos, spelling mistakes, and data entry errors. Regular expressions or automated scripts can help with this.
9. Ensure Data Conformity: Ensure that data adheres to predefined rules or constraints. For example, check that age values are within reasonable ranges.
10. Validate Data Against Domain Knowledge: Validate the data against your domain knowledge and business rules to identify any anomalies or inconsistencies.
11. Explore Visualizations: Visualize the data using plots, histograms, and other visualizations to identify patterns, anomalies, or odd distributions.
12. Document Changes: Keep a record of the changes you make during data cleaning, as this can help others understand the preprocessing steps taken.
Data cleaning is an iterative process, and you might need to revisit these steps as you analyze or model the data. The specific techniques you use will depend on the nature of your dataset and the goals of your analysis or modeling.
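A few of these steps can be sketched on a single record: standardizing categorical values, checking that ages fall in a plausible range, and unifying date formats with a regular expression. The field names, the gender mapping, the 0 to 120 age range and the date formats are all assumptions made for illustration.

```python
import re

# Hypothetical mapping of inconsistent spellings to one canonical label.
GENDER_MAP = {"male": "Male", "m": "Male", "female": "Female", "f": "Female"}

def clean_record(record):
    """Return a cleaned copy of one record, plus a list of issues found."""
    issues = []
    cleaned = dict(record)

    # Standardize categorical values: trim whitespace, ignore case.
    key = record["gender"].strip().lower()
    cleaned["gender"] = GENDER_MAP.get(key)
    if cleaned["gender"] is None:
        issues.append(f"unknown gender {record['gender']!r}")

    # Conformity check: age must fall within a plausible range.
    if not 0 <= record["age"] <= 120:
        issues.append(f"age out of range: {record['age']}")
        cleaned["age"] = None

    # Standardize date format from DD/MM/YYYY to YYYY-MM-DD.
    match = re.fullmatch(r"(\d{2})/(\d{2})/(\d{4})", record["joined"])
    if match:
        day, month, year = match.groups()
        cleaned["joined"] = f"{year}-{month}-{day}"

    return cleaned, issues

raw = [
    {"gender": "M ", "age": 34, "joined": "05/01/2021"},
    {"gender": "Male", "age": 430, "joined": "2021-02-11"},
]
for record in raw:
    print(clean_record(record))
```

Returning the list of issues alongside the cleaned record also gives a simple way to document what was changed.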
1. Explain the steps in making a decision tree.
Answer: Creating a decision tree involves a series of steps to build a predictive model that makes decisions based on input features. Decision trees are a popular machine learning algorithm used for both classification and regression tasks. Here's a step-by-step breakdown of how to construct a decision tree:
2. How do you build a random forest model?
A random forest is built from a number of decision trees. If you split the data into different subsets and build a decision tree on each subset, the random forest brings all those trees together.
3. How should you maintain a deployed model?
Answer: The steps to maintain a deployed model are:
Constant monitoring of all models is needed to determine their performance accuracy. When you change something, you want to figure out how your changes are going to affect things. This needs to be monitored to ensure it's doing what it's supposed to do.
Evaluation metrics of the current model are calculated to determine if a new algorithm is needed.
The new models are compared to each other to determine which model performs the best.
The best-performing model is re-built on the current state of data.
4. How can outlier values be treated?
Answer: You can drop outliers only if they are garbage values.
Example: height of an adult = abc ft. This cannot be true, as height cannot be a string value. In this case, the outlier can be removed.
Outliers with extreme values can also be removed. For example, if all the data points are clustered between 0 and 10, but one point lies at 100, then we can remove this point.
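Both cases can be sketched in a few lines. The garbage value is removed with a type check, and the extreme value is flagged with the common 1.5 × IQR rule. That rule is an assumption of this sketch rather than something prescribed above, and the data values are made up.

```python
import statistics

def split_outliers(values, k=1.5):
    """Separate values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    kept = [v for v in values if low <= v <= high]
    dropped = [v for v in values if not low <= v <= high]
    return kept, dropped

raw = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 100, "abc"]

# "abc" is a garbage value: a measurement cannot be a string.
numeric = [v for v in raw if isinstance(v, (int, float))]

kept, dropped = split_outliers(numeric)
print(kept)
print(dropped)
```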
If you cannot drop outliers, you can try the following:
5. ‘People who bought this also bought…' recommendations seen on Amazon are a result of which algorithm?
The recommendation engine is built with collaborative filtering. Collaborative filtering exploits the behavior of other users and their purchase history in terms of ratings, selections, etc.
The engine makes predictions on what might interest a person based on the preferences of other users. In this algorithm, item features are unknown.
For example, a sales page shows that a certain number of people buy a new phone and also buy tempered glass at the same time. Next time, when a person buys a phone, he or she may see a recommendation to buy tempered glass as well.
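The phone and tempered glass example can be sketched by counting which items appear together in past orders. This is a deliberately simple item co-occurrence count, one basic flavour of collaborative filtering and not a description of Amazon's actual system; the orders and the function name are hypothetical.

```python
from collections import Counter
from itertools import permutations

# Hypothetical purchase baskets, one set of items per order.
orders = [
    {"phone", "tempered glass"},
    {"phone", "tempered glass", "case"},
    {"phone", "tempered glass"},
    {"phone", "case"},
    {"laptop", "mouse"},
]

# Count how often each ordered pair of items appears in the same basket.
co_purchases = Counter()
for basket in orders:
    for item_a, item_b in permutations(basket, 2):
        co_purchases[(item_a, item_b)] += 1

def also_bought(item, top_n=2):
    """Items most often bought together with `item`, most frequent first."""
    scores = Counter({b: n for (a, b), n in co_purchases.items() if a == item})
    return [name for name, _ in scores.most_common(top_n)]

print(also_bought("phone"))
```

Note that nothing about the items themselves is used; the recommendation comes purely from the behavior of other buyers.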
6. What do you understand about true positive rate and false-positive rate?
True Positive Rate (TPR) is the probability that an actual positive is predicted as positive.
TPR is calculated as the ratio of the True Positives (TP) to the sum of the True Positives (TP) and False Negatives (FN):
$$\mathrm{TPR} = \frac{TP}{TP + FN}$$
False Positive Rate (FPR) is the probability that an actual negative is shown as a positive, i.e., the probability that a model generates a false alarm.
FPR is calculated as the ratio of the False Positives (FP) to the sum of the False Positives (FP) and True Negatives (TN):
$$\mathrm{FPR} = \frac{FP}{FP + TN}$$
7. What is deep learning?
Deep learning is a subfield of machine learning that focuses on the development and use of neural networks to solve complex problems. Neural networks are computational models inspired by the structure and function of the human brain's interconnected neurons. Deep learning involves training these neural networks on large amounts of data to learn patterns and representations that can be used for tasks such as image and speech recognition, natural language processing, and more.
The term "deep" in deep learning refers to the depth of the neural networks, which are composed of multiple layers of interconnected nodes or neurons. Each layer transforms the data it receives and passes it to the next layer, with the final layer producing the desired output. The process of training a deep neural network involves adjusting the weights and biases of the network's connections using optimization algorithms, so that the network can learn to make accurate predictions or classifications.
8. Which of the following machine learning algorithms can be used for imputing missing values of both categorical and continuous variables?
The K nearest neighbor algorithm can be used because, when a value is missing, it can still find the nearest neighbors based on all the other features and fill in the value from them.
When you're dealing with K-means clustering or linear regression, missing values must be handled in pre-processing; otherwise, the algorithms will crash. Decision trees also have the same problem, although there is some variation.
9. How can you select k for k-means?
We use the elbow method to select k for k-means clustering. The idea of the elbow method is to run k-means clustering on the data set for a range of values of k, where k is the number of clusters, and to compute the within-cluster sum of squares (WSS) for each value. The k at which the WSS stops dropping sharply, the "elbow" of the curve, is chosen.
WSS is defined as the sum of the squared distances between each member of a cluster and its centroid.
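The elbow method can be sketched end to end on toy data. The example below runs a plain k-means (Lloyd's algorithm) on one-dimensional points and prints the WSS for k = 1 to 4. The data, the deterministic choice of starting centroids and the iteration count are assumptions made to keep the sketch short and reproducible.

```python
def kmeans_1d(points, k, iterations=20):
    """Plain Lloyd's algorithm on 1-D data with a deterministic start."""
    ordered = sorted(points)
    n = len(ordered)
    # Assumption: start from k evenly spaced points of the sorted data.
    centroids = [ordered[int((i + 0.5) * n / k)] for i in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in ordered:
            nearest = min(range(k), key=lambda c: abs(p - centroids[c]))
            clusters[nearest].append(p)
        # Move each centroid to the mean of its members (keep it if the cluster is empty).
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

def wss(centroids, clusters):
    """Within-cluster sum of squares: squared distance of each point to its centroid."""
    return sum((p - centroids[i]) ** 2
               for i, cluster in enumerate(clusters) for p in cluster)

data = [1, 2, 3, 11, 12, 13, 21, 22, 23]  # three obvious groups
wss_by_k = {k: wss(*kmeans_1d(data, k)) for k in range(1, 5)}
for k, value in wss_by_k.items():
    print(k, round(value, 2))
```

On this data the WSS falls steeply up to k = 3 and barely changes afterwards, so the elbow, and the chosen k, is 3.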
10. What are recommender systems?
A recommender system predicts what a user would rate a specific product based on their preferences. It can be split into two different areas:
As an example, Last.fm recommends tracks that other users with similar interests play often. This is also commonly seen on Amazon after making a purchase; customers may notice the following message accompanied by product recommendations: "Users who bought this also bought…"
As an example: Pandora uses the properties of a song to recommend music with similar properties. Here, we look at content, instead of looking at who else is listening to music.
In conclusion, preparing for data science technical interview questions is crucial to stand out and demonstrate your expertise in this dynamic and evolving field. The questions typically cover a wide range of topics, from fundamental concepts to practical applications. By mastering the essential concepts such as data science processes, machine learning techniques, data handling strategies, and problem-solving approaches, you can confidently tackle a variety of questions that may arise during your interviews.
Remember that while technical knowledge is essential, interviewers are also interested in how you think critically, approach real-world problems, and communicate your solutions effectively. Practicing your answers, working through case studies, and discussing your thought processes out loud can help you refine your responses and showcase your analytical skills.
Furthermore, tailoring your preparation to the specific company and role you're applying for can give you a competitive edge. Researching the company's projects, industry focus, and the specific skills they're looking for will allow you to address their needs more directly in your responses.
Ultimately, an interview is an opportunity to demonstrate your passion for data science, your ability to handle complex challenges, and your potential as a valuable contributor to a data-driven organization. By thoroughly preparing for a range of data science interview questions, you can enter your interviews with confidence and increase your chances of success.