Skill Test on Clustering Techniques: Questions and Solutions
Introduction
The idea of creating machines that learn by themselves has been driving humans for decades. Unsupervised learning and clustering are key to fulfilling that dream. Unsupervised learning provides more flexibility, but it is also more challenging.
Clustering plays an important role in drawing insights from unlabeled data. It groups similar data points together, which supports diverse business decisions by providing a meta-level understanding of the data.
In this skill test, we tested our community on clustering techniques. A total of 1566 people registered for this skill test. If you missed taking the test, here is your opportunity to find out how many questions you could have answered correctly.
If you are just getting started with unsupervised learning, here are some comprehensive resources to assist you in your journey:
- Machine Learning Certification Course for Beginners
- The Most Comprehensive Guide to K-Means Clustering You'll Ever Need
- Certified AI & ML Blackbelt+ Program
Overall Results
Below is the distribution of scores; this will help you evaluate your performance:
You can access your performance here. More than 390 people participated in the skill test and the highest score was 33. Here are a few statistics about the distribution.
Overall distribution
Mean Score: 15.11
Median Score: xv
Mode Score: 16
Helpful Resources
An Introduction to Clustering and different methods of clustering
Getting your clustering right (Part I)
Getting your clustering right (Part II)
Questions & Answers
Q1. Movie recommendation systems are an example of:
- Classification
- Clustering
- Reinforcement Learning
- Regression
Options:
A. 2 Only
B. 1 and 2
C. 1 and 3
D. 2 and 3
E. 1, 2 and 3
F. 1, 2, 3 and 4
Solution: (E)
Generally, movie recommendation systems cluster the users into a finite number of similar groups based on their previous activities and profile. Then, at a fundamental level, people in the same cluster are given similar recommendations.
In some scenarios, this can also be approached as a classification problem, assigning the most appropriate movie class to the users of a specific group. Likewise, a movie recommendation system can be viewed as a reinforcement learning problem, where it learns from its previous recommendations and improves its future recommendations.
Q2. Sentiment Analysis is an example of:
- Regression
- Classification
- Clustering
- Reinforcement Learning
Options:
A. 1 Only
B. 1 and 2
C. 1 and 3
D. 1, 2 and 3
E. 1, 2 and 4
F. 1, 2, 3 and 4
Solution: (E)
Sentiment analysis at a fundamental level is the task of classifying the sentiments represented in an image, text or speech into a set of defined sentiment classes like happy, sad, excited, positive, negative, etc. It can also be viewed as a regression problem, assigning a sentiment score of, say, 1 to 10 to a corresponding image, text or speech.
Another way of looking at sentiment analysis is to consider it from a reinforcement learning perspective, where the algorithm constantly learns from the accuracy of past sentiment predictions to improve its future performance.
Q3. Can decision trees be used for performing clustering?
A. True
B. False
Solution: (A)
Decision trees can also be used to form clusters in the data, but clustering often generates natural clusters and is not dependent on any objective function.
Q4. Which of the following is the most appropriate strategy for data cleaning before performing clustering analysis, given a less than desirable number of data points:
- Capping and flooring of variables
- Removal of outliers
Options:
A. 1 only
B. 2 only
C. 1 and 2
D. None of the above
Solution: (A)
Removal of outliers is not recommended if the data points are few in number. In this scenario, capping and flooring of variables is the most appropriate strategy.
Q5. What is the minimum number of variables/features required to perform clustering?
A. 0
B. 1
C. 2
D. 3
Solution: (B)
At least a single variable is required to perform clustering analysis. Clustering analysis with a single variable can be visualized with the help of a histogram.
Q6. For two runs of K-Means clustering, is it expected to get the same clustering results?
A. Yes
B. No
Solution: (B)
The K-Means clustering algorithm converges on local minima, which might also correspond to the global minima in some cases, but not always. Therefore, it's advised to run the K-Means algorithm multiple times before drawing inferences about the clusters.
However, note that it's possible to receive the same clustering results from K-Means by setting the same seed value for each run. That is done by simply making the algorithm choose the same set of random numbers for each run.
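The seed effect can be sketched with a minimal Lloyd's-algorithm implementation in NumPy. This is a toy sketch with made-up data, not a library implementation; the `kmeans` helper and its dataset are assumptions for illustration only:

```python
import numpy as np

def kmeans(X, k, seed, iters=20):
    """Toy Lloyd's algorithm: random initial centroids drawn from the data."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Assign each point to its nearest centroid (squared Euclidean distance).
        labels = np.argmin(((X[:, None] - centroids) ** 2).sum(-1), axis=1)
        # Recompute each centroid as the mean of its assigned points;
        # keep the old centroid if a cluster happens to be empty.
        centroids = np.array([X[labels == j].mean(axis=0) if (labels == j).any()
                              else centroids[j] for j in range(k)])
    return labels, centroids

X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0],
              [5.2, 4.9], [9.0, 0.0], [9.1, 0.2]])
labels_a, _ = kmeans(X, k=3, seed=0)
labels_b, _ = kmeans(X, k=3, seed=0)   # same seed -> same initialization
print((labels_a == labels_b).all())    # identical clustering for the same seed
```

Different seeds may initialize the centroids differently and can therefore converge to different local minima.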
Q7. Is it possible that the assignment of observations to clusters does not change between successive iterations in K-Means?
A. Yes
B. No
C. Can't say
D. None of these
Solution: (A)
When the K-Means algorithm has reached the local or global minima, it will not alter the assignment of data points to clusters for two successive iterations.
Q8. Which of the following can act as possible termination conditions in K-Means?
- A fixed number of iterations has been reached.
- Assignment of observations to clusters does not change between iterations (except for cases with a bad local minimum).
- Centroids do not change between successive iterations.
- Terminate when RSS falls below a threshold.
Options:
A. 1, 3 and 4
B. 1, 2 and 3
C. 1, 2 and 4
D. All of the above
Solution: (D)
All four conditions can be used as possible termination conditions in K-Means clustering:
- This condition limits the runtime of the clustering algorithm, but in some cases the quality of the clustering will be poor because of an insufficient number of iterations.
- Except for cases with a bad local minimum, this produces a good clustering, but runtimes may be unacceptably long.
- This also ensures that the algorithm has converged at the minima.
- Stopping when RSS falls below a threshold ensures that the clustering is of a desired quality after termination. Practically, it's a good practice to combine it with a bound on the number of iterations to guarantee termination.
Q9. Which of the following clustering algorithms suffers from the problem of convergence at local optima?
- K-Means clustering algorithm
- Agglomerative clustering algorithm
- Expectation-Maximization clustering algorithm
- Divisive clustering algorithm
Options:
A. 1 only
B. 2 and 3
C. 2 and 4
D. 1 and 3
E. 1, 2 and 4
F. All of the above
Solution: (D)
Out of the options given, only the K-Means clustering algorithm and the EM clustering algorithm have the drawback of converging at local minima.
Q10. Which of the following algorithms is most sensitive to outliers?
A. K-means clustering algorithm
B. K-medians clustering algorithm
C. K-modes clustering algorithm
D. K-medoids clustering algorithm
Solution: (A)
Out of all the options, the K-Means clustering algorithm is most sensitive to outliers, as it uses the mean of the cluster's data points to find the cluster center.
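A quick illustration of why the mean (used by K-Means) is more outlier-sensitive than the median (used by K-medians); the numbers below are made up:

```python
import numpy as np

points = np.array([1.0, 2.0, 3.0, 4.0, 100.0])  # 100.0 is an outlier

# A single outlier drags the mean far from the bulk of the data,
# while the median barely moves.
print(np.mean(points))    # 22.0
print(np.median(points))  # 3.0
```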
Q11. After performing K-Means clustering analysis on a dataset, you observed the following dendrogram. Which of the following conclusions can be drawn from the dendrogram?
A. There were 28 data points in clustering analysis
B. The best no. of clusters for the analyzed data points is 4
C. The proximity function used is Average-link clustering
D. The above dendrogram interpretation is not possible for K-Means clustering analysis
Solution: (D)
A dendrogram is not possible for K-Means clustering analysis. However, one can create a clustergram based on K-Means clustering analysis.
Q12. How can clustering (unsupervised learning) be used to improve the accuracy of a linear regression model (supervised learning):
- Creating different models for different cluster groups.
- Creating an input feature for cluster IDs as an ordinal variable.
- Creating an input feature for cluster centroids as a continuous variable.
- Creating an input feature for cluster size as a continuous variable.
Options:
A. 1 only
B. 1 and 2
C. 1 and 4
D. 3 only
E. 2 and 4
F. All of the above
Solution: (F)
Creating an input feature for cluster IDs as an ordinal variable or creating an input feature for cluster centroids as a continuous variable might not convey any relevant information to the regression model for multidimensional data. But for clustering in a single dimension, all of the given methods are expected to convey meaningful information to the regression model. For example, to cluster people into two groups based on their hair length, storing the cluster ID as an ordinal variable and the cluster centroids as continuous variables will convey meaningful information.
Q13. What could be the possible reason(s) for producing two different dendrograms using an agglomerative clustering algorithm for the same dataset?
A. Proximity function used
B. No. of data points used
C. No. of variables used
D. B and C only
E. All of the above
Solution: (E)
A change in either the proximity function, the number of data points, or the number of variables will lead to different clustering results and hence different dendrograms.
Q14. In the figure below, if you draw a horizontal line on the y-axis at y=2, what will be the number of clusters formed?
A. 1
B. 2
C. 3
D. 4
Solution: (B)
Since the number of vertical lines intersecting the red horizontal line at y=2 in the dendrogram is 2, two clusters will be formed.
Q15. What is the most appropriate number of clusters for the data points represented by the following dendrogram:
A. 2
B. 4
C. 6
D. 8
Solution: (B)
The number of clusters that best depicts the different groups can be chosen by observing the dendrogram. The best choice is the number of vertical lines in the dendrogram cut by a horizontal line that can traverse the maximum distance vertically without intersecting a cluster.
In the above example, the best choice for the number of clusters is 4, as the red horizontal line in the dendrogram below covers the maximum vertical distance AB.
Q16. In which of the following cases will K-Means clustering fail to give good results?
- Data points with outliers
- Data points with different densities
- Data points with round shapes
- Data points with non-convex shapes
Options:
A. 1 and 2
B. 2 and 3
C. 2 and 4
D. 1, 2 and 4
E. 1, 2, 3 and 4
Solution: (D)
The K-Means clustering algorithm fails to give good results when the data contains outliers, when the density spread of data points across the data space is uneven, and when the data points follow non-convex shapes.
Q17. Which of the following metrics do we have for finding dissimilarity between two clusters in hierarchical clustering?
- Single-link
- Complete-link
- Average-link
Options:
A. 1 and 2
B. 1 and 3
C. 2 and 3
D. 1, 2 and 3
Solution: (D)
All three methods, i.e. single link, complete link and average link, can be used for finding dissimilarity between two clusters in hierarchical clustering.
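The three linkage criteria can be computed by hand for two small 1-D clusters; the points below are made up for illustration:

```python
import numpy as np

A = np.array([0.0, 1.0])   # first cluster
B = np.array([5.0, 9.0])   # second cluster

# Matrix of all pairwise distances between the two clusters.
d = np.abs(A[:, None] - B[None, :])

print(d.min())   # single link   (minimum pairwise distance): 4.0
print(d.max())   # complete link (maximum pairwise distance): 9.0
print(d.mean())  # average link  (mean pairwise distance):    6.5
```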
Q18. Which of the following are true?
- Clustering analysis is negatively affected by multicollinearity of features
- Clustering analysis is negatively affected by heteroscedasticity
Options:
A. 1 only
B. 2 only
C. 1 and 2
D. None of them
Solution: (A)
Clustering analysis is not negatively affected by heteroscedasticity, but the results are negatively impacted by multicollinearity of the features/variables used in clustering, as the correlated features/variables carry more weight in the distance calculation than desired.
Q19. Given six points with the following attributes:
Which of the following clustering representations and dendrograms depicts the use of the MIN or single link proximity function in hierarchical clustering:
A.
B.
C.
D.
Solution: (A)
For the single link or MIN version of hierarchical clustering, the proximity of two clusters is defined as the minimum of the distances between any two points in the different clusters. For instance, from the table, we see that the distance between points 3 and 6 is 0.11, and that is the height at which they are joined into one cluster in the dendrogram. As another example, the distance between clusters {3, 6} and {2, 5} is given by dist({3, 6}, {2, 5}) = min(dist(3, 2), dist(6, 2), dist(3, 5), dist(6, 5)) = min(0.1483, 0.2540, 0.2843, 0.3921) = 0.1483.
Q20. Given six points with the following attributes:
Which of the following clustering representations and dendrograms depicts the use of the MAX or complete link proximity function in hierarchical clustering:
A.
B.
C.
D.
Solution: (B)
For the complete link or MAX version of hierarchical clustering, the proximity of two clusters is defined as the maximum of the distances between any two points in the different clusters. Similarly, here points 3 and 6 are merged first. However, {3, 6} is merged with {4} instead of {2, 5}. This is because dist({3, 6}, {4}) = max(dist(3, 4), dist(6, 4)) = max(0.1513, 0.2216) = 0.2216, which is smaller than dist({3, 6}, {2, 5}) = max(dist(3, 2), dist(6, 2), dist(3, 5), dist(6, 5)) = max(0.1483, 0.2540, 0.2843, 0.3921) = 0.3921 and dist({3, 6}, {1}) = max(dist(3, 1), dist(6, 1)) = max(0.2218, 0.2347) = 0.2347.
Q21. Given six points with the following attributes:
Which of the following clustering representations and dendrograms depicts the use of the group average proximity function in hierarchical clustering:
A.
B.
C.
D.
Solution: (C)
For the group average version of hierarchical clustering, the proximity of two clusters is defined as the average of the pairwise proximities between all pairs of points in the different clusters. This is an intermediate approach between MIN and MAX, expressed by the following equation:
Here are the distances between some of the clusters: dist({3, 6, 4}, {1}) = (0.2218 + 0.3688 + 0.2347)/(3 ∗ 1) = 0.2751. dist({2, 5}, {1}) = (0.2357 + 0.3421)/(2 ∗ 1) = 0.2889. dist({3, 6, 4}, {2, 5}) = (0.1483 + 0.2843 + 0.2540 + 0.3921 + 0.2042 + 0.2932)/(3 ∗ 2) = 0.2637. Because dist({3, 6, 4}, {2, 5}) is smaller than dist({3, 6, 4}, {1}) and dist({2, 5}, {1}), these two clusters are merged at the fourth stage.
Q22. Given, six points with the following attributes:
Which of the following clustering representations and dendrograms depicts the use of Ward's method proximity function in hierarchical clustering:
A.
B.
C.
D.
Solution: (D)
Ward's method is a centroid method. Centroid methods calculate the proximity between two clusters by calculating the distance between the centroids of the clusters. For Ward's method, the proximity between two clusters is defined as the increase in the squared error that results when the two clusters are merged. The figure shows the results of applying Ward's method to the sample data set of six points. The resulting clustering is somewhat different from those produced by MIN, MAX, and group average.
Q23. What should be the best choice of the number of clusters based on the following results:
A. 1
B. 2
C. 3
D. 4
Solution: (C)
The silhouette coefficient is a measure of how similar an object is to its own cluster compared to other clusters. The number of clusters for which the silhouette coefficient is highest represents the best choice of the number of clusters.
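The silhouette coefficient of a single point can be computed by hand as s = (b - a)/max(a, b), where a is the mean distance to the point's own cluster and b the mean distance to the nearest other cluster. A toy 1-D sketch with made-up points:

```python
import numpy as np

cluster_own = np.array([0.0, 1.0])      # x's own cluster; x is the first point
cluster_other = np.array([10.0, 11.0])  # nearest other cluster
x = cluster_own[0]

a = np.abs(cluster_own[1:] - x).mean()  # mean intra-cluster distance: 1.0
b = np.abs(cluster_other - x).mean()    # mean distance to other cluster: 10.5
s = (b - a) / max(a, b)                 # silhouette coefficient of x
print(round(s, 3))                      # 0.905, close to 1 => well clustered
```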
Q24. Which of the following is/are valid iterative strategies for treating missing values before clustering analysis?
A. Imputation with mean
B. Nearest Neighbor assignment
C. Imputation with Expectation Maximization algorithm
D. All of the above
Solution: (C)
All of the mentioned techniques are valid for treating missing values before clustering analysis, but only imputation with the EM algorithm is iterative in its functioning.
Q25. The K-Means algorithm has some limitations. One of its limitations is that it makes hard assignments of points to clusters (a point either completely belongs to a cluster or does not belong at all).
Note: Soft assignment can be considered as the probability of being assigned to each cluster: say K = 3 and for some point xn, p1 = 0.7, p2 = 0.2, p3 = 0.1.
Which of the following algorithm(s) allow soft assignments?
- Gaussian mixture models
- Fuzzy K-means
Options:
A. 1 only
B. 2 only
C. 1 and 2
D. None of these
Solution: (C)
Both Gaussian mixture models and Fuzzy K-means allow soft assignments.
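A toy sketch of what a soft assignment looks like: membership probabilities of a point under three assumed unit-variance 1-D Gaussian components with equal mixing weights (the means and the point are made up; this is not a full EM fit):

```python
import numpy as np

means = np.array([0.0, 4.0, 8.0])  # assumed component means
x = 1.0

# Unnormalized Gaussian likelihoods (unit variance, equal weights).
likelihoods = np.exp(-0.5 * (x - means) ** 2)

# Soft memberships: responsibilities summing to 1 across the clusters.
p = likelihoods / likelihoods.sum()
print(p.round(3))  # the point mostly, but not entirely, belongs to cluster 0
```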
Q26. Assume you want to cluster seven observations into three clusters using the K-Means clustering algorithm. After the first iteration, clusters C1, C2 and C3 have the following observations:
C1: {(2,2), (4,4), (6,6)}
C2: {(0,4), (4,0)}
C3: {(5,5), (9,9)}
What will be the cluster centroids if you want to proceed to the second iteration?
A. C1: (iv,4), C2: (2,2), C3: (7,7)
B. C1: (6,6), C2: (4,four), C3: (9,9)
C. C1: (2,ii), C2: (0,0), C3: (5,5)
D. None of these
Solution: (A)
Finding the centroid for the data points in cluster C1 = ((2+4+6)/3, (2+4+6)/3) = (4, 4)
Finding the centroid for the data points in cluster C2 = ((0+4)/2, (4+0)/2) = (2, 2)
Finding the centroid for the data points in cluster C3 = ((5+9)/2, (5+9)/2) = (7, 7)
Hence, C1: (4,4), C2: (2,2), C3: (7,7)
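The same centroid update can be checked in a couple of lines of NumPy, since each new centroid is simply the mean of its cluster's points:

```python
import numpy as np

# Observations assigned to each cluster after the first iteration.
C1 = np.array([[2, 2], [4, 4], [6, 6]])
C2 = np.array([[0, 4], [4, 0]])
C3 = np.array([[5, 5], [9, 9]])

# The K-Means update step: centroid = mean of the assigned points.
for name, cluster in [("C1", C1), ("C2", C2), ("C3", C3)]:
    print(name, cluster.mean(axis=0))  # C1 [4. 4.], C2 [2. 2.], C3 [7. 7.]
```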
Q27. Assume you want to cluster seven observations into three clusters using the K-Means clustering algorithm. After the first iteration, clusters C1, C2 and C3 have the following observations:
C1: {(2,2), (4,4), (6,6)}
C2: {(0,4), (4,0)}
C3: {(5,5), (9,9)}
What will be the Manhattan distance for observation (9, 9) from cluster centroid C1 in the second iteration?
A. 10
B. 5*sqrt(2)
C. 13*sqrt(2)
D. None of these
Solution: (A)
Manhattan distance between centroid C1, i.e. (4, 4), and (9, 9) = |9-4| + |9-4| = 10
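This calculation in code, where the Manhattan (L1) distance is the sum of absolute coordinate differences:

```python
import numpy as np

centroid_c1 = np.array([4, 4])  # centroid of C1 from the previous question
point = np.array([9, 9])

# Manhattan (L1) distance: sum of absolute coordinate differences.
manhattan = np.abs(point - centroid_c1).sum()
print(manhattan)  # 10
```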
Q28. If two variables V1 and V2 are used for clustering, which of the following are true for K-Means clustering with k = 3?
- If V1 and V2 have a correlation of 1, the cluster centroids will be in a straight line
- If V1 and V2 have a correlation of 0, the cluster centroids will be in a straight line
Options:
A. 1 only
B. 2 only
C. 1 and 2
D. None of the above
Solution: (A)
If the correlation between the variables V1 and V2 is 1, then all the data points lie on a straight line. Hence, all three cluster centroids will form a straight line as well.
Q29. Feature scaling is an important step before applying the K-Means algorithm. What is the reason behind this?
A. In distance calculation it gives the same weight to all features
B. You always get the same clusters whether or not you use feature scaling
C. In Manhattan distance it is an important step, but in Euclidean distance it is not
D. None of these
Solution: (A)
Feature scaling ensures that all the features get the same weight in the clustering analysis. Consider a scenario of clustering people based on their weights (in kg), with range 55-110, and heights (in feet), with range 5.6 to 6.4. In this case, the clusters produced without scaling can be very misleading, as the range of weight is much larger than that of height. Therefore, it's necessary to bring them to the same scale so that they have equal weight in the clustering result.
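Standardization (subtract the mean, divide by the standard deviation) is one common way to do this; a minimal sketch with made-up weight/height rows:

```python
import numpy as np

# Rows: [weight in kg, height in feet]. Raw Euclidean distance would be
# dominated by weight, whose range is far larger than height's.
data = np.array([[55.0, 5.6],
                 [110.0, 6.4],
                 [80.0, 6.0]])

# Standardize each feature: zero mean, unit standard deviation.
scaled = (data - data.mean(axis=0)) / data.std(axis=0)
print(scaled.std(axis=0))  # both features now contribute on the same scale
```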
Q30. Which of the following methods is used for finding the optimal number of clusters in the K-Means algorithm?
A. Elbow method
B. Manhattan method
C. Euclidean method
D. All of the above
E. None of these
Solution: (A)
Out of the given options, only the elbow method is used for finding the optimal number of clusters. The elbow method looks at the percentage of variance explained as a function of the number of clusters: one should choose a number of clusters such that adding another cluster doesn't give much better modeling of the data.
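A toy sketch of locating the elbow programmatically. The SSE values are hypothetical, and the "smallest drop-to-previous-drop ratio" rule is just one simple heuristic for spotting where the curve flattens:

```python
# Hypothetical within-cluster SSE for k = 1..7.
sse = [1000.0, 400.0, 150.0, 140.0, 133.0, 128.0, 125.0]

# How much SSE falls with each extra cluster.
drops = [sse[i] - sse[i + 1] for i in range(len(sse) - 1)]

# The elbow is where a drop collapses relative to the previous drop.
ratios = [drops[i + 1] / drops[i] for i in range(len(drops) - 1)]
elbow_k = ratios.index(min(ratios)) + 2  # +2: ratios[0] compares k=2 vs k=3
print(elbow_k)  # 3: beyond k=3 extra clusters barely reduce SSE
```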
Q31. What is true about K-Means clustering?
- K-Means is extremely sensitive to cluster center initialization
- Bad initialization can lead to poor convergence speed
- Bad initialization can lead to bad overall clustering
Options:
A. 1 and 3
B. 1 and 2
C. 2 and 3
D. 1, 2 and 3
Solution: (D)
All three of the given statements are true. K-Means is extremely sensitive to cluster center initialization. Bad initialization can lead to poor convergence speed as well as bad overall clustering.
Q32. Which of the following can be applied to get good results for the K-Means algorithm corresponding to global minima?
- Try running the algorithm for different centroid initializations
- Adjust the number of iterations
- Find the optimal number of clusters
Options:
A. 2 and 3
B. 1 and 3
C. 1 and 2
D. All of the above
Solution: (D)
All of these are standard practices used to obtain good clustering results.
Q33. What should be the best choice for the number of clusters based on the following results:
A. 5
B. 6
C. 14
D. Greater than 14
Solution: (B)
Based on the above results, the best choice for the number of clusters using the elbow method is 6.
Q34. What should be the best choice for the number of clusters based on the following results:
A. 2
B. 4
C. six
D. 8
Solution: (C)
Generally, a higher average silhouette coefficient indicates better clustering quality. In this plot, the optimal number of clusters for the grid cells in the study area appears to be 2, at which the value of the average silhouette coefficient is highest. However, the SSE of this clustering solution (k = 2) is too large. At k = 6, the SSE is much lower. In addition, the value of the average silhouette coefficient at k = 6 is also very high, only slightly lower than that at k = 2. Thus, the best choice is k = 6.
Q35. Which of the following sequences is correct for the K-Means algorithm using the Forgy method of initialization?
- Specify the number of clusters
- Assign cluster centroids randomly
- Assign each data point to the nearest cluster centroid
- Re-assign each point to the nearest cluster centroid
- Re-compute cluster centroids
Options:
A. 1, 2, 3, 5, 4
B. 1, 3, 2, 4, 5
C. 2, 1, 3, 4, 5
D. None of these
Solution: (A)
The methods used for initialization in K-Means are Forgy and Random Partition. The Forgy method randomly chooses k observations from the data set and uses these as the initial means. The Random Partition method first randomly assigns a cluster to each observation and then proceeds to the update step, thus computing each initial mean as the centroid of the cluster's randomly assigned points.
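The two initialization schemes can be sketched side by side in NumPy (made-up data; the label permutation is arranged so that no cluster starts empty):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 2))  # made-up dataset
k = 3

# Forgy: pick k distinct observations as the initial centroids.
forgy_centroids = X[rng.choice(len(X), size=k, replace=False)]

# Random Partition: give every point a random cluster label first,
# then take each cluster's mean as its initial centroid.
labels = rng.permutation(np.arange(len(X)) % k)  # every cluster non-empty
partition_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])

print(forgy_centroids.shape, partition_centroids.shape)  # (3, 2) (3, 2)
```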
Q36. If you are using multinomial mixture models with the expectation-maximization algorithm for clustering a set of data points into two clusters, which of the assumptions are important:
A. All the data points follow two Gaussian distributions
B. All the data points follow n Gaussian distributions (n > 2)
C. All the data points follow two multinomial distributions
D. All the data points follow n multinomial distributions (n > 2)
Solution: (C)
In the EM algorithm for clustering, it is essential that the number of clusters the data points are classified into equals the number of different distributions they are expected to be generated from, and the distributions must be of the same type.
Q37. Which of the following is/are not true about the centroid-based K-Means clustering algorithm and the distribution-based expectation-maximization clustering algorithm:
- Both start with random initializations
- Both are iterative algorithms
- Both have strong assumptions that the data points must fulfill
- Both are sensitive to outliers
- Expectation-maximization algorithm is a special case of K-Means
- Both require prior knowledge of the number of desired clusters
- The results produced by both are non-reproducible
Options:
A. 1 only
B. 5 only
C. 1 and 3
D. 6 and 7
E. 4, 6 and 7
F. None of the above
Solution: (B)
All of the above statements are true except the 5th: instead, K-Means is a special case of the EM algorithm in which only the centroids of the cluster distributions are calculated at each iteration.
Q38. Which of the following is/are not true about the DBSCAN clustering algorithm:
- For data points to be in a cluster, they must be within a distance threshold of a core point
- It has strong assumptions for the distribution of data points in the data space
- It has a substantially high time complexity of order O(n³)
- It does not require prior knowledge of the number of desired clusters
- It is robust to outliers
Options:
A. 1 only
B. 2 only
C. 4 only
D. 2 and 3
E. 1 and 5
F. 1, 3 and 5
Solution: (D)
- DBSCAN can form a cluster of any arbitrary shape and does not have strong assumptions for the distribution of data points in the data space.
- DBSCAN has a low time complexity of order O(n log n) only.
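DBSCAN's core-point idea, which is what makes it robust to outliers, can be sketched in a few lines (a toy 1-D dataset and a simplified check, not a full DBSCAN implementation):

```python
import numpy as np

# Minimal core-point check: a point is a core point if at least
# min_samples points (itself included) lie within distance eps of it.
X = np.array([[0.0], [0.5], [1.0], [10.0]])  # the last point is an outlier
eps, min_samples = 1.0, 3

def is_core(i):
    return (np.abs(X - X[i]).sum(axis=1) <= eps).sum() >= min_samples

# The three nearby points are core points; the outlier at 10.0 is not,
# so DBSCAN would label it as noise rather than distorting a cluster.
print([is_core(i) for i in range(len(X))])  # [True, True, True, False]
```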
Q39. Which of the following are the upper and lower bounds of the F-score?
A. [0,1]
B. (0,1)
C. [-1,one]
D. None of the above
Solution: (A)
The lowest and highest possible values of the F-score are 0 and 1, with 1 representing that every data point is assigned to the correct cluster and 0 representing that the precision and/or recall of the clustering analysis are 0. In clustering analysis, a high F-score is desired.
Q40. Following are the results observed for clustering 6000 data points into three clusters: A, B and C:
What is the F1-score with respect to cluster B?
A. 3
B. 4
C. 5
D. 6
Solution: (D)
Here,
True Positive, TP = 1200
True Negative, TN = 600 + 1600 = 2200
False Positive, FP = 1000 + 200 = 1200
False Negative, FN = 400 + 400 = 800
Therefore,
Precision = TP / (TP + FP) = 0.5
Recall = TP / (TP + FN) = 0.6
Hence,
F1 = 2 * (Precision * Recall) / (Precision + Recall) = 0.545 ≈ 0.55
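The arithmetic above can be double-checked in code:

```python
tp, fp, fn = 1200, 1200, 800  # counts for cluster B from the table above

precision = tp / (tp + fp)  # 0.5
recall = tp / (tp + fn)     # 0.6

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 3))  # 0.545
```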
End Notes
I hope you enjoyed taking the test and found the solutions helpful. The test focused on conceptual as well as practical knowledge of clustering fundamentals and its various techniques.
I tried to clear all your doubts through this article, but if we have missed out on something, then let us know in the comments below. Also, if you have any suggestions or improvements you think we should make in the next skill test, you can let us know by dropping your feedback in the comments section.
Learn, compete, hack and get hired!
Source: https://www.analyticsvidhya.com/blog/2017/02/test-data-scientist-clustering/