What is the difference between Data Analytics, Data Analysis, Data Mining, Data Science, Machine Learning, and Big Data?
I had been wanting to take a stab at this one for a few days, but it always looked like an enormous task, because the question packs in so many terms. In addition, this is a question that a lot of people have their eyes on, and a lot of others have already written elaborate answers.
Let me first re-order all the significant words:
Imagine that you want to become a data scientist and work in a big organization like Amazon, Intel, Google, FB, Apple and so on.
What would that look like?
- You would have to deal with big data, and you would have to write computer programs in SQL, Python, R, C++, Java, Scala, Ruby and so on, just to maintain big-data databases. You would be called a database manager.
- As an engineer working on process control, or someone wanting to streamline the operations of the company, you would perform data mining and data analysis. You may use simple software to do this, where you would only run a lot of code written by others, or you may write your own elaborate code in SQL, Python, or R, and you would be doing data mining, data cleaning, data analysis, modeling, predictive modeling and so on.
- All this will be called analytics. Several software packages exist to do this. One popular one is Tableau. Some others are JMP and SAS. A lot of people do everything online, where a SAP-based business intelligence setup can be used. Here, simple reporting can be done easily.
- Further, you would then be able to use machine learning to derive conclusions and come up with predictions wherever analytical answers are not possible. Think of analytical answers as [if/then] type computer programs, where all the input conditions are already known and only a few parameters change.
- Machine learning uses statistical analysis to partition data. An example would be this: read the comments written by various people on Yelp, and predict from the comments whether the person would have given a restaurant four stars or five stars.
- If that is not enough, you would be able to use deep learning as well. Deep learning is used to process data such as music files, pictures, and even text data such as natural language, where the data are enormous and their type is very diverse.
- You would use everything to your advantage: analytical solutions, partitioning data, a hacking mindset, automation by programming, reporting, deriving conclusions, making decisions, taking actions, and telling stories about your data.
- Read also:
- Rohit Malshe’s answer to How do I learn machine learning?
- Rohit Malshe’s answer to How should I begin learning Python?
- Rohit Malshe’s answer to What is deep learning? Why is this a growing trend in machine learning? Why not use SVMs?
- Rohit Malshe’s answer to Are ‘curated paths to a Data Science career’ on Coursera worth the money and time?
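To make the Yelp example above a little more concrete, here is a minimal sketch of predicting four versus five stars from review text. It assumes a tiny, made-up training set (`TRAIN` is invented for illustration, not real Yelp data) and uses a simple naive Bayes word-count model, which is only one of many classifiers you could pick for this job.

```python
import math
from collections import Counter, defaultdict

# Hypothetical training reviews (invented for illustration, not real Yelp data).
TRAIN = [
    ("absolutely perfect food and amazing service", 5),
    ("best dinner ever, perfect in every way", 5),
    ("amazing flavors, I loved everything", 5),
    ("good food but slow service", 4),
    ("nice place, a little pricey but good", 4),
    ("good overall, the wait was a bit long", 4),
]


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return "".join(c if c.isalpha() else " " for c in text.lower()).split()


def train(examples):
    """Count how often each word appears under each star label."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for text, label in examples:
        label_counts[label] += 1
        word_counts[label].update(tokenize(text))
    vocab = {word for counts in word_counts.values() for word in counts}
    return word_counts, label_counts, vocab


def predict(text, model):
    """Pick the label with the highest log-probability (add-one smoothing)."""
    word_counts, label_counts, vocab = model
    total_reviews = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_reviews in label_counts.items():
        score = math.log(n_reviews / total_reviews)  # prior for this label
        denom = sum(word_counts[label].values()) + len(vocab)
        for word in tokenize(text):
            if word in vocab:  # skip words never seen during training
                score += math.log((word_counts[label][word] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


model = train(TRAIN)
print(predict("perfect service and amazing food", model))
print(predict("good but the service was slow", model))
```

With real data you would need thousands of labeled reviews and a held-out test set to know whether the predictions can be trusted.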
In all seriousness, if you want elaborate documentation on all this, I would suggest you go ahead and read this McKinsey report to get a full understanding. I only extracted a few sections from it for convenience, because I only wished to add on top of someone else’s knowledge, and put these concepts together like a story so as to inspire people to think about this subject and start their own journeys.
I will answer a few questions step by step, and wherever possible, I will give a few pictures or plots to show you how things look.
McKinsey consultants! You are amazing, so if you read things written in this answer that were typed by you at some point in time, I give full credit to you.
- What do we mean by “big data”?
- “Big data” refers to datasets whose size is beyond the ability of typical database software tools to capture, store, manage, and analyze. This definition is intentionally subjective and incorporates a moving definition of how big a dataset needs to be in order to be considered big data, i.e., we need not define big data in terms of being larger than a certain number of terabytes (thousands of gigabytes). We assume that, as technology advances over time, the size of datasets that qualify as big data will also increase. Also note that the definition can vary by sector, depending on what kinds of software tools are commonly available and what sizes of datasets are common in a particular industry. With those caveats, big data in many sectors today will range from a few dozen terabytes to multiple petabytes (thousands of terabytes).
- What is a typical size of data I may have to deal with? Sometimes GBs, sometimes just a few MBs, sometimes as high as 1 TB. Sometimes there is hardly any complexity: the data may all be indicating the same thing. Sometimes the complexity can be very high: I might have a giant heap jam-packed with data and logs, which can be structured or unstructured.
- Think, for example, about Macy’s. There are hundreds of stores, selling thousands of items per day to millions of customers. Suppose Macy’s wants to derive a conclusion: should they diversify in boots, or should they diversify in women’s purses? How would they make this decision?
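As a rough illustration of a first step toward the boots-versus-purses question, the sketch below rolls transaction records up by category and compares revenue and margin. Everything in it is an assumption made for the example: the record layout, the `SALES` rows, and the helper name `summarize_by_category` are hypothetical, and a real decision would need far more data and far more analysis than a single summary table.

```python
from collections import defaultdict

# Hypothetical transaction records: (store_id, category, units, revenue, cost).
# All values are invented for illustration.
SALES = [
    ("S001", "boots", 3, 450.0, 270.0),
    ("S001", "purses", 2, 600.0, 300.0),
    ("S002", "boots", 5, 750.0, 450.0),
    ("S002", "purses", 1, 300.0, 150.0),
    ("S003", "purses", 4, 1200.0, 600.0),
]


def summarize_by_category(sales):
    """Aggregate units, revenue, and profit per category, then add the margin."""
    totals = defaultdict(lambda: {"units": 0, "revenue": 0.0, "profit": 0.0})
    for _store, category, units, revenue, cost in sales:
        row = totals[category]
        row["units"] += units
        row["revenue"] += revenue
        row["profit"] += revenue - cost
    for row in totals.values():
        row["margin"] = row["profit"] / row["revenue"]
    return dict(totals)


summary = summarize_by_category(SALES)
for category, row in sorted(summary.items()):
    print(f"{category}: units={row['units']} "
          f"revenue={row['revenue']:.0f} margin={row['margin']:.0%}")
```

At real scale the same roll-up would typically be a SQL `GROUP BY` over the sales tables rather than a Python loop.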
Let us now talk about analysis: this is a big part of being a data scientist.
- TECHNIQUES FOR ANALYZING BIG DATA
- There are many techniques that draw on disciplines such as statistics and computer science (particularly machine learning) that can be used to analyze datasets. This list is by no means exhaustive. Indeed, researchers continue to develop new techniques and improve on existing ones, particularly in response to the need to analyze new combinations of data.
- Also, note that not all of these techniques strictly require the use of big data; some of them can be applied effectively to smaller datasets (e.g., A/B testing, regression analysis). However, all of the techniques listed here can be applied to big data and, in general, larger and more diverse datasets can be used to generate more numerous and insightful results than smaller, less diverse ones.
- A/B testing. A technique in which a control group is compared with a variety of test groups in order to determine what treatments (i.e., changes) will improve a given objective variable, e.g., marketing response rate. This technique is also known as split testing or bucket testing. An example application is determining what copy text, layouts, images, or colors will improve conversion rates on an e-commerce Web site. Big data enables huge numbers of tests to be executed and analyzed, ensuring that groups are of sufficient size to detect meaningful (i.e., statistically significant) differences between the control and treatment groups (see statistics). When more than one variable is simultaneously manipulated in the treatment, the multivariate generalization of this technique, which applies statistical modeling, is often called “A/B/N” testing. What would an example look like?
- Imagine that Coke signs up with Facebook to work on marketing and sales. Facebook would place advertisements according to the customers. It can create several versions of an advertisement. Not all versions will suit every geography. Some will suit the USA, some will suit India. Some may suit Indians living in the USA. What Facebook can do is choose a subset of people from a massive pool and pass advertisements to them in their feed according to whether those people love food or not. For each advertisement, Facebook will collect the responses and accordingly determine which advertisement does better, and on a larger pool of people it will use the better one. Does data science let someone determine better what the response should be? Absolutely!
- Association rule learning. A set of techniques for discovering interesting relationships, i.e., “association rules,” among variables in large databases. These techniques consist of a variety of algorithms to generate and test possible rules. One application is market basket analysis, in which a retailer can determine which products are frequently bought together and use this information for marketing (a commonly cited example is the discovery that many supermarket shoppers who buy diapers also tend to buy beer).
- Classification. A set of techniques to identify the categories in which new data points belong, based on a training set containing data points that have already been categorized. One application is the prediction of segment-specific customer behavior (e.g., buying decisions, churn rate, consumption rate) where there is a clear hypothesis or objective outcome. These techniques are often described as supervised learning because of the existence of a training set; they stand in contrast to cluster analysis, a type of unsupervised learning.
- Cluster analysis. A statistical method for classifying objects that splits a diverse group into smaller groups of similar objects, whose characteristics of similarity are not known in advance. An example of cluster analysis is segmenting consumers into self-similar groups for targeted marketing. This is a type of unsupervised learning because training data are not used. This technique is in contrast to classification, a type of supervised learning.
- Crowdsourcing. A technique for collecting data submitted by a large group of people or community (i.e., the “crowd”) through an open call, usually through networked media such as the Web. This is a type of mass collaboration and an example of using Web 2.0.
- Data fusion and data integration. A set of techniques that integrate and analyze data from multiple sources in order to develop insights in ways that are more efficient and potentially more accurate than if they were developed by analyzing a single source of data.
- Data mining. A set of techniques to extract patterns from large datasets by combining methods from statistics and machine learning with database management. These techniques include association rule learning, cluster analysis, classification, and regression. Applications include mining customer data to determine segments most likely to respond to an offer, mining human resources data to identify characteristics of the most successful employees, or market basket analysis to model the purchase behavior of customers.
- Ensemble learning. Using multiple predictive models (each developed using statistics and/or machine learning) to obtain better predictive performance than could be obtained from any of the constituent models. This is a type of supervised learning.
- Genetic algorithms. A technique used for optimization that is inspired by the process of natural evolution or “survival of the fittest.” In this technique, potential solutions are encoded as “chromosomes” that can combine and mutate. These individual chromosomes are selected for survival within a modeled “environment” that determines the fitness or performance of each individual in the population. Often described as a type of “evolutionary algorithm,” these algorithms are well-suited for solving nonlinear problems. Examples of applications include improving job scheduling in manufacturing and optimizing the performance of an investment portfolio.
- Machine learning. A subspecialty of computer science (within a field historically called “artificial intelligence”) concerned with the design and development of algorithms that allow computers to evolve behaviors based on empirical data. A major focus of machine learning research is to automatically learn to recognize complex patterns and make intelligent decisions based on data. Natural language processing is an example of machine learning.
- Natural language processing (NLP). A set of techniques from a subspecialty of computer science (within a field historically called “artificial intelligence”) and linguistics that uses computer algorithms to analyze human (natural) language. Many NLP techniques are types of machine learning. One application of NLP is using sentiment analysis on social media to determine how prospective customers are reacting to a branding campaign. Data from social media, analyzed by natural language processing, can be combined with real-time sales data in order to determine what effect a marketing campaign is having on customer sentiment and purchasing behavior.
- Neural networks. Computational models, inspired by the structure and workings of biological neural networks (i.e., the cells and connections within a brain), that find patterns in data. Neural networks are well-suited for finding nonlinear patterns. They can be used for pattern recognition and optimization. Some neural network applications involve supervised learning and others involve unsupervised learning. Examples of applications include identifying high-value customers that are at risk of leaving a particular company and identifying fraudulent insurance claims.
- Network analysis. A set of techniques used to characterize relationships among discrete nodes in a graph or a network. In social network analysis, connections between individuals in a community or organization are analyzed, e.g., how information travels, or who has the most influence over whom. Examples of applications include identifying key opinion leaders to target for marketing, and identifying bottlenecks in enterprise information flows.
- Optimization. A portfolio of numerical techniques used to redesign complex systems and processes to improve their performance according to one or more objective measures (e.g., cost, speed, or reliability). Examples of applications include improving operational processes such as scheduling, routing, and floor layout, and making strategic decisions such as product range strategy, linked investment analysis, and R&D portfolio strategy. Genetic algorithms are an example of an optimization technique. Mixed integer programming is another.
- Pattern recognition. A set of machine learning techniques that assign some sort of output value (or label) to a given input value (or instance) according to a specific algorithm. Classification techniques are an example.
- Predictive modeling. A set of techniques in which a mathematical model is created or chosen to best predict the probability of an outcome. An example of an application in customer relationship management is the use of predictive models to estimate the likelihood that a customer will “churn” (i.e., change providers) or the likelihood that a customer can be cross-sold another product. Regression is one example of the many predictive modeling techniques.
- Regression. A set of statistical techniques to determine how the value of the dependent variable changes when one or more independent variables is modified. Often used for forecasting or prediction. Examples of applications include forecasting sales volumes based on various market and economic variables or determining what measurable manufacturing parameters most influence customer satisfaction. Used for data mining.
- Sentiment analysis. Application of natural language processing and other analytic techniques to identify and extract subjective information from source text material. Key aspects of these analyses include identifying the feature, aspect, or product about which a sentiment is being expressed, and determining the type, “polarity” (i.e., positive, negative, or neutral) and the degree and strength of the sentiment. Examples of applications include companies applying sentiment analysis to analyze social media (e.g., blogs, microblogs, and social networks) to determine how different customer segments and stakeholders are reacting to their products and actions.
- Signal processing. A set of techniques from electrical engineering and applied mathematics originally developed to analyze discrete and continuous signals, i.e., representations of analog physical quantities (even if represented digitally) such as radio signals, sounds, and images. This category includes techniques from signal detection theory, which quantifies the ability to discern between signal and noise. Sample applications include modeling for time series analysis or implementing data fusion to determine a more precise reading by combining data from a set of less precise data sources (i.e., extracting the signal from the noise). Signal processing techniques can be used to implement some types of data fusion. One example of an application is sensor data from the Internet of Things being combined to develop an integrated perspective on the performance of a complex distributed system such as an oil refinery.
- Spatial analysis. A set of techniques, some applied from statistics, which analyze the topological, geometric, or geographic properties encoded in a data set. Often the data for spatial analysis come from geographic information systems (GIS) that capture data including location information, e.g., addresses or latitude/longitude coordinates. Examples of applications include the incorporation of spatial data into spatial regressions (e.g., how is consumer willingness to purchase a product correlated with location?) or simulations (e.g., how would a manufacturing supply chain network perform with sites in different locations?).
- Statistics. The science of the collection, organization, and interpretation of data, including the design of surveys and experiments. Statistical techniques are often used to make judgments about what relationships between variables could have occurred by chance (the “null hypothesis”), and what relationships between variables likely result from some kind of underlying causal relationship (i.e., that are “statistically significant”). Statistical techniques are also used to reduce the likelihood of Type I errors (“false positives”) and Type II errors (“false negatives”). An example of an application is A/B testing to determine what types of marketing material will most increase revenue.
- Supervised learning. The set of machine learning techniques that infer a function or relationship from a set of training data. Examples include classification and support vector machines. This is different from unsupervised learning.
- Simulation. Modeling the behavior of complex systems, often used for forecasting, predicting and scenario planning. Monte Carlo simulations, for example, are a class of algorithms that rely on repeated random sampling, i.e., running thousands of simulations, each based on different assumptions. The result is a histogram that gives a probability distribution of outcomes. One application is assessing the likelihood of meeting financial targets given uncertainties about the success of various initiatives.
- Time series analysis. A set of techniques from both statistics and signal processing for analyzing sequences of data points, representing values at successive times, to extract meaningful characteristics from the data. Examples of time series analysis include the hourly value of a stock market index or the number of patients diagnosed with a given condition every day.
- Time series forecasting. Time series forecasting is the use of a model to predict future values of a time series based on known past values of the same or other series. Some of these techniques, e.g., structural modeling, decompose a series into trend, seasonal, and residual components, which can be useful for identifying cyclical patterns in the data. Examples of applications include forecasting sales figures, or predicting the number of people who will be diagnosed with an infectious disease.
- Unsupervised learning. A set of machine learning techniques that finds hidden structure in unlabeled data. Cluster analysis is an example of unsupervised learning (in contrast to supervised learning).
- Visualization. Techniques used for creating images, diagrams, or animations to communicate, understand, and improve the results of big data analyses. This extends to creating dashboards on web or desktop platforms.
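Two of the techniques in the list above fit in a few lines of code, so here is a minimal sketch of each. The first function applies a two-proportion z-test to an A/B test, which is one common way (an assumption on my part, not something the report prescribes) to check whether a difference between control and treatment is statistically significant. The second runs a Monte Carlo simulation to estimate the likelihood of meeting a financial target. All numbers, function names, and the list of initiatives are hypothetical.

```python
import math
import random


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for the difference in conversion rates."""
    rate_a, rate_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / std_err
    # Two-sided tail probability of the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


def probability_of_meeting_target(initiatives, target, trials=10_000, seed=0):
    """Estimate P(total payoff >= target) by repeated random sampling.

    Each initiative is a (probability_of_success, payoff) pair and is
    assumed to succeed or fail independently of the others.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    hits = 0
    for _ in range(trials):
        total = sum(payoff for p_success, payoff in initiatives
                    if rng.random() < p_success)
        if total >= target:
            hits += 1
    return hits / trials


# A/B test with hypothetical numbers: 10,000 users saw each advertisement.
z, p_value = two_proportion_z_test(conv_a=200, n_a=10_000, conv_b=260, n_b=10_000)
print(f"z = {z:.2f}, p-value = {p_value:.4f}")

# Monte Carlo with three hypothetical initiatives (payoffs in $ millions).
INITIATIVES = [(0.8, 5.0), (0.5, 3.0), (0.3, 4.0)]
estimate = probability_of_meeting_target(INITIATIVES, target=8.0)
print(f"Estimated chance of meeting the target: {estimate:.2f}")
```

For this small Monte Carlo case the exact answer can be worked out by hand (0.8 × (1 − 0.5 × 0.7) = 0.52), which makes a handy sanity check on the simulation. Collecting the sampled totals instead of just counting hits would give the histogram of outcomes mentioned under Simulation.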
I hope this somewhat elaborate write-up gives you some inspiration to hold on to. Stay blessed and stay inspired!