Report by Sudeshna Basu, Georgia State University Fall 1997
Over the past decade or so, businesses have accumulated huge amounts of data in large databases. These stockpiles mainly contain customer data, but the data's hidden value (the potential to predict business trends and customer behavior) has largely gone untapped.
To convert this potential value into strategic business information, many companies are turning to data mining, a growing technology based on a new generation of hardware and software. Data mining combines technologies including statistical analysis, visualization, decision trees, and neural networks to explore large amounts of data and discover relationships and patterns that shed light on business problems. In turn, companies can use these findings for more profitable, proactive decision making and competitive advantage. Although data mining tools have been around for many years, data mining became feasible in business only after advances in hardware and software technology.
Hardware advances, namely reduced storage costs and increased processor speed, paved the way for data mining's large-scale, intensive analyses. Inexpensive storage also encouraged businesses to collect data at a high level of detail, consolidated into records at the customer level.
Software advances continued data mining's evolution. With the advent of the data warehouse, companies could successfully analyze their massive databases as a coherent, standardized whole. To exploit the vast stores of data in the data warehouse, new exploratory and modeling tools, including data visualization and neural networks, were developed. Finally, data mining incorporated these tools into a systematic, iterative process.
SAS Institute understands the key issues and challenges facing businesses today, including the need to control costs, build customer relationships, and create and sustain a competitive advantage.
SAS Institute defines data mining as the process of selecting, exploring, and modeling large amounts of data to uncover previously unknown patterns for a business advantage. As a sophisticated decision support tool, data mining is a natural outgrowth of a business's investment in data warehousing. The data warehouse provides a stable, easily accessible repository of information to support dynamic business intelligence applications.
As the next step, organizations employ data mining to explore and model relationships in the large amounts of data in the data warehouse. Without the pool of validated and "scrubbed" data that a data warehouse provides, the data mining process requires considerable extra effort to pre-process data.
Although the data warehouse is an ideal source of data for data mining activities, the Internet can also serve as a data source. Companies can take data from the Internet, mine the data, and distribute the findings and models throughout the company via an intranet.
There's gold in your data, but you can't see it. It may be as simple (and wealth-producing) as the realization that baby-food buyers are probably also diaper purchasers. It may be as profound as a new law of nature. But no human who has looked at your data has seen this hidden gold. How can you find it?
Data mining lets the power of computers do the work of sifting through your vast data stores. Tireless and relentless searching can find the tiny nugget of gold in a mountain of data slag.
In "The Data Gold Rush," Sara Reese Hedberg shows the already broad variety of uses for the relatively young practice of data mining. From analyzing customer purchases to analyzing Supreme Court decisions, from discovering patterns in health care to discovering galaxies, data mining has an enormous breadth of applications. Large corporations are rushing to realize the potential payoffs of data mining, both in the data itself and in marketing their proprietary tools.
In "A Data Miner's Tools," Karen Watterson explains the three categories of software used to perform data mining. Query-and-reporting tools, in vastly simplified and easier-to-use forms, require close human direction and data laid out in databases or other special formats. Multidimensional analysis (MDA) tools demand less human guidance but still need data in special forms. Intelligent agents are virtually autonomous, are capable of making their own observations and conclusions, and can handle data as free-form as paragraphs of text.
"Data Mining Dynamite" by Cheryl D. Krivda shows how to facilitate the data-mining process. Data is processed far faster after it has been cleansed of unnecessary fields and stored in more convenient forms. Housing data in data warehouses reduces the load on production mainframes and supports client/server analysis. Parallel computing speeds the search process with multiple simultaneous queries. And any activity handling this volume of data requires consideration of physical storage options.
In the short term, data mining will yield profitable, if mundane, business-related results. Micro-marketing campaigns will explore new niches. Advertising will target potential customers with new precision.
In the medium term, data mining may become as common and easy to use as e-mail. We may direct our tools to find the best airfare to the Grand Canyon, root out a phone number for a long-lost classmate, or find the best prices on lawn mowers. The software will figure out where to look, how to evaluate what it finds, and when to quit. Our knowledge helpers may become as indispensable as the telephone.
But it's the long-term prospects of data mining that are truly breathtaking. Imagine intelligent agents being turned loose on medical-research data or on subatomic-particle information. Computers may reveal new treatments for diseases or new insights into the nature of the universe. We may well see the day when the Nobel prize for a great discovery is awarded to a search algorithm.
The amount of information stored in databases is exploding. From zillions of point-of-sale transactions and credit card purchases to pixel-by-pixel images of galaxies, databases are now measured in gigabytes and terabytes. In today's fiercely competitive business environment, companies need to rapidly turn those terabytes of raw data into significant insights to guide their marketing, investment, and management strategies.
It would take many lifetimes for an analyst to pore over two million books, the equivalent of a terabyte, to glean significant trends. But analysts have to. For example, Wal-Mart, the chain of over 2,000 retail stores, uploads 20 million point-of-sale transactions every day to an AT&T massively parallel system with 483 processors running a centralized database. At corporate headquarters, they want to know trends down to the last Q-Tip.
Fortunately, computer technologies are now being developed to assist analysts in their work. Data mining (DM), or knowledge discovery, is the computer-assisted process of digging through and analyzing enormous sets of data and then extracting the meaning of the data nuggets. DM is being used both to describe past trends and to predict future trends.
Mining and Refining Data
Experts involved in significant DM efforts agree that the DM process must start with the business problem. Since DM really provides a platform or workbench for the analyst, understanding the job of the analyst logically comes first. Once the DM system developer understands the analyst's job, the next step is to understand the data sources that the analyst uses and the experience and knowledge the analyst brings to the evaluation.
The DM process generally starts with collecting and cleaning information, then storing it, typically in some type of data warehouse or datamart (see figure below). But in some of the more advanced DM work, such as that at AT&T Bell Labs, advanced knowledge-representation tools can logically describe the contents of databases themselves, then use this mapping as a meta-layer to the data. Data sources are typically flat files of point-of-sale transactions and databases of all flavors. There are experiments underway in mining other data sources, such as IBM's project in Paris to analyze text straight off the newswires.
THE DATA MINING PROCESS
DM tools search for patterns in data. This search can be performed automatically by the system (a bottom-up dredging of raw facts to discover connections) or interactively with the analyst asking questions (a top-down search to test hypotheses). A range of computing tools, such as neural networks, rule-based systems, case-based reasoning, machine learning, and statistical programs, can be applied to a problem either alone or in combination.
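The two search styles can be illustrated with a minimal Python sketch. It is only an illustration, not any vendor's method: the basket data, the function names, and the 0.5 support threshold are all hypothetical, and the baby-food/diaper pairing simply borrows the example used earlier in this report. The top-down function tests one analyst-supplied hypothesis; the bottom-up function counts every co-occurring pair and reports the frequent ones without being told what to look for.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy baskets; real point-of-sale data would be far larger.
BASKETS = [
    {"baby food", "diapers", "milk"},
    {"baby food", "diapers"},
    {"milk", "bread"},
    {"baby food", "diapers", "bread"},
    {"bread", "milk", "diapers"},
]


def support(itemset, baskets):
    """Fraction of baskets that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in baskets) / len(baskets)


def confidence(antecedent, consequent, baskets):
    """Top-down: test one hypothesis, 'buyers of A also buy B'."""
    base = support(antecedent, baskets)
    if base == 0:
        return 0.0
    return support(set(antecedent) | set(consequent), baskets) / base


def frequent_pairs(baskets, min_support):
    """Bottom-up: count every co-occurring pair, keep the frequent ones."""
    counts = Counter()
    for basket in baskets:
        counts.update(combinations(sorted(basket), 2))
    total = len(baskets)
    return {pair: n / total for pair, n in counts.items() if n / total >= min_support}


if __name__ == "__main__":
    print(confidence({"baby food"}, {"diapers"}, BASKETS))
    print(frequent_pairs(BASKETS, min_support=0.5))
```

Counting pairs exhaustively is workable only for small item catalogs; it is shown here to make the bottom-up idea concrete, not as a scalable algorithm.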
Typically with DM, the search process is iterative: as analysts review the output, they form a new set of questions to refine the search or elaborate on some aspect of the findings. Once the iterative search process is complete, the data-mining system generates a report of its findings. It is then the job of humans to interpret the results of the mining process and to take action based on those findings.
AT&T, A.C. Nielsen, and American Express are among the growing ranks of companies implementing DM techniques for sales and marketing. These systems are crunching through terabytes of point-of-sale data to aid analysts in understanding consumer behavior and promotional strategies. Why? To increase profitability, of course.
Similarly, financial analysts are plowing through vast sets of financial records, data feeds, and other information sources in order to make investment decisions. Health-care organizations are examining medical records to understand trends of the past; they hope this information can help reduce their costs in the future. Major corporations such as General Motors, GTE, Lockheed, Microsoft, and IBM all have R&D groups working on proprietary advanced DM techniques and applications.
Hardware and software vendors are extolling the DM capabilities of their products, whether they have true DM capabilities or not. This hype cloud is creating much confusion about data mining. In reality, data mining is the process of sifting through vast amounts of information in order to extract meaning and discover new knowledge.
It sounds simple, but the task of data mining has quickly overwhelmed traditional query-and-report methods of data analysis, creating the need for new tools to analyze databases and data warehouses intelligently. The products now offered for DM range from on-line analytical processing (OLAP) tools, such as Essbase (Arbor Software) and DSS Agent (MicroStrategy), to DM tools that include some AI techniques, such as IDIS (Information DIscovery System, from IntelligenceWare) and the Database Mining Workstation (HNC Software), to the new vertically targeted advanced DM tools, such as those from AT&T Global Information Solutions.
Many people argue that the OLAP tools are not "true" mining tools; they're fancy query tools, they say. Since these programs perform sophisticated data access and analysis by rolling up numbers along multiple dimensions, some analysts still include them in the category of top-down mining tools. The market has yet to see much in the way of more-advanced mining tools, although the spigot is being turned on by application-specific DM tools from AT&T, Lockheed, and GTE.
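"Rolling up numbers along multiple dimensions" can be made concrete with a short Python sketch. It does not reproduce how Essbase or DSS Agent work internally; the fact table, the dimension names, and the roll_up function are hypothetical and exist only to show the idea of summing a measure over the dimensions the analyst chooses to collapse.

```python
from collections import defaultdict

# Hypothetical sales facts: (region, product, quarter, units_sold).
FACTS = [
    ("East", "diapers", "Q1", 120),
    ("East", "diapers", "Q2", 150),
    ("East", "baby food", "Q1", 80),
    ("West", "diapers", "Q1", 90),
    ("West", "baby food", "Q2", 60),
]
DIMENSIONS = ("region", "product", "quarter")


def roll_up(facts, keep):
    """Sum units_sold over every dimension that is not named in keep."""
    positions = [DIMENSIONS.index(name) for name in keep]
    totals = defaultdict(int)
    for row in facts:
        key = tuple(row[i] for i in positions)
        totals[key] += row[-1]  # the last column is the measure
    return dict(totals)


if __name__ == "__main__":
    print(roll_up(FACTS, ("region",)))             # totals per region
    print(roll_up(FACTS, ("product", "quarter")))  # totals per product and quarter
    print(roll_up(FACTS, ()))                      # grand total
```

Each call answers a question the analyst has already framed, which is why such tools are often classed as top-down: the program aggregates, but the human decides which view to request.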
One major DM trend is the move toward powerful application-specific mining tools. "There is a trade-off in the generality of data-mining tools and ease of use," observes Gregory Piatetsky-Shapiro, principal investigator of the Knowledge Discovery in Databases Project at GTE Laboratories. "General tools are good for those who know how to use them, but they really require lots of knowledge to use them."
AT&T, for example, recently introduced Sales & Marketing Solution Packs to mine data warehouses. They're tailored to vertical markets in retail, financial, communications, consumer-goods manufacturing, transportation, and government. These programs provide about 70 percent of the solution, with final tailoring required to fit the individual client's needs, AT&T says. Complete with AT&T parallel hardware, software, and some services, Solution Packs start at around $250,000.
Both GTE and Lockheed Martin may soon follow suit. GTE is already entertaining proposals to turn its Health-KEFIR (KEy FIndings Reporter) into a commercial product. The Artificial Intelligence Research group at Lockheed Martin has been investigating and developing DM tools for the past ten years. Recently, the Lockheed group built an internal application-development tool, called Recon, that generalizes their DM techniques, then applied it to application-specific problems. A beta version of the first vertical packages, for finance and marketing, will be available in 1996. The system has an open architecture, running on Unix platforms and massively parallel supercomputers. It interfaces with existing relational database management systems, financial databases, proprietary databases, data feeds, spreadsheets, and ASCII files.
In a similar vein, several neural network tools have been customized. Customer Insight Co., for example, has built an interface to link its Analytix marketing software with HNC Software's neural network-based Database Mining Workstation, creating a marketing DM hybrid. HNC Software's Falcon detects credit-card fraud; according to HNC, the program is watching millions of charge accounts.
Invasion of the Data Snatchers
The need for DM tools is growing as fast as data stores swell. More-sophisticated DM products are beginning to appear that perform bottom-up as well as top-down mining. The day is probably not too far off when intelligent agent technology will be harnessed for the mining of vast public on-line sources, traversing the Internet, searching for information, and presenting it to the human user. Microelectronics and Computer Technology Corp. (MCC, Austin, TX) has been pioneering work in this area, developing a platform, called Carnot, for its consortium members. Carnot-based agents have been successfully applied to both top-down and bottom-up DM of distributed heterogeneous databases at Eastman Chemical.
"Data mining is evolving from answering questions about what has happened and why it happened," observes Mark Ahrens, director of custom software sales at A.C. Nielsen. "The next generation of DM is focusing on answering the question `How can I fix it?' and making very specific recommendations. That's our focus now, our Holy Grail." Meanwhile, the gold rush is on.
Data mining is the search for relationships and global patterns that exist in large databases but are `hidden' among the vast amounts of data, such as a relationship between patient data and their medical diagnosis. These relationships represent valuable knowledge about the database and the objects in the database and, if the database is a faithful mirror, about the real world registered by the database. One of the main problems for data mining is that the number of possible relationships is very large, which rules out finding the correct ones by simply validating each of them. Hence, we need intelligent search strategies, as taken from the area of machine learning. Another important problem is that information in data objects is often corrupted or missing. Hence, statistical techniques should be applied to estimate the reliability of the discovered relationships.
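Both problems can be sketched in a few lines of Python. The sketch rests on stated assumptions rather than on any method named in this report: relationships are restricted to simple "if X then Y" rules over yes/no attributes, records with missing values are just skipped, and reliability is estimated with a Wilson score interval, which is one common statistical choice among several. The patient records and field names are hypothetical.

```python
import math


def candidate_rule_count(n_attributes):
    """Number of rules X -> Y with X and Y disjoint, non-empty attribute sets."""
    return 3 ** n_attributes - 2 ** (n_attributes + 1) + 1


def wilson_interval(successes, trials, z=1.96):
    """Approximate 95% interval for a proportion (z = 1.96)."""
    if trials == 0:
        return (0.0, 1.0)  # no evidence at all
    p = successes / trials
    denom = 1 + z * z / trials
    centre = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return (centre - half, centre + half)


def rule_reliability(records, antecedent, consequent):
    """Count support for 'antecedent -> consequent', skipping missing values (None)."""
    usable = [r for r in records
              if r.get(antecedent) is not None and r.get(consequent) is not None]
    matching = [r for r in usable if r[antecedent]]
    hits = sum(1 for r in matching if r[consequent])
    return hits, len(matching), wilson_interval(hits, len(matching))


if __name__ == "__main__":
    print(candidate_rule_count(10))  # rules over only ten attributes
    # Hypothetical patient records; None marks a missing field.
    patients = [
        {"smoker": True, "diagnosis": True},
        {"smoker": True, "diagnosis": None},
        {"smoker": True, "diagnosis": False},
        {"smoker": False, "diagnosis": False},
        {"smoker": None, "diagnosis": True},
    ]
    print(rule_reliability(patients, "smoker", "diagnosis"))
```

The count grows exponentially with the number of attributes, which is why exhaustive validation is impractical, and the interval widens as usable records become scarce, which is how missing data shows up as reduced confidence in a discovered rule.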
TOOLS AND TECHNOLOGIES
Data visualization software is one of the most versatile tools for data mining exploration. It enables you to visually interpret complex patterns in multidimensional data. By viewing data summarized in multiple graphical forms and dimensions, you can uncover trends and spot outliers intuitively and immediately.
In the data mining process, visualization tools help you explore data before modeling and verify the results of other data mining techniques. Visualization tools are particularly useful for detecting patterns found in only small areas of the overall data.