Things you must know about Data Mining research
Big data is commonly defined by the “4 Vs”: volume (lots of data), variety (highly diverse data), velocity (changing very fast), and veracity (hard to fully validate). Data mining is an exploratory data-analytic process that detects interesting, novel patterns within one or more (usually large) data sets. It employs a variety of techniques, including machine learning and standard multivariate statistical methods.
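As a toy illustration of the kind of pattern detection described above, the sketch below (plain Python, with hypothetical shopping-basket data) counts which item pairs co-occur across transactions, a first step in classic association-rule mining:

```python
from itertools import combinations
from collections import Counter

def frequent_pairs(transactions, min_support=2):
    """Return item pairs that co-occur in at least `min_support` transactions."""
    counts = Counter()
    for items in transactions:
        # De-duplicate and sort so ("bread", "milk") == ("milk", "bread")
        for pair in combinations(sorted(set(items)), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_support}

# Hypothetical transaction data
baskets = [
    ["bread", "milk", "eggs"],
    ["bread", "milk"],
    ["milk", "eggs"],
    ["bread", "butter"],
]
print(frequent_pairs(baskets))  # → {('bread', 'milk'): 2, ('eggs', 'milk'): 2}
```

Real data-mining workloads apply the same idea at far larger scale, with thresholds tuned so that only patterns that are both frequent and surprising surface for human review.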
We are living in an era of big data, in which new data is generated every second. Such data poses major challenges to organizations, especially when it is highly varied and complex in structure (unstructured or semi-structured): indexing, sorting, searching, analyzing and visualizing it all become difficult.
What is Data Mining Research?
Data mining research is one of the key research areas that the Open-Source Research Collaboration, or OpenSourceResearch (OSRC), will be exploring. OSRC is an independent international research organisation that aims to apply information technologies in healthcare research. It offers education and training to researchers, with a special focus on those from developing countries, and serves as an incubator of healthcare innovations. Big data mining requires good judgment: trying something (including a new data source) in pilot mode, carefully quantifying investment versus benefit, and cutting your losses and walking away in case of failure.
Borderline-quality data may be salvaged through data-cleansing methods, but these methods must be algorithmic, that is, implemented as computer programs, in the hope that their false-positive and false-negative error rates are modest. There are not enough resources to curate data manually: data arrives too fast for that, and there is far too much of it. Consequently, data researchers often settle for data that is 95% accurate, or even 90% accurate. The effort required to move from 95% to 99% accuracy is disproportionate to the gain in quality, and moving to 99.9% accuracy is more challenging still.
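A minimal sketch of what an algorithmic cleanser might look like, assuming hypothetical field names and validation rules (the rules themselves can misclassify records, which is exactly where the false-positive and false-negative rates mentioned above come from):

```python
import re

# Hypothetical rule set: each rule returns True if the field value is plausible.
RULES = {
    "age": lambda v: v.isdigit() and 0 < int(v) < 120,
    "email": lambda v: re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", v) is not None,
}

def cleanse(records):
    """Split records into those passing every rule and those failing any rule."""
    clean, rejected = [], []
    for rec in records:
        ok = all(rule(rec.get(field, "")) for field, rule in RULES.items())
        (clean if ok else rejected).append(rec)
    return clean, rejected

records = [
    {"age": "34", "email": "a@b.com"},       # passes both rules
    {"age": "200", "email": "a@b.com"},      # implausible age
    {"age": "29", "email": "not-an-email"},  # malformed email
]
clean, rejected = cleanse(records)
print(len(clean), len(rejected))  # → 1 2
```

In practice, the error rates of such rules are estimated by hand-checking a small random sample of the cleansed output; pushing those rates from roughly 5% toward 1% typically means writing and maintaining many more, increasingly subtle rules.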
Almost 200,000 datasets from 170 outfits have been posted on the data.gov website. Nearly 70 other countries have also made their data available: mostly rich, well-governed ones, but also a few that are not, such as India. Open Knowledge, a London-based group, estimates that over 1m datasets have been published on open-data portals using its CKAN software, developed in 2010.
Given the astonishing scale of the data deluge, it is reasonable to ask why more has not been achieved. There are four answers. First, the data that has been made available is often useless. Second, the data engineers and entrepreneurs who might turn it all into useful, profitable products find it hard to navigate. Third, too few people are capable of mining data for insights or putting the results to good use. Finally, it has been hard to overcome anxieties about privacy.
Open-data activists have joined forces with bureaucrats and entrepreneurs to sort out these problems. The thorniest may be privacy. Governments rushing to release individual-level data such as tax, medical or education records are “walking into a massive minefield”. Yet such data is among the most valuable: it can boost, for example, precision medicine, which tailors treatment to each patient. Possible solutions exist and are starting to work. Growing amounts of data are being put to good use, and the future will see increasing demand for skilled people who can work with big data.
Research skills resources
Research skills resources are instrumental in educating and training researchers, as well as officials, activists and journalists. Matchmaking events that connect data custodians with analysts, coders and other experts are becoming more common. Data users learn which sources are best; officials learn how to make them useful. Open-data “hackathons” now attract hundreds of volunteers and budding entrepreneurs apiece. Enthusiasts have declared an International Open Data Day, with events held in over 200 cities on February 21st. Investors are flocking to such events in growing numbers, increasing the chances that bright ideas turn into successful businesses.
Advocates such as Open Knowledge list the most valuable datasets and the features that make data truly open—such as an open licence or a machine-readable format. Many governments use these lists when they decide which datasets to release. A crowd of non-profit organizations are trying to improve people’s data-handling skills. They are publishing handbooks, organizing training and coming up with tools that generate easy-to-understand visual data summaries.
Scientific progress is harder to measure than economic progress. But one mark of it is the number of patents produced, especially relative to investment in research and development. If it has become cheaper to produce new inventions, the suggestion is that we are using our information wisely and forging it into knowledge. If it is becoming more expensive, this suggests we are seeing spurious signals in the noise and wasting our time on false leads. The OpenSourceResearch organization was founded to improve healthcare research through the implementation of IT solutions.
- The Economist, Nov 21st 2015, “Out of the box: The open-data revolution has not lived up to expectations. But it is only getting started.”
- Clinical Research Computing, edited by Prakash Nadkarni. Chapter 10: Core Technologies: Data Mining and “Big Data”.