Big Data: A challenge for massive data analysis systems

PentaGuy
Blogger

Designing and deploying Big Data analysis systems is not a trivial task. By one definition, Big Data is data that exceeds the capacity of existing hardware and software platforms. Handling it therefore requires new platforms, which in turn require new infrastructures and models to tackle the broad spectrum of Big Data challenges. Recent studies [1], [2], [3] have emphasised the potential obstacles to the growth of Big Data applications.

In this article, I will try to classify these challenges into three categories: data collection and management, data analysis, and system-related issues.

Data collection and management

Data collection and management involve processing large amounts of heterogeneous and complex data. The following challenges need to be addressed:

  • Data representation: many datasets are heterogeneous in type, structure, semantics, organisation, granularity and accessibility. A coherent data representation should reflect the structure, hierarchy and diversity of the data, and an integration technique should be designed to enable efficient operations across different datasets.
  • Redundancy reduction and data compression: as a general rule, raw datasets contain a high level of redundancy. Reducing redundancy and compressing data without sacrificing potential value are efficient ways to lower the overhead of the entire system.
  • Data lifecycle management: pervasive sensing and computing are generating data at an unprecedented rate and scale, outpacing the progress of storage system technologies. As a result, the main challenge for existing storage systems is hosting these large amounts of data. Generally speaking, the value hidden in Big Data depends on data freshness. Therefore, a data importance principle tied to analytical value should be developed to decide which data should be stored and which should be discarded.
  • Data privacy and security: with the proliferation of online services and mobile phones, privacy and security concerns about access to and processing of personal data are constantly growing. It is thus essential to understand what privacy support should be offered at the platform level to eliminate privacy leaks while still allowing the data to be analysed in different ways.
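
As a rough illustration of the redundancy reduction and compression item in the list above, here is a minimal Python sketch. It makes two assumptions that are not part of the studies cited here: records are byte strings without newlines, and "redundancy" means exact duplicates only. The function names and the sample sensor readings are hypothetical. Both steps are lossless for distinct records, so no potential value is sacrificed.

```python
import hashlib
import zlib


def deduplicate(records):
    """Drop exact duplicate records, keeping the first occurrence of each."""
    seen = set()
    unique = []
    for record in records:
        # Store a fixed-size digest rather than the full record.
        digest = hashlib.sha256(record).digest()
        if digest not in seen:
            seen.add(digest)
            unique.append(record)
    return unique


def compress(records):
    """Join records with newlines and compress them losslessly with zlib."""
    return zlib.compress(b"\n".join(records), level=9)


def decompress(blob):
    """Invert compress(): recover the list of records."""
    return zlib.decompress(blob).split(b"\n")


# Hypothetical sensor readings with heavy repetition.
raw = [b"sensor=42;temp=21.5", b"sensor=42;temp=21.5", b"sensor=7;temp=19.0"] * 1000

unique = deduplicate(raw)
blob = compress(unique)
restored = decompress(blob)

print(f"raw records: {len(raw)}, unique records: {len(unique)}")
print(f"raw bytes: {sum(len(r) for r in raw)}, stored bytes: {len(blob)}")
```

Real systems usually go further (near-duplicate detection, columnar encodings, domain-specific codecs), but the trade-off is the same: less data to store and move, at the cost of some extra computation.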

Data analysis

Progress in analysing large amounts of data, including interpretation, modelling, prediction and simulation, will soon have a significant impact. However, the massive volume of data, the heterogeneous data structures and the diversity of applications pose enormous challenges:

  • Approximate analytics: as datasets grow and real-time requirements become stricter, analysing an entire dataset becomes increasingly difficult. One way to address this is to provide approximate results, for instance by means of approximate queries. The notion of approximation has two dimensions: the accuracy of the result and the groups omitted from the output.
  • Linking of social media: social media have unique properties, such as vastness, statistical redundancy and the availability of user feedback. Different data extraction techniques have been successfully used to identify references in social media to specific product names, places or persons on websites. By connecting inter-domain data with social media, applications can reach high accuracy levels and distinct points of view.
  • Deep analytics: one of the most exciting things about Big Data is the prospect of gaining new insights. Sophisticated analytical technologies, such as Machine Learning, are necessary for unlocking such insights. However, using these analysis tools effectively requires a good command of probability and statistics. The potential pillars of privacy and security mechanisms are mandatory access control, secure communication, multi-granularity access control, privacy-aware data exploration and analysis, and secure storage and management.
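
The two dimensions of approximation mentioned in the list above (result accuracy and omitted groups) can be seen in a small Python sketch. It assumes uniform random sampling as the approximation technique, which is only one of several options. The function name, the 1% sampling fraction and the synthetic data are hypothetical.

```python
import random
from collections import Counter


def approximate_counts(rows, fraction, seed=0):
    """Estimate per-group counts from a uniform random sample of the rows."""
    rng = random.Random(seed)
    sample_size = max(1, int(len(rows) * fraction))
    sample = rng.sample(rows, sample_size)
    # Scale the sampled counts back up to the size of the full dataset.
    scale = len(rows) / sample_size
    return {group: count * scale for group, count in Counter(sample).items()}


# Hypothetical data: two common groups and one rare group.
rows = ["a"] * 60_000 + ["b"] * 39_990 + ["rare"] * 10

exact = Counter(rows)
approx = approximate_counts(rows, fraction=0.01)

for group, true_count in exact.items():
    estimate = approx.get(group)
    if estimate is None:
        # Second dimension: the group never made it into the sample.
        print(f"{group}: exact={true_count}, omitted from the approximate output")
    else:
        # First dimension: how far the estimate is from the exact result.
        error = abs(estimate - true_count) / true_count
        print(f"{group}: exact={true_count}, estimate={estimate:.0f}, relative error={error:.1%}")
```

Only 1% of the rows are scanned, so the common groups are estimated with a small error, while a group as rare as the third one will often be missing from the output altogether.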

System issues

  • Energy management: the energy consumption of large-scale information systems has become one of the main economic and environmental issues of today’s IT world. As data volumes and analytical demands grow, data transmission, storage and processing will inevitably consume more and more electrical energy.
  • Scalability: a Big Data analytics system must support very large datasets. All components of a Big Data system must be able to scale in order to process ever larger and more complex datasets.
  • Cooperation: Big Data analysis is an interdisciplinary field of research that requires experts from different fields to cooperate in order to harvest the hidden potential of Big Data. A Big Data cyberinfrastructure is necessary to enable scientists and engineers to access various types of data, apply their expertise and cooperate to achieve the analytical objectives.

My next article will focus on data collection via various existing data sources.

[1] D. Agrawal et al., “Challenges and opportunities with big data: A community white paper developed by leading researchers across the United States,” Computing Research Association, CRA White Paper, Feb. 2012.

[2] A. Labrinidis and H. V. Jagadish, “Challenges and opportunities with big data,” Proc. VLDB Endowment, vol. 5, no. 12, pp. 2032–2033, Aug. 2012.

[3] S. Chaudhuri, U. Dayal, and V. Narasayya, “An overview of business intelligence technology,” Commun. ACM, vol. 54, no. 8, pp. 88–98, 2011.

