Urban Data Challenges

Current Big Data technologies, including new big data architectures as well as scalable knowledge discovery and machine learning algorithms, are already very efficient at managing massive streams of information. However, despite the enormous recent progress in the Big Data space, current approaches have severe limitations when managing urban data streams. Veracity and variety are the main impediments, and are therefore the main topics of current work.

Considering data sources in concert is an important factor in improving our understanding of the data; however, it cannot fully address the inherently noisy nature of the data, or the fact that in general we lack reliable information to label or classify the data. The size and complexity of the datasets make reliance on manual resources impractical; however, having ground truth information is very important for any algorithm that attempts to improve the citizen experience. To address this problem we introduce novel Visual Analytics and Crowdsourcing techniques that optimize the efficiency with which user input can be incorporated into the analysis tasks. The following list summarizes the challenges that have to be met in order to effectively exploit urban data.

Diversity, Multi-Modality: Urban streams are diverse and heterogeneous not only in content, but also in format and degree of granularity, with different scales in space and time, from sources with different levels of trust.
Complementarity: Different data sources provide different views of events or aspects of the urban environment. To construct the full picture, multiple streams must be combined.
Lack of labels and noise: This is a major concern, since the lack of ground truth knowledge hinders many machine learning and data mining algorithms and makes the data difficult to understand. Because labeled data is difficult and expensive to provide, provisions for addressing this problem must be made.
Constraints and cost: Urban streams are often constrained, meaning that access to them may be limited. In these cases, components that enable the efficient selection of data are required. Different streams have different costs for collection and processing. For any analysis task, potential tradeoffs (for example, expensive data sources might be more accurate than cheaper ones) must be explored. In addition, urban data have important privacy constraints. All these constraints must be integrated into the developed techniques.
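
To make the cost and accuracy tradeoff in the last item more concrete, the following is a minimal sketch of one possible selection strategy: a greedy choice of streams by accuracy per unit cost under a fixed budget. It is an illustration only, not the technique developed in this work. The stream names, cost units, accuracy estimates, and the `select_streams` helper are all hypothetical assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stream:
    name: str
    cost: float      # hypothetical cost units per collection window
    accuracy: float  # hypothetical estimated accuracy in [0, 1]


def select_streams(streams, budget):
    """Greedy heuristic: take streams in order of accuracy per unit cost
    for as long as the remaining budget allows. Not guaranteed optimal."""
    chosen = []
    remaining = budget
    ranked = sorted(streams, key=lambda s: s.accuracy / s.cost, reverse=True)
    for s in ranked:
        if s.cost <= remaining:
            chosen.append(s)
            remaining -= s.cost
    return chosen


# Illustrative values only; none of these figures come from real measurements.
streams = [
    Stream("traffic_sensors", cost=5.0, accuracy=0.9),
    Stream("social_media", cost=1.0, accuracy=0.5),
    Stream("citizen_reports", cost=2.0, accuracy=0.7),
]

selected = select_streams(streams, budget=4.0)
print([s.name for s in selected])
```

With a budget of 4.0 the most accurate but most expensive stream is skipped in favor of two cheaper ones. A realistic component would also have to account for the privacy constraints and access limits mentioned above, which this sketch ignores.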

Acknowledgement