Topics at the Dublin Lab
Data privacy for health care
Organizations, public bodies, institutes and companies gather enormous volumes of data that contain personal information. For reputational, compliance and legal reasons, this personal information needs to be de-identified before being shared with third parties, such as analytics teams or research scientists. The healthcare domain is particularly challenging since it deals with highly sensitive information. The de-identification process aims to achieve three goals: a) significantly and provably minimize the re-identification risk, b) maintain a high level of data utility so that the intended secondary purposes remain supported, and c) preserve the truthfulness of the data at the record level to the largest possible extent.
This project aims to explore innovative ways to provide a framework for calculating re-identification risk in meaningful and realistic settings and for generating reports for a mixed audience of technical, legal and compliance stakeholders. The project will build the foundational metrics for capturing the balance between information loss and re-identification risk. The end goal is a research prototype that demonstrates the framework in various scenarios.
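As a concrete baseline, re-identification risk is often measured under the prosecutor model: each record's risk is the reciprocal of the size of its equivalence class over the chosen quasi-identifiers. The sketch below illustrates this; the function name and the toy records are purely illustrative, not part of any existing framework.

```python
from collections import Counter

def reidentification_risk(records, quasi_identifiers):
    """Prosecutor-model risk: each record's risk is 1 / (size of its
    equivalence class over the chosen quasi-identifiers)."""
    key = lambda r: tuple(r[q] for q in quasi_identifiers)
    classes = Counter(key(r) for r in records)
    risks = [1.0 / classes[key(r)] for r in records]
    # Worst-case (maximum) and average risk over the dataset
    return max(risks), sum(risks) / len(risks)

# Hypothetical toy health records
data = [
    {"age": "30-39", "zip": "D04", "dx": "flu"},
    {"age": "30-39", "zip": "D04", "dx": "asthma"},
    {"age": "40-49", "zip": "D06", "dx": "flu"},
]
worst, avg = reidentification_risk(data, ["age", "zip"])
```

Generalising the unique "40-49"/"D06" record (e.g. widening the age band) would lower the worst-case risk while costing some data utility, which is exactly the trade-off the project's metrics aim to capture.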
Multi-modal information retrieval for annotating text documents with relevant images
Recent advances in word embedding approaches have taken the natural language processing and speech processing communities by storm. Joint embedding of text (ranging from character n-grams to whole documents) with images has opened up avenues for multi-modal data processing. In particular, this project will investigate the potential effectiveness of joint embedding approaches, i.e. images with words (Frome et al., NIPS '13), for multi-modal information retrieval. The specific application we are interested in is automatically enhancing the readability of a text document, e.g. a wiki page, by inserting relevant images at appropriate places in the text.
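Once text and images live in a shared embedding space, annotation reduces to nearest-neighbour retrieval. A minimal sketch, assuming pre-computed joint embeddings (the vectors below are toy stand-ins, not learned representations):

```python
import numpy as np

def retrieve_images(text_vec, image_vecs, top_k=2):
    """Rank images by cosine similarity to a text embedding in a
    shared (jointly trained) embedding space."""
    t = text_vec / np.linalg.norm(text_vec)
    imgs = image_vecs / np.linalg.norm(image_vecs, axis=1, keepdims=True)
    scores = imgs @ t                      # cosine similarities
    order = np.argsort(-scores)[:top_k]    # best-matching images first
    return order.tolist(), scores[order].tolist()

# Toy vectors standing in for learned joint embeddings of a text
# passage and three candidate images
text = np.array([1.0, 0.0, 0.0])
images = np.array([[0.9, 0.1, 0.0],
                   [0.0, 1.0, 0.0],
                   [0.7, 0.7, 0.0]])
idx, sims = retrieve_images(text, images)
```

In the project setting, one such query would be issued per paragraph (or other text unit), and the top-ranked image inserted at that position.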
This project is concerned with developing a system that “listens” to a meeting where decisions are to be made and creates a summary of the discussion pertaining to each decision: it identifies the decision itself, the proposed alternatives (and who proposed them), and the criteria and constraints discussed.
This project offers a variety of options in terms of the specific content of the internship: (i) developing a new set of annotations on an existing corpus, (ii) starting the development of a new corpus, (iii) developing new algorithms for alternative and preference extraction, or (iv) starting a new task such as discussion segmentation (when are people discussing a decision and when are they discussing other topics) or agreement detection (detecting when an agreement, if any, has been reached in the conversation).
Natural Language Processing, Text mining, Machine Learning
Machine learning models for the humanitarian sector
65 million people are displaced globally, the highest number ever recorded. Humanitarian aid budgets are also the largest they have ever been, yet only ~20% of aid recipients feel their needs have been met. The intern will contribute to the ongoing effort in humanitarian needs assessment by building models that leverage data sources from humanitarian agencies to estimate the different types of relief needed during a crisis.
Data mining, machine learning
Probabilistic preference model to account for incomparability
When comparing options that are judged on several attributes (e.g. apartments or jobs), some comparisons are more difficult than others. For instance, it is difficult to choose between two apartments if one is well located but very expensive and the other is affordable but poorly located. When posed with such comparisons, which involve a significant trade-off across attributes, decision-makers are more likely to express incomparability or indifference. We propose to use a random utility model to represent this effect. In this model, the attribute weights and the parameters of the marginal utility functions are drawn from probability distributions whose parameters represent the decision-maker's (DM's) preferences.
The intern will contribute to the development of the model and test its accuracy on real experimental data. The internship provides an excellent opportunity to learn about decision analytics and how to improve preference elicitation by taking behavioral results into consideration.
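To illustrate the idea, the sketch below draws attribute weights from a Dirichlet distribution and reports choice probabilities, treating small utility differences as indifference/incomparability. The distribution, the tie band and the apartment scores are illustrative assumptions, not the project's actual model:

```python
import numpy as np

def choice_probabilities(a, b, weight_alpha, n_samples=10000,
                         tie_band=0.05, seed=0):
    """Random-utility sketch: attribute weights are drawn from a
    Dirichlet whose parameters encode the DM's preferences.
    Returns P(choose a), P(choose b), P(indifferent/incomparable)."""
    rng = np.random.default_rng(seed)
    w = rng.dirichlet(weight_alpha, size=n_samples)   # sampled weight vectors
    diff = w @ (np.asarray(a) - np.asarray(b))        # utility differences
    p_a = np.mean(diff > tie_band)
    p_b = np.mean(diff < -tie_band)
    return p_a, p_b, 1.0 - p_a - p_b

# Two apartments with a strong location/price trade-off (normalized scores)
apt1 = [0.9, 0.2]   # great location, expensive
apt2 = [0.3, 0.8]   # poor location, affordable
p1, p2, p_inc = choice_probabilities(apt1, apt2, weight_alpha=[2.0, 2.0])
```

A comparison with a strong cross-attribute trade-off, as here, yields a noticeable incomparability mass; a dominated option would drive it to zero.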
Machine learning, Bayesian inference, Multi-attribute preference models, familiarity with Matlab.
Virtual testing and hardware-in-the-loop simulation
Vehicles are undergoing a huge revolution. They are transitioning from being isolated entities operating on the road to being connected, informed devices. The goal of this project is to design new collaborative services for connected vehicles that leverage technological IoT innovation and mathematical rigour. In particular, the focus of the project is on designing and developing a Hardware-in-the-Loop (HiL) platform to validate large-scale systems arising in a number of applications for partially-autonomous driving functions and cognitive automotive analytics.
Data-driven robust optimization
IBM Research Ireland is seeking a summer intern in the area of Robust Optimization (RO). Specifically, the candidate will be required to further develop a distributionally robust optimization approach from a data-driven perspective. While some RO approaches build uncertainty sets directly from data, most of the models in the Robust Optimization literature are not directly connected to data. Recent work on this issue has started to lay a foundation for this perspective. Further developing a data-driven theory of RO is interesting from a theoretical perspective, and also compelling in a practical sense, as many real-world applications are data-rich. The candidate will be required to scope, improve and apply existing algorithms to a set of applications that are of relevance to IBM. These include, but are not restricted to, cognitive IoT, portfolio optimisation and air traffic management.
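One simple data-driven construction is a box uncertainty set built from per-coordinate empirical quantiles, over which candidate decisions are evaluated in the worst case. The sketch below uses synthetic return data; the box set and long-only enumeration are illustrative simplifications, not the distributionally robust machinery the project targets:

```python
import numpy as np

def box_uncertainty_set(samples, coverage=0.9):
    """Data-driven box uncertainty set from per-coordinate
    empirical quantiles of the observed samples."""
    lo = np.quantile(samples, (1 - coverage) / 2, axis=0)
    hi = np.quantile(samples, 1 - (1 - coverage) / 2, axis=0)
    return lo, hi

def worst_case_return(weights, lo, hi):
    """Worst case of w . r over the box: a positively weighted asset
    takes its lower bound, a negatively weighted one its upper bound."""
    w = np.asarray(weights)
    return float(np.where(w >= 0, w * lo, w * hi).sum())

# Synthetic returns: a volatile asset and a stable one
rng = np.random.default_rng(1)
returns = rng.normal([0.05, 0.02], [0.10, 0.02], size=(500, 2))
lo, hi = box_uncertainty_set(returns)
robust_choice = max([(0.0, 1.0), (0.5, 0.5), (1.0, 0.0)],
                    key=lambda w: worst_case_return(w, lo, hi))
```

As expected of a robust criterion, the all-in position on the stable asset wins here even though the volatile asset has the higher mean return.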
Distributed optimization and consensus
We now live in a time when everything can be interconnected and programmed: the IoT is rapidly leading to a new industrial revolution in which networked objects (or nodes) communicate and collaborate with each other to fulfil a common goal that no individual object could achieve in isolation.
This new technological paradigm is also leading to a new paradigm in Control Theory: for these networked applications, it is indeed more convenient to design “local”, or decentralized control protocols, residing onto each node, rather than designing a central controller orchestrating the behaviour of all the objects in the network.
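The canonical example of such a local protocol is consensus averaging: each node updates using only its neighbours' values, yet the whole network converges to the global average without any central coordinator. A minimal sketch (the graph and step size are illustrative):

```python
import numpy as np

def consensus(values, neighbors, steps=100, step_size=0.2):
    """Decentralized averaging: each node repeatedly moves toward
    its neighbours' values; no central controller is involved."""
    x = np.asarray(values, dtype=float)
    n = len(x)
    for _ in range(steps):
        # Synchronous local update: x_i += step * sum_j (x_j - x_i)
        x = x + step_size * np.array(
            [sum(x[j] - x[i] for j in neighbors[i]) for i in range(n)])
    return x

# Ring of four nodes with local-only communication
ring = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
final = consensus([1.0, 5.0, 3.0, 7.0], ring)
```

The step size must be small enough relative to the node degrees for the iteration to converge; here every node converges to the network average, 4.0.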
We seek to explore new methodologies to design novel decentralized control protocols for one or more of the following cases:
The duties of the intern include:
Real-time car sharing
The overall goal is to enable commercial and individual car sharing using automated driving cars. More specifically, the project aims to develop real-time car-sharing concepts. This includes the optimisation of automated vehicle allocation, pick-up and drop-off, based on end-user needs and on real-time, reliable information about the actual vehicles' statuses and their scheduled routes. Work should build on existing IBM assets and previous projects.
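At the core of such a system is the repeated assignment of vehicles to pick-up requests. A minimal sketch that enumerates assignments to minimise total pick-up distance (fine for a tiny prototype; the Hungarian algorithm or column generation would be needed at fleet scale, and the coordinates below are invented):

```python
from itertools import permutations

def assign_vehicles(vehicles, requests):
    """Exhaustive minimum-total-pickup-distance assignment.
    Assumes len(requests) <= len(vehicles)."""
    def dist(a, b):
        return abs(a[0] - b[0]) + abs(a[1] - b[1])  # Manhattan distance
    best, best_cost = None, float("inf")
    for perm in permutations(range(len(vehicles)), len(requests)):
        cost = sum(dist(vehicles[v], requests[r])
                   for r, v in enumerate(perm))
        if cost < best_cost:
            best, best_cost = perm, cost
    return dict(enumerate(best)), best_cost  # request -> vehicle

vehicles = [(0, 0), (5, 5), (9, 1)]   # current vehicle positions
requests = [(6, 5), (1, 1)]           # pick-up locations
match, total = assign_vehicles(vehicles, requests)
```

In a real-time setting this optimisation would be re-run as vehicle statuses and scheduled routes change, with travel times from the road network replacing raw distances.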
Real-time filters for monitoring drivers behavior
Poor car-following behavior is responsible for a significant number of car accidents. In this work, we aim to design and implement a system that trains drivers to follow more safely, utilising measurements from the vehicle's proximity sensors, e.g. relative distance and speed to the leading vehicle(s). Recent work has shown that offline parameter identification of car-following models is a tedious task that requires precise knowledge of the models' specificities, i.e. parameters may only be identifiable in specific traffic regimes.
The work will consist of designing an algorithm that efficiently performs online parameter identification of car-following models given the available measurements. The first step is to study the mapping between the identifiability of car-following parameters and the corresponding traffic flow regimes. The second step is to design an online filter that integrates this mapping. Then, a risk model based on safety indicators and simulation analysis will be derived. Finally, a prototype system will be implemented; it will intervene only if dangerous behaviour is detected according to the risk model.
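As an illustration of online identification, the sketch below runs a bootstrap particle filter over the gain of a deliberately simplified car-following law, accel = k × relative speed. Real car-following models (e.g. IDM) have several coupled parameters; this one-parameter toy law is an assumption made for the sketch, and note that k is only informative when the relative speed is non-zero, mirroring the regime-dependent identifiability discussed above.

```python
import numpy as np

def estimate_gain(obs_rel_speed, obs_accel, n_particles=2000,
                  noise_std=0.05, seed=0):
    """Particle-filter sketch: infer the gain k of the toy law
    accel = k * relative_speed from noisy acceleration data."""
    rng = np.random.default_rng(seed)
    particles = rng.uniform(0.0, 2.0, n_particles)       # prior over k
    weights = np.ones(n_particles) / n_particles
    for dv, a in zip(obs_rel_speed, obs_accel):
        lik = np.exp(-0.5 * ((a - particles * dv) / noise_std) ** 2)
        weights *= lik
        weights /= weights.sum()
        # Resample (with small jitter) when the effective sample
        # size collapses
        if 1.0 / np.sum(weights ** 2) < n_particles / 2:
            idx = rng.choice(n_particles, n_particles, p=weights)
            particles = particles[idx] + rng.normal(0, 0.01, n_particles)
            weights = np.ones(n_particles) / n_particles
    return float(np.sum(particles * weights))            # posterior mean

# Synthetic measurements generated with true gain k = 0.8
rng = np.random.default_rng(1)
dv = rng.uniform(-3, 3, 50)
acc = 0.8 * dv + rng.normal(0, 0.05, 50)
k_hat = estimate_gain(dv, acc)
```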
System identification, particle filter, Python.
Feedback control of the modal shift considering priority queues in traffic light control
The modal shift towards public transport is a priority for many city policy makers. New traffic optimisation techniques in cities look into prioritising traffic classes that contribute less to pollution and noise, i.e. pedestrians, cyclists and buses. Recent work has been done on queue-based traffic light optimisation.
This work will adapt recent work on prioritising specific traffic classes. It will assess how this new optimisation policy affects the travel times of pedestrians, buses, cars, etc. in a city, using a realistic demand model for trips in the city. Then, using the elasticity coefficients between travel times and origin-destination (OD) trips available in the literature, it will estimate how many citizens actually contribute to the modal shift towards public transport. A feedback mechanism to control the modal shift will be proposed.
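The elasticity step can be illustrated with a constant-elasticity relationship between car travel time and car demand; the elasticity value and corridor numbers below are hypothetical:

```python
def modal_shift(car_trips, car_time_before, car_time_after, elasticity=0.3):
    """Constant-elasticity sketch: relative change in car demand =
    elasticity * relative change in car travel time. Positive result
    = trips shifting away from the car (to public transport)."""
    rel_time_change = (car_time_after - car_time_before) / car_time_before
    return car_trips * elasticity * rel_time_change

# Bus-priority signals slow car trips from 20 to 22 minutes
# on a corridor carrying 10,000 daily car trips
shifted = modal_shift(car_trips=10000,
                      car_time_before=20.0, car_time_after=22.0)
```

A feedback controller would close the loop: it would measure the realised shift, compare it against a policy target, and adjust the signal priority accordingly.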
Strong interest in optimisation, genuine interest in code development (Python, SUMO, GIS) and in transport policy.
City-scale real-time pollution estimation from roadway traffic
Pollution monitoring in cities is only starting to become practical thanks to increasingly available environmental data sources, e.g. air pollution measurements, weather data and precise knowledge of traffic volumes. Recent work in this area includes a data assimilation framework for urban air pollution monitoring and a modelling chain to estimate pollution levels from highway traffic.
This project will take advantage of available data feeds for traffic volumes, weather conditions and air pollution levels at stations across a city. The goal of this project is to assess accurately how much city traffic contributes to air pollution levels. Pollutant emission and dispersion models will be integrated in a data assimilation framework. This will provide insights into how to better manage air pollution levels on critical days by leveraging city traffic.
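In its simplest scalar form, the assimilation step is a Kalman-style update that blends a model estimate with a station measurement according to their variances. A sketch (the concentrations and variances below are hypothetical):

```python
def assimilate(model_estimate, model_var, measurement, meas_var):
    """Scalar Kalman-style update: blend a traffic-emission model
    estimate of a pollutant concentration with a monitoring-station
    measurement, weighting each by its inverse variance."""
    gain = model_var / (model_var + meas_var)
    analysis = model_estimate + gain * (measurement - model_estimate)
    analysis_var = (1 - gain) * model_var
    return analysis, analysis_var

# Hypothetical NO2 concentrations in ug/m3: an uncertain model value
# corrected by a more precise station reading
analysis, var = assimilate(model_estimate=42.0, model_var=25.0,
                           measurement=50.0, meas_var=5.0)
```

The analysis is pulled strongly toward the more trusted measurement, and its variance is smaller than either input's, which is the basic payoff of assimilation.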
Code development (interfacing different types of models, different types of data sources), critical thinking, genuine interest in sustainable cities.
Cognitive disruption control in complex schedules
Schedule disruptions are a significant source of unplanned costs in various transport services (e.g., airlines, rail). Operations controllers can mitigate the financial and service impacts of disruptions if they can estimate the costs of the disruptions and have suggestions as to what to do.
Ensemble based forecasting of wave conditions
Ensemble techniques have been demonstrated to outperform individual models in operational forecasting and minimising prediction errors (Mallet and Sportisse, 2006). This is particularly relevant for forecasting wave conditions in coastal ocean regions subject to model errors arising from incorrect forcing data, model parametrizations and model structural errors (Rogers et al., 2005).
In this study we aim to combine physics models of near-shore circulation and wave characteristics with ensemble forecasting methods to generate optimal forecasts with defined uncertainty. The approach is applied to a case-study site, Santa Cruz, California. The system involves a coupled wave model and circulation model. Circulation patterns are resolved by EFDC, a 3D circulation model, while wave information is computed using SWAN, a third-generation wave model that computes wind-generated waves in coastal and inland waters. Input data includes a high-resolution meteorological field; predictions are highly sensitive to the accuracy of the wind fields.
We aim to investigate methodologies to optimally combine multiple forecasts of wave characteristics. We will investigate different linear combinations of models to improve the performance of model-data comparisons. The weights attached to these models will be studied, and techniques to select and forecast optimum weights evaluated.
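A baseline for such linear combinations is ordinary least squares fitted on a training window of past member forecasts and observations. A sketch on synthetic wave heights (the member biases and noise levels are invented for illustration; convex or non-negative weight constraints are common refinements):

```python
import numpy as np

def ensemble_weights(model_forecasts, observations):
    """Least-squares weights for a linear combination of ensemble
    members, fitted against past observations."""
    F = np.asarray(model_forecasts)      # shape (n_times, n_models)
    y = np.asarray(observations)
    w, *_ = np.linalg.lstsq(F, y, rcond=None)
    return w

# Two toy wave-height "models": one biased high, one biased low,
# each with independent noise
rng = np.random.default_rng(2)
truth = np.linspace(1.0, 2.0, 50)
forecasts = np.column_stack([truth * 1.2 + rng.normal(0, 0.05, 50),
                             truth * 0.6 + rng.normal(0, 0.05, 50)])
w = ensemble_weights(forecasts, truth)
combined = forecasts @ w
```

The fitted combination corrects both members' systematic biases, so its error against the observations is smaller than either member's alone.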
The ideal candidate will have experience in numerical modelling and the Linux/UNIX environment. In addition, the ability to analyse large datasets, combining basic statistics with a choice of analysis software (R, Python, etc.), is useful.
Mallet, V., Sportisse, B., 2006. Ensemble-based air quality forecasts: A multimodel approach applied to ozone. J. Geophys. Res. Atmospheres 111.
Forking events analysis in blockchain protocols
Blockchain is the fabric at the core of crypto-currencies, and it is based on the concept of algorithmic consensus. One methodology for preserving the fabric in the presence of consensus-breaking inconsistencies (generated by an adversary or by inconsistent sub-versions of the core software) is to fork the Blockchain. While this can be considered a brute-force patch, it has been implemented more than once in widely used crypto-currencies. This project aims at applying probability theory and advanced computational complexity to study current and past fork events in crypto-currencies, their root causes, their impact on the Blockchain fabric, and possible alternative consensus-repairing solutions.
Cognitive inverse modelling and its application to hydraulic diffusivity inversion
Extraction of fluids from porous media is critical both for petroleum resource management and for supplying drinking water to a global population. Sparse sampling through wells and the heterogeneity of geologic formations make inverse estimation of the permeability field a difficult, under-determined problem. In this project we propose a cognitive strategy for estimating the heterogeneous diffusion coefficient – the permeability field – of a 2D confined aquifer. The aquifer is modeled using a 2D linear Darcy equation within a relatively simple geometry. The permeability field is to be inferred from given measurements of water levels from a network of wells.
The key idea behind this project is to use state-of-the-art spectral clustering algorithms to learn clusters in the input space, e.g. massive amounts of geophysically plausible samples of permeability fields, from the data available in the output space, e.g. observed water heights at well locations. The expected outcome is:
Uncertainty quantification is a second important point of the project: clusters are similar in terms of the misfit function, hence we can expect that, after the inversion, the cluster members will allow one to quantify the misfit variance associated with the optimized cluster centers. The student will work with IBM Research staff to develop, implement and test a cognitive inversion prototype for 2D Darcy flows. The work-flow of the project will be as follows:
The final goal of the project is to develop and test the prototype and implement it as a general-purpose C++ library.
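The spectral clustering step described above can be illustrated with a numpy-only two-way split: Gaussian affinities, the unnormalized graph Laplacian, and the sign of the Fiedler vector. This is a Python sketch on toy low-dimensional samples; the project's permeability fields would be high-dimensional and clustered by their output-space misfit rather than by raw coordinates.

```python
import numpy as np

def spectral_bipartition(samples, sigma=1.0):
    """Minimal spectral clustering: Gaussian affinity matrix,
    unnormalized graph Laplacian, two-way split by the sign of
    the Fiedler vector (2nd-smallest eigenvector)."""
    X = np.asarray(samples, dtype=float)
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.exp(-d2 / (2 * sigma ** 2))       # pairwise affinities
    L = np.diag(W.sum(1)) - W                # graph Laplacian
    vals, vecs = np.linalg.eigh(L)           # ascending eigenvalues
    fiedler = vecs[:, 1]
    return (fiedler > 0).astype(int)

# Toy "samples" drawn around two plausible modes
rng = np.random.default_rng(0)
fields = np.vstack([rng.normal(0.0, 0.1, (10, 3)),
                    rng.normal(3.0, 0.1, (10, 3))])
labels = spectral_bipartition(fields)
```

Each recovered cluster would then contribute one representative center to the inversion, with the within-cluster spread feeding the misfit-variance estimate.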
Ideal intern skills
Automatic predictive maintenance
Predictive maintenance plays an important role in industries including manufacturing, healthcare and transportation. In practice, it is very time-demanding and requires skilled data scientists to extract hand-crafted features from sensor data. In this project, the intern will tackle these problems by automating this process. In particular, the automation must work with raw sensor data collected from IoT devices; such data might be erroneous, missing or sampled at different resolutions. The intern will design algorithms to efficiently search for features that predict device failures, using state-of-the-art feature and representation learning techniques from the data mining and deep learning fields.
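A first building block for such automation is a sliding-window extractor over raw, possibly gappy sensor streams; the handful of statistics below stands in for the much larger candidate pools an automated feature search would generate and rank:

```python
import numpy as np

def window_features(signal, window, step):
    """Sliding-window features over a raw sensor stream: mean, std,
    min, max per window, tolerating missing (NaN) samples."""
    feats = []
    for start in range(0, len(signal) - window + 1, step):
        w = np.asarray(signal[start:start + window], dtype=float)
        w = w[~np.isnan(w)]               # drop missing samples
        feats.append([w.mean(), w.std(), w.min(), w.max()])
    return np.array(feats)

# Vibration-like sensor trace with a missing sample; the second
# window shows the elevated levels that might precede a failure
trace = [0.1, 0.2, np.nan, 0.1, 0.9, 1.1, 1.0, 0.95]
F = window_features(trace, window=4, step=4)
```

A downstream classifier would then be trained on such feature matrices, with the search ranking candidate features by how well they separate healthy and failing devices.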
Machine learning, statistics, data mining, time series analysis, deep learning, programming (Python, Spark (optional)).
Learning like humans using deep learning
Help us push the boundaries of deep-learning and AI. We use deep learning on large scale data to help cognitive systems understand the relationships between data of different types:
Can an AI understand the weather?
The weather impacts everyone; however, finding the relationships between weather and human behaviour is challenging, due both to the complexity of human behaviour and to the massive scale of weather data. We use deep learning and other machine learning tools to predict the impact of weather on transportation, airlines, agriculture, and more. You will learn how to train machine learning models at scale. This internship will work towards having an impact on IBM's clients.
Large-scale cognitive workload optimization
Machine Learning and Applied AI techniques are broadly used to provide prompt answers and insights to challenging problems across industries and societal challenges. IBM is spearheading the transition to cognitive computing. As part of your internship, you will be given the opportunity to create impact by optimizing performance vs. efficiency trade-offs at the intersection of machine/deep learning runtimes and OpenPOWER systems, when solving real-life problems.
More info: See blog article “Putting the AI in PowerAI”
Pathfinding in new cloud architectures
As the transition to cloud computing progresses, we are exploring new cloud architectures in multiple directions, such as bringing new levels of efficiency through resource pooling, or extracting more value through hardware/software specialization to the common-denominator needs of cloud workloads and services. As part of your internship, you will have the opportunity to work with us on related pathfinding activities, testing ideas on experimental architectures featuring resource disaggregation and near-data computing.
Healthcare & social care
Are you looking to perform cutting-edge research with real-world impact? The IBM Research Lab in Dublin seeks talented and enthusiastic research interns to join our team.
We are interested in candidates with excellent technical skills who are willing to apply their research skills to solve real-world problems. You will work with cutting-edge technologies, researching novel techniques to acquire, represent and exploit urban, environmental, social and health data and information to improve how health care is managed and delivered.
We are looking for PhD students in the domains of health informatics, nursing informatics, social care informatics or related disciplines. Experience with artificial intelligence, data management, machine learning or data mining techniques is preferred.