2021 Great Minds student internships

Topics at the Africa Lab
in Johannesburg

Ref. code
Project description
SA‑2021‑01

Neurosymbolic AI (foundational AI towards natural language understanding)

Neural/sub-symbolic interpretations of logical inference and reasoning date back to the first descriptions of artificial neural networks and their use as threshold logic. However, symbolic AI was the dominant paradigm for decades during the advent of AI because it offered interpretable and general, human-like reasoning. Neuro-symbolism aims to combine the fault tolerance, parallelism and learning of connectionism with the logical abstractions and inference of symbolism, so that logical abstraction can be performed within connectionist settings. Neuro-symbolic integration can take several forms, e.g. (1) propositionalization of raw data for a symbolic interpretation; (2) predicate implementation to perform logical functions on ground propositions; (3) predicate invention for rule induction and theory learning; and (4) implementing various logical reasoning constructs such as modus ponens, inference, implication, entailment and modal logic.
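
As a flavour of item (2) above, here is a minimal sketch of differentiable ("soft") logical connectives over truth values in [0, 1], using the product t-norm. Such operators can be composed with learned neural predicates so that a rule like modus ponens stays differentiable end to end; this is illustrative only and not the project's architecture.

```python
# Differentiable soft-logic operators over truth values in [0, 1].
import torch

def soft_and(a, b):          # product t-norm
    return a * b

def soft_or(a, b):           # probabilistic sum (t-conorm)
    return a + b - a * b

def soft_not(a):
    return 1.0 - a

def soft_implies(a, b):      # a -> b  ==  not(a) or b
    return soft_or(soft_not(a), b)

def modus_ponens(a, a_implies_b):
    # Degree to which b can be inferred from a and (a -> b).
    return soft_and(a, a_implies_b)

# Example: neural predicates output soft truth values for ground propositions.
is_bird = torch.tensor(0.9, requires_grad=True)   # e.g. P(bird(x)) from a network
bird_implies_flies = torch.tensor(0.8)            # learned or given rule strength
flies = modus_ponens(is_bird, bird_implies_flies)
flies.backward()                                  # gradients flow through the logic
print(float(flies), float(is_bird.grad))
```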

In this project, new neuro-symbolic architectures and models will be developed to demonstrate the superiority of neuro-symbolic learning and reasoning in natural language understanding. This will spur the development of complementary approaches that combine deep learning advancements with symbolic AI to express their strengths and supplement their weaknesses.

The intern will run experiments on real-world data, develop new models, and report the findings in scientific publication(s).

Intern responsibilities

  • Conduct research in the field of machine learning, deep learning, and AI
  • Develop generalizable and scalable AI and conversational architectures and systems applied to addressing the “big” challenges in Africa
  • Efficiently implement algorithms and run experiments on real data (either small, big or zero data)
  • Build information machines and smart applications that enable humans to make better decisions and perform more efficiently in their tasks and allow for their swift integration into industry software components

Requirements

  • Graduate students (MSc or PhD) in computer science or related areas such as EE, Mathematics, etc. Outstanding senior BSc undergraduate students may also apply
  • Strong background in machine learning, deep learning and applied statistics
  • Hands-on experience in writing code
  • Creativity and innovative thinking
  • Experience in Python and in ML libraries such as Scikit-Learn, as well as data-analysis and visualization libraries (e.g., Pandas, Matplotlib, Seaborn)
  • A publication record in top-tier conferences and journals would be beneficial
  • Knowledge of software engineering and blockchain protocols is a plus
SA‑2021‑02

Computational Genomics for TB Drug Resistance Profiling using AI

Currently, phenotypic testing, which involves culturing the tuberculosis (TB) bacteria and determining their response to specific anti-TB drugs, is the gold standard for identifying resistant strains owing to its high sensitivity. The main disadvantages of this technique are the long lead times and the high costs associated with the method. The efficacy of targeted gene-sequencing techniques for the diagnosis of TB and its resistance to rifampicin has been validated, showing them to be another effective tool. However, these technologies screen only a limited number of genetic variations, are not designed to identify novel mutations or to exclude drug resistance caused by other mechanisms, and have suboptimal success rates when applied directly to specimens rather than culture isolates.

The application of whole genome sequencing (WGS) as a tool for the diagnosis and clinical management of tuberculosis promises to circumvent the long lead times and the limited scope of conventional phenotypic drug susceptibility testing and targeted sequencing techniques. However, to achieve this target, novel methods, which harness the power of deep learning and the large volumes of data generated from WGS, integrated with clinical data are required to effectively identify these emerging mutations or identify novel genes associated with drug resistance. Critically, WGS data provides information beyond just drug susceptibility to include insights such as strain lineage, which could be used to intervene in nascent outbreaks. Such opportunities, however, are relatively unexplored.

In this project, deep learning methods will be applied to WGS data in conjunction with already-collected routine clinical and microbiological meta-data to (i) identify and predict novel drug-resistant mutations, and (ii) identify mycobacterial genetic biomarkers that predict tuberculosis treatment outcomes.
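
As one illustration of objective (i), the sketch below shows a possible model family: a 1D CNN over one-hot-encoded sequence windows around candidate resistance loci, predicting phenotypic resistance to a given drug. The input encoding, window size and labels are assumptions made for illustration; the actual WGS pipeline is not specified here.

```python
# Minimal 1D-CNN sketch for drug-resistance prediction from encoded sequence windows.
import torch
import torch.nn as nn

class ResistanceCNN(nn.Module):
    def __init__(self, n_channels=4, n_drugs=1):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(n_channels, 32, kernel_size=11, padding=5), nn.ReLU(),
            nn.MaxPool1d(4),
            nn.Conv1d(32, 64, kernel_size=11, padding=5), nn.ReLU(),
            nn.AdaptiveMaxPool1d(1),
        )
        self.classifier = nn.Linear(64, n_drugs)

    def forward(self, x):            # x: (batch, 4, window) one-hot A/C/G/T
        h = self.features(x).squeeze(-1)
        return self.classifier(h)    # logits; apply sigmoid for resistance probability

model = ResistanceCNN()
x = torch.randn(8, 4, 1000)          # placeholder batch of encoded 1 kb windows
print(model(x).shape)                # torch.Size([8, 1])
```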

The intern will run experiments on real-world data, develop new models, and report the findings in scientific publication(s).

Requirements

  • Strong programming skills in Python
  • Strong analytical and problem-solving skills
  • Excellent communication and team skills
  • Experience with AI / machine learning techniques
  • Experience using essential Python libraries such as Scikit-learn, Theano, NumPy, Matplotlib
  • Experience with TensorFlow or PyTorch machine-learning frameworks
SA‑2021‑03

Quantum Computing

IBM Quantum is an industry-first initiative to build universal quantum computers for business, engineering and science. This effort includes advancing the entire quantum computing technology stack and exploring applications to make quantum broadly usable and accessible. With a worldwide network of Fortune 500 companies, academic institutions, researchers, educators, and enthusiasts, we are committed to driving innovation for our clients in the IBM Q Network and the Qiskit Community.

Quantum computing promises to revolutionize the computing power available for certain applications. One such application is quantum chemistry, where the extra power could lead to world-changing solutions for climate change, food security, drug discovery and battery design.

Intern responsibilities

In this project, the intern will apply current quantum computing techniques for chemistry, such as the variational quantum eigensolver (VQE), to study a small, scaled-down version of an HIV protease. Possible additional techniques include embedding the quantum calculation in a larger classical approximation.
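
As a self-contained illustration of the VQE workflow, the sketch below minimizes the energy of a toy 2-qubit Hamiltonian with a hardware-efficient-style ansatz in plain NumPy/SciPy. The Hamiltonian coefficients are placeholders, not the protease fragment; on real hardware the same loop would be expressed with Qiskit.

```python
# Toy VQE: classical simulation of a parametrized 2-qubit ansatz.
import numpy as np
from scipy.optimize import minimize

I2 = np.eye(2)
X = np.array([[0, 1], [1, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)

def kron(*ops):
    out = np.array([[1.0 + 0j]])
    for op in ops:
        out = np.kron(out, op)
    return out

# Placeholder Hamiltonian: H = 0.4 ZI + 0.4 IZ + 0.2 ZZ + 0.1 XX
H = 0.4 * kron(Z, I2) + 0.4 * kron(I2, Z) + 0.2 * kron(Z, Z) + 0.1 * kron(X, X)

def ry(theta):
    return np.array([[np.cos(theta / 2), -np.sin(theta / 2)],
                     [np.sin(theta / 2),  np.cos(theta / 2)]], dtype=complex)

CNOT = np.array([[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 0, 1], [0, 0, 1, 0]], dtype=complex)

def ansatz_state(params):
    psi = np.zeros(4, dtype=complex); psi[0] = 1.0        # |00>
    psi = kron(ry(params[0]), ry(params[1])) @ psi        # single-qubit rotations
    psi = CNOT @ psi                                      # entangling layer
    return kron(ry(params[2]), ry(params[3])) @ psi

def energy(params):
    psi = ansatz_state(params)
    return float(np.real(psi.conj() @ H @ psi))

result = minimize(energy, x0=np.random.uniform(0, np.pi, 4), method="COBYLA")
print("VQE energy:", result.fun, " exact ground state:", np.linalg.eigvalsh(H)[0])
```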

Requirements

  • Competent Python programming skills
  • At least an undergraduate-level understanding of chemistry
SA‑2021‑04

Applications of geospatial data analytics for climate risk assessment

Downscaling Climate Prediction Models: AI-based climate predictions for regional impact assessments and other applications

This research is focussed on developing long-range seasonal climate forecasting models based on data-driven methods, namely machine learning techniques, extending what is currently possible with state-of-the-art numerical climate models. Aspects being researched include the prediction of seasonal averages, extremes and mechanisms for weather generation as an approach for downscaling and modelling various climate scenarios. The South Africa research lab is primarily focussed on the development of seasonal forecasts, which will be used in various climate impact assessments and in developing a general climate risk and resiliency framework.

Climate data is inherently spatiotemporal, which introduces many challenges when it comes to developing forecasting models. Therefore, different sub-projects are related to this goal of generating seasonal predictions using data-driven and machine learning approaches. The following are examples of sub-projects that could add value to the goal of improving long-range seasonal predictions:

  • Investigate what predictor variables (atmospheric, land or oceanic) are most relevant for achieving skilful temperature and precipitation forecasts and how this changes as a function of lead-time.
  • Use techniques like optimal input or layer-wise relevance propagation as an attempt to improve our understanding of what DL models learn and what features in the input are important to them. This can highlight known and possibly unknown teleconnections in covariates that have long-range effects on certain target variables.
  • Design more accurate climatologies that can be used as a rigorous baseline for forecasting, and to investigate if such a baseline could be useful when provided to ML models as additional input.
  • Compare skill of CNN and LSTM algorithms when doing long-range forecasting.
  • Combine CNN and LSTM models into a unified system that can take advantage of spatial and temporal characteristics in climate data (a minimal model sketch follows this list).
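
The sketch below illustrates the hybrid model in the last item above: a CNN encodes each monthly gridded field (e.g., SST anomalies) and an LSTM models the sequence of encodings to produce a seasonal forecast. The input shapes and the single-value target are illustrative assumptions, not a prescribed design.

```python
# CNN-per-timestep encoder followed by an LSTM over the encoded sequence.
import torch
import torch.nn as nn

class CNNLSTMForecaster(nn.Module):
    def __init__(self, in_channels=1, hidden=64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(in_channels, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),           # -> (batch, 32) per time step
        )
        self.lstm = nn.LSTM(input_size=32, hidden_size=hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)                     # e.g., seasonal rainfall anomaly

    def forward(self, x):                                    # x: (batch, time, channels, H, W)
        b, t = x.shape[:2]
        feats = self.encoder(x.flatten(0, 1)).view(b, t, -1)
        out, _ = self.lstm(feats)
        return self.head(out[:, -1])                         # forecast from the last time step

model = CNNLSTMForecaster()
x = torch.randn(4, 12, 1, 64, 128)                           # 12 months of 64x128 grids
print(model(x).shape)                                        # torch.Size([4, 1])
```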

Requirements

  • Graduate students (MSc or PhD) in computer science or related areas such as Electronic Engineering, Mathematics, Statistics, or Environmental Sciences with strong computational skills
  • Some background in either machine learning or statistics, with deep learning experience as an added advantage
  • Hands-on experience in writing code
  • Creativity and innovative thinking
  • Experience in Python and in ML libraries such as Scikit-Learn, as well as data-analysis and visualization libraries (e.g., Pandas, Matplotlib, Seaborn)
SA‑2021‑05

Semantic Analysis of Text in African Languages

The growth in both user-generated content and officially published content in multiple Bantu languages has created a need to extract insights from such data. This requires a deep semantic analysis of various grammatical components. Though there is a proposed architecture for insights extraction, further investigation is required in order to produce an end-to-end implementation. The key stumbling blocks are in developing generalizable models for the processes of morphological analysis, part-of-speech tagging, and sentiment analysis.

Key avenues of investigation include: the use of FastText to extract sub-word information for morphological and sentiment analysis; the suitability of approaches such as Morph2Vec for the morphological analysis of agglutinative languages in general; applying lexicon-retrofitted word vectors to solve the problem of noun semantics in Bantu languages; applying the noun class ontology to extract logical representations of the semantics of noun class-based grammar forms; and generalizing all of the above to at least five Bantu language zones.
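
To illustrate why sub-word (character n-gram) models are attractive for agglutinative languages, here is a minimal FastText sketch: vectors can be composed for unseen, morphologically complex word forms. It assumes gensim >= 4 and a tokenized corpus; the isiZulu-style tokens are illustrative only.

```python
# Sub-word embeddings with FastText; out-of-vocabulary inflections still get vectors.
from gensim.models import FastText

sentences = [
    ["ngiyakuthanda", "kakhulu"],        # toy tokenized sentences (illustrative)
    ["uyangithanda", "na"],
    ["siyathandana", "thina"],
]

model = FastText(vector_size=100, window=3, min_count=1, min_n=3, max_n=6)
model.build_vocab(corpus_iterable=sentences)
model.train(corpus_iterable=sentences, total_examples=len(sentences), epochs=20)

# A vector is composed from character n-grams even for an unseen inflection.
vec = model.wv["bayathandana"]           # not in the training corpus
print(vec.shape, model.wv.most_similar("bayathandana", topn=2))
```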

Requirements

  • Graduate students in computer science, engineering, mathematics, or a related field
  • Strong background in machine learning, deep learning, and applied statistics
  • Experience in Python and in ML libraries
  • NLP experience is preferable
SA‑2021‑06

Clinical Natural Language Processing for Digital Pathology

Global cancer registries around the world employ expert human coders to label pathology reports using the appropriate International Classification of Disease for Oncology (ICD-O) codes. This manual process results in a significant lag in reporting of cancer statistics. Several automated approaches have been proposed for labelling cancer reports, including rule-based information retrieval (IR) and natural language processing (NLP) techniques. Rule-based systems are typically not generalizable across cancer domains and struggle with variability in report structure. Recent advances in the application of machine learning and deep learning models for natural language processing have reported performances that comprehensively outperform rule-based solutions. In fact, it has been demonstrated that convolutional neural networks (CNNs) using word embeddings consistently outperform classical machine learning approaches using term frequency-inverse document frequency (TF-IDF) for IR and classification of pathology reports by primary tumour site.
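
For context, the sketch below shows the classical TF-IDF baseline mentioned above, against which the CNN and graph models would be compared: bag-of-words features from free-text reports and a linear classifier over ICD-O topography codes. The example reports and codes are placeholders.

```python
# TF-IDF + logistic-regression baseline for pathology-report classification.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

reports = [
    "invasive ductal carcinoma of the left breast, upper outer quadrant",
    "infiltrating lobular carcinoma, right breast",
    "squamous cell carcinoma of the cervix",
]
icdo_topography = ["C50.4", "C50.9", "C53.9"]      # placeholder ICD-O topography codes

baseline = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), min_df=1),
    LogisticRegression(max_iter=1000),
)
baseline.fit(reports, icdo_topography)
print(baseline.predict(["ductal carcinoma in situ of the breast"]))
```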

In this project, hierarchical and temporal CNNs and graphical convolutional networks will be used to create models for predicting ICD-O codes for free text breast cancer pathology reports.

The envisioned system will automatically infer ICD-O codes for individual cancer patients using information in their pathology reports, initially with a focus on breast cancer. The cancer coder is expected to provide the following advantages over the current state of the art, that is, human coders: i) rapid speed with a potential for real-time coding, ii) high-throughput processing capacity, iii) increased performance (sensitivity/specificity), and iv) consistent, less subjective coding. It is anticipated that the system will take as input the pathology reports of individual patients and provide two levels of output: i) the most probable ICD-O code for a given report, including topography and morphology, and ii) a ranked list of alternative ICD-O codes at different levels of confidence or likelihood.

Intern responsibilities

  • Conduct research in the field of machine learning, deep learning, and AI
  • Develop generalizable and scalable AI and conversational architectures and systems applied to addressing the “big” challenges in Africa
  • Efficiently implement algorithms and run experiments on real data (either small, big or zero data)
  • Build information machines and smart applications that enable humans to make better decisions and perform more efficiently in their tasks and allow for their swift integration into industry software components

Requirements

  • Graduate students (MSc or PhD) in computer science or related areas such as EE, Mathematics, etc. Outstanding senior BSc undergraduate students may also apply
  • Strong background in machine learning, deep learning and applied statistics
  • Hands-on experience in writing code
  • Creativity and innovative thinking
  • Experience in Python and in ML libraries such as Scikit-Learn, as well as data-analysis and visualization libraries (e.g., Pandas, Matplotlib, Seaborn)
  • A publication record in top-tier conferences and journals would be beneficial

Topics at the Africa Lab
in Nairobi

Ref. code
Project description
K‑2021‑01

Detecting Imbalanced Randomizations in RCT Studies

RCTs are the gold standard for measuring causal effects of an intervention as well as a fundamental component of the scientific method. Random assignment to “treatment” and “control” groups is critical for correct follow-up analysis. However, due to difficult survey design, poor training – or just bad luck – it is possible that some sub-population is over- or under-represented in one of the groups. This imbalance is typically tested for using standard statistical techniques that look across each feature individually (e.g. age, gender, income, education). However, to our knowledge, there is no systematic process in place to test for an imbalance across all sub-populations. For example, the two groups may look similar in their representation of age and education separately, yet one group may have a much higher proportion of over-50-year-olds with less than a high school education.
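
The sketch below makes the cross-feature idea concrete: instead of testing each covariate separately, it enumerates pairs of feature values and compares their prevalence in treatment vs. control with a chi-square test. A real subset-scan method searches this space far more efficiently and controls for multiple testing; this brute-force version and the column names are illustrative assumptions.

```python
# Brute-force pairwise check for subgroup imbalance in an RCT assignment.
import itertools
import pandas as pd
from scipy.stats import chi2_contingency

def pairwise_imbalance(df, features, group_col="group"):
    results = []
    for f1, f2 in itertools.combinations(features, 2):
        for v1 in df[f1].unique():
            for v2 in df[f2].unique():
                in_subset = (df[f1] == v1) & (df[f2] == v2)
                table = pd.crosstab(in_subset, df[group_col])
                if table.shape == (2, 2):
                    chi2, p, _, _ = chi2_contingency(table)
                    results.append((f"{f1}={v1} & {f2}={v2}", p))
    return sorted(results, key=lambda r: r[1])       # smallest p-values first

# df = pd.read_csv("rct_assignment.csv")             # columns: age_band, education, gender, group, ...
# print(pairwise_imbalance(df, ["age_band", "education", "gender"])[:10])
```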

Our scanning methods will hold RCT assignment to a higher standard of scrutiny than previously possible. Detecting imbalanced assignment does not necessarily invalidate previous conclusions from the study. However, researchers should be given as many tools as possible to check their assumptions.

Potential business implications include tech companies that regularly run (very large-scale) A/B tests on their customers/web users.

K‑2021‑02

Bayesian Network Structure Learning Through Prediction Bias

Bayesian Networks are graphical models that show how features of a dataset may be causally connected to each other. This representation gives investigators the ability to perform ‘what-if’ analysis on their datasets because changes to one feature’s distribution are clearly mapped out to the rest of the data. Learning these connections between features (i.e. structure learning) from data is an open problem in Machine Learning.

Naïve Bayes is a simpler form of Bayesian Network. This project hypothesizes that the restricted, simpler form of Naïve Bayes induces a predictive bias on some subset of the records. Bias Scan will be used on these predictions to identify the combination of feature values (i.e. subset of data records) with systematically incorrect predictions. This information is then used to inform/update the graph structure of the network. Iterating on this procedure may result in a reasonable, data-driven network structure.
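
As a sketch of the first step, the code below fits Naive Bayes and then looks for feature-value subgroups whose observed outcome rate deviates systematically from the model's predicted rate. Bias Scan searches over combinations of values; this one-feature-at-a-time version, the binary target and the column names are simplifying assumptions.

```python
# Per-subgroup prediction bias of a Naive Bayes model.
import pandas as pd
from sklearn.naive_bayes import CategoricalNB

def subgroup_prediction_bias(df, feature_cols, target_col):
    X = df[feature_cols].apply(lambda c: c.astype("category").cat.codes)
    y = df[target_col]
    model = CategoricalNB().fit(X, y)
    df = df.assign(p_hat=model.predict_proba(X)[:, 1])

    rows = []
    for col in feature_cols:
        for value, grp in df.groupby(col):
            # Large |bias| suggests a dependence the naive structure fails to capture.
            rows.append((col, value, grp[target_col].mean() - grp["p_hat"].mean(), len(grp)))
    out = pd.DataFrame(rows, columns=["feature", "value", "bias", "n"])
    return out.sort_values("bias", key=abs, ascending=False)

# df = pd.read_csv("records.csv")
# print(subgroup_prediction_bias(df, ["age_band", "region", "education"], "outcome"))
```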

Applicants should have previous experience in causal inference and graphical models.

K‑2021‑03

Dice-Scanner: Group-level, Black-box Model Explainability

We will extract insights from the predictions of black-box predictive models with an emphasis on detecting patterns affecting a group of records. This is in contrast to existing methods, which typically focus on all records (global explainability) or individual records (local explainability). Group-level explainability balances the naïve assumptions of global methods with a broader representation of patterns than local methods provide.

DICE-Scanner was recently developed within IBM Research Africa; its goal is to identify a group of records such that the deltas of their Individual Conditional Expectation (ICE) plots all have a larger-than-expected value. DICE-Scanner has been/will be used to understand which mothers are more at risk of experiencing neonatal mortality in Sub-Saharan Africa. The goal of this project is to bring DICE-Scanner to a much larger machine learning audience by showing insights (with a large emphasis on intuitive graphics) across a large number of domains where predictive models have previously been trained. In addition to these insights, we will promote the importance of understanding black-box models at the level of groups of records.
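
For concreteness, the sketch below computes the quantity DICE-Scanner works with: for each record, the delta of its ICE curve for one feature, i.e. the change in the model's prediction as that feature is swept from a low to a high value with everything else held fixed. The feature name and the fitted `model` are placeholders.

```python
# Per-record ICE deltas for a single feature of a fitted probabilistic classifier.
import numpy as np
import pandas as pd

def ice_deltas(model, X: pd.DataFrame, feature: str, low=None, high=None):
    low = X[feature].quantile(0.05) if low is None else low
    high = X[feature].quantile(0.95) if high is None else high
    X_low, X_high = X.copy(), X.copy()
    X_low[feature] = low
    X_high[feature] = high
    # DICE-Scanner then searches for the subgroup of records whose deltas are
    # collectively larger than expected; here we only compute the deltas.
    return model.predict_proba(X_high)[:, 1] - model.predict_proba(X_low)[:, 1]

# deltas = ice_deltas(fitted_model, X, feature="maternal_age")
# suspicious = X[deltas > np.quantile(deltas, 0.9)]    # crude stand-in for the scan step
```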

Multiple follow-up projects are available including an option to extend LIME from local-only to a group level version. Furthermore, this may also have follow-up connections to causality.

A candidate with interest and experience in visualization is desired.

K‑2021‑04

Subset Boosting Machines

Automatic Stratification (name may change) is an optimization method that identifies a subset of data records (and their shared feature-values) that, as a group, have a higher-than-expected rate of outcomes as compared to the entire data set. By itself, AutoStrat can be used to help investigators explore their data for insights that are not readily available when looking only at averages of individual features.

This project proposes to use AutoStrat inside a boosting algorithm as an improved weak learner. Shallow decision trees are a typical example of a weak learner in a boosting context. We believe that the “splits” identified by AutoStrat will be more comprehensive and less greedy than the “splits” made by trees. This could lead to increased accuracy with fewer weak learners overall. In addition to smaller models, this could also aid in human interpretation of the resulting boosted model.
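
The sketch below shows the boosting loop into which an AutoStrat-style learner would be plugged: each round fits a weak learner to the current residuals and adds a shrunken copy to the ensemble. Here the weak learner is a depth-1 tree as a stand-in; the project would replace `make_weak_learner` with an AutoStrat-based learner.

```python
# Simple least-squares boosting with a pluggable weak learner.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def make_weak_learner():
    return DecisionTreeRegressor(max_depth=1)        # stand-in for an AutoStrat "split"

def boost(X, y, n_rounds=50, lr=0.1):
    pred = np.full(len(y), y.mean())
    ensemble = [("init", y.mean())]
    for _ in range(n_rounds):
        residuals = y - pred
        learner = make_weak_learner().fit(X, residuals)
        pred = pred + lr * learner.predict(X)
        ensemble.append(("stage", learner))
    return ensemble

def predict(ensemble, X, lr=0.1):
    pred = np.full(len(X), ensemble[0][1])
    for _, learner in ensemble[1:]:
        pred = pred + lr * learner.predict(X)
    return pred
```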

The candidate should have some experience with commonly used boosting methods.

K‑2021‑05

Automatically Identifying Heterogeneous Treatment Effects in RECOVERY: The World's Largest COVID-19 Randomized Controlled Trial

RECOVERY is the world's largest COVID-19 trial. IBM has worked in the past with the UK's National Health Service to leverage technology to accelerate findings for this trial without sacrificing transparency, patient involvement, or peer review. Our project would apply TESS (Treatment Effect Subset Scan) to this data in order to identify sub-populations of patients that have higher (or lower) responses to the interventions than their non-treated counterparts. Identifying and understanding these types of heterogeneous treatment effects is critical for a safer, more effective rollout of successful treatments to the patient sub-populations that benefited the most (or avoiding those that suffered side effects).

TESS is a disciplined form of sub-group analysis that scales well to large datasets and appropriately accounts for multiple-hypothesis testing that typically plagues aggressive sub-group analysis (aka “p-hacking”). IBM Research Africa specializes in these types of data-driven discovery methods that operate by efficiently scanning over the exponentially-many subsets of data records. These methods identify patterns that may not affect all patients and therefore may be too subtle when looking at global aggregates (or non-existent when looking at an individual patient).
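
To make the target quantity concrete, the sketch below computes, within each candidate subgroup, the difference in outcome rate between treated and control patients, compared with the overall treatment effect. Real TESS scans subsets efficiently and accounts for multiple testing; this exhaustive single-feature version and the column names are simplifying assumptions.

```python
# Naive per-subgroup treatment-effect estimates for a randomized trial.
import pandas as pd

def subgroup_effects(df, features, treat_col="treated", outcome_col="recovered"):
    overall = (df[df[treat_col] == 1][outcome_col].mean()
               - df[df[treat_col] == 0][outcome_col].mean())
    rows = []
    for col in features:
        for value, grp in df.groupby(col):
            effect = (grp[grp[treat_col] == 1][outcome_col].mean()
                      - grp[grp[treat_col] == 0][outcome_col].mean())
            rows.append((col, value, len(grp), effect, effect - overall))
    out = pd.DataFrame(rows, columns=["feature", "value", "n", "effect", "vs_overall"])
    return out.sort_values("vs_overall", ascending=False)

# df = pd.read_csv("trial_data.csv")
# print(subgroup_effects(df, ["age_band", "sex", "oxygen_at_baseline"]))
```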

This project would be coordinated over multiple IBM Research labs, scientists and interns.

K‑2021‑06

Deep Scanner: Masking Nodes for Higher Accuracy

An important part of many neural network architectures is a bias term that is subtracted from (or added to) all activations in a layer. In the ReLU activation function, this term can be thought of as a threshold: only activations higher than the bias term remain non-zero, while all other nodes have their activations set to 0 and do not propagate further through the network. This bias term is learned from training data.

This project hypothesizes that a bias term t (i.e. threshold such that only activations > t remain) may exist for each individual input (and each individual node!) at test time. We identify this term by scanning over the activations created by an input and identifying the subset of them that are “most anomalous” as compared to the distribution of activations generated by all inputs. Once identified, we then only allow activations from within this anomalous subset of nodes to propagate further through the network. The others are set to 0. This may also be viewed as placing a dynamic filter on certain layers of the network.
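
The sketch below illustrates the masking mechanism: a forward hook on one layer keeps only the activations that are unusually high relative to a reference distribution (collected from clean inputs) and zeroes the rest. The percentile rule is a crude stand-in for the subset-scanning criterion.

```python
# Per-input activation masking via a PyTorch forward hook.
import torch
import torch.nn as nn

class ActivationMasker:
    def __init__(self, reference_activations: torch.Tensor, percentile=0.9):
        # reference_activations: (n_inputs, n_nodes) collected from a clean held-out set
        self.threshold = torch.quantile(reference_activations, percentile, dim=0)

    def __call__(self, module, inputs, output):
        mask = (output > self.threshold).float()
        return output * mask              # returned value replaces the layer's output

model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 10))
reference = torch.relu(torch.randn(1000, 64))     # placeholder reference activations
hook = model[1].register_forward_hook(ActivationMasker(reference))

logits = model(torch.randn(5, 20))                # masking applied per input at test time
hook.remove()
```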

An appealing aspect of this project is the straightforward metric by which we will measure results: predictive accuracy. “Does masking some nodes in layer L of a deep neural network increase its predictive accuracy?”. Additional metrics to explore include training smaller networks with fewer data points and model explainability.

K‑2021‑07

Deep Scanner: Mahalanobis Noise Induced Activation Changes (MANIAC)

Deep Scanner typically identifies patterns in neural networks by identifying a subset of nodes that have higher-than-expected activations. MANIAC changes this formulation by scanning over the change in activation at each node when the input (image) had some form of noise added to it. Detecting patterns in neural network activations has previously been shown to detect adversarial attacks and detect generated content. We wish to extend these ideas to out-of-distribution detection more generally as well as implications for possible augmentations in self-supervised learning.

This is an ambitious project for a summer internship, but it builds on a larger body of existing work from IBM Research Africa including patents and a 2020 internship project that used a simpler noising algorithm: Fast Gradient Sign Method, FGSM.

FGSM noise targets the output layer of a network, which is very susceptible to overfitting. This means it is possible to change activations in the output layer with minimal changes to the activations in the internal layers. Mahalanobis noise instead targets the penultimate layer and is therefore claimed to make more substantive changes to the representation space than FGSM. This project will quantify the changes in the representation space induced by different noise models, with the goal of using these patterns of change to detect out-of-distribution samples (including the difficult case of samples from previously unseen classes). Follow-up projects include using these insights to better train models on unlabeled data (self-supervised learning).
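
The measurement MANIAC builds on can be sketched as follows: record a layer's activations for a clean input and for its noised version, then score the change against the distribution of changes seen on in-distribution data via the Mahalanobis distance. The feature extractor and noise model are placeholders here.

```python
# Scoring per-input activation changes with a Mahalanobis distance.
import numpy as np

def activation_change(extract_features, x_clean, x_noised):
    return extract_features(x_noised) - extract_features(x_clean)   # (n_nodes,)

def mahalanobis_scorer(reference_changes: np.ndarray):
    mu = reference_changes.mean(axis=0)
    cov = np.cov(reference_changes, rowvar=False)
    cov_inv = np.linalg.pinv(cov)                 # pseudo-inverse: robust to singular cov
    def score(delta):
        d = delta - mu
        return float(np.sqrt(d @ cov_inv @ d))
    return score

# reference_changes: activation changes for many in-distribution inputs
# score = mahalanobis_scorer(reference_changes)
# if score(activation_change(penultimate_layer, x, add_noise(x))) > threshold:
#     flag x as potentially out-of-distribution
```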

K‑2021‑08

TTX: Learning from Models Putting Theory into Practice

The current COVID-19 pandemic has provided a powerful demonstration of the value of data and insights to inform decisions made at both national and household levels. Many communities have been hit hard, and as decisions are made about how to “reopen” and approach the new normal, it is critical that decision makers have tools to help them envision the consequences of actions. In this internship we will evaluate a portfolio of models to assess their predictive power 7, 14, and 30 days into the future at various times since the start of the pandemic.

During the pandemic, data from many sources were made available to help develop contextual awareness about what was happening and about the impact of interventions. Building on these models and data, we will develop a distributed “table-top exercise” to assess an organization's readiness and flow of communication in a simulated pandemic. While outside the scope of the internship, the goal is that this exercise will be performed and feedback generated from actual participants. Towards that end, sponsor users will be solicited to be part of the design process. Optimization and learning will be critical elements of the activity, as they will be harnessed to support user understanding of what could be done and to inform what should be done.

This work will build upon:

  • A Platform for Disease Intervention Planning. Charles Wachira, Sekou L Remy, Oliver Bent, Nelson Bore, Samuel Osebe, Komminist Weldemariam and Aisha Walcott-Bryant. International Conference on Healthcare Informatics, 2020
  • A Global Health Gym Environment for RL Applications. Sekou L Remy, Oliver Bent. Proceedings of Machine Learning Research NeurIPS Competition & Demonstration Track Postproceedings, 2020

Preferably, applicants will have previous exposure to optimization or reinforcement learning techniques.

K‑2021‑09

Novelty Sampling from GANs in settings with large and small datasets

Two of the most commonly used approaches in Deep Neural Generative Models are GANs and VAEs; these methods model the distribution of known samples (in-distribution). These models discourage out-of-distribution generation to avoid issues regarding instability and minimize spurious sample generation, limiting their novelty potential. Therefore, we need new approaches to enhance the creative ability of current deep neural generative models.

We want to explore refining techniques from the sample translation space to be able to generate novel output from small datasets. Further, we want to explore subset scanning capabilities across the generator's activation and the output space to search for novel samples based on anomalous activations during generation time. Using the activation space will enable us to use any off-the-shelf GAN/VAE that can generate images, audio, text, or molecule structures (e.g., drug discovery applications). Furthermore, our approach doesn't require labeled "creative" samples or specific training to filter novel selections from the generation process.
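
As a sketch of the activation-space idea, the code below collects a generator layer's activations for a batch of ordinary generations as a reference, then scores new generations by how many of their node activations fall in the extreme tail of that reference. The generator, layer choice and threshold are illustrative assumptions.

```python
# Scoring generated samples by anomalous activations in a generator layer.
import torch
import torch.nn as nn

activations = {}
def capture(name):
    def hook(module, inputs, output):
        activations[name] = output.detach()
    return hook

generator = nn.Sequential(nn.Linear(64, 256), nn.ReLU(), nn.Linear(256, 784), nn.Tanh())
generator[1].register_forward_hook(capture("hidden"))

with torch.no_grad():
    generator(torch.randn(1000, 64))
    reference = activations["hidden"]                    # (1000, 256) ordinary generations
    hi = torch.quantile(reference, 0.99, dim=0)

    generator(torch.randn(32, 64))
    candidates = activations["hidden"]                   # (32, 256) new generations

# Novelty score: fraction of nodes whose activation lies beyond the reference tail.
novelty = (candidates > hi).float().mean(dim=1)
print(torch.topk(novelty, k=5).indices)                  # most "novel" generations
```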

We also want to explore visualization techniques to show how the subset of nodes interacts in the generation process and what subset distributions can be used to understand what makes a sample novel.

A candidate with interest and experience in visualization is desired.

References

[0] Das, P., Quanz, B., Chen, P.Y. and Ahn, J.W., Toward A Neuro-inspired Creative Decoder. IJCAI 2020.
[1] Cintas, C., Speakman, S., Akinwande, V., Ogallo, W., Weldemariam, K., Sridharan, S. and McFowland, E., Detecting Adversarial Attacks via Subset Scanning of Autoencoder Activations and Reconstruction Error. IJCAI 2020.
[2] Akinwande, V., Cintas, C., Speakman, S. and Sridharan, S.. Identifying Audio Adversarial Examples via Anomalous Pattern Detection. Adv ML workshop KDD.

K‑2021‑10

Cross-domain learning to address low-quality, small and multi-source challenges in tabular data

Data-driven insight extraction, e.g., using machine learning algorithms, often requires large amounts of data to train data-hungry, sophisticated models. However, the amount of data collected for some of the critical global challenges, e.g., child mortality, is hardly enough, and there are no ImageNet-like validation datasets. This is partly due to the amount of time and money required to collect such a dataset. Moreover, some events occur rarely, making it impossible to acquire more instances. This project aims to design novel techniques that utilize multiple tabular data sources to improve understanding of a particular problem. Integration of multiple sources could be applied early in the pipeline (e.g., at the data or feature level) or later, after modeling is applied per data source. Generation of samples and/or augmentation could also be employed to address small, low-quality data challenges. The project could benefit from our existing data-level linkage frameworks derived from our engagement with the Bill and Melinda Gates Foundation.
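
The two integration points mentioned above can be sketched as follows: early fusion (join/concatenate features before modelling) versus late fusion (average per-source model probabilities). The column names, join key, target and models are placeholder assumptions.

```python
# Early vs. late fusion of multiple tabular sources.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

def early_fusion(source_a: pd.DataFrame, source_b: pd.DataFrame,
                 key="household_id", target="under5_mortality"):
    # Join the sources on a shared key, then train one model on the combined features.
    merged = source_a.merge(source_b, on=key, how="inner")
    X = merged.drop(columns=[key, target])
    return RandomForestClassifier(n_estimators=200).fit(X, merged[target])

def late_fusion(models, X_per_source):
    # Train one model per source elsewhere, then average their predicted probabilities.
    probs = [m.predict_proba(X)[:, 1] for m, X in zip(models, X_per_source)]
    return np.mean(probs, axis=0)
```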

K‑2021‑11

Unsupervised learning of new class samples with limited or no forgetting of previous classes

The subset scanning techniques we have developed so far achieved encouraging performance in detecting anomalous (e.g., out-of-distribution and fake) samples across multiple modalities, such as images and audio. One potential extension of this work is to identify whether detected samples share peculiar characteristics and to statistically determine whether they could be treated as samples of a new class. Follow-up work could address how to integrate the learning of the newly identified class without significantly forgetting the previously trained classes. State-of-the-art methods for this problem focus on supervised strategies; this project could focus on an unsupervised strategy and its comparison with existing supervised methods. A use case could be skin disease classification, as there are challenges specific to this new-class detection problem.

K‑2021‑12

Automatic bias evaluation in academic and/or legal documents

Bias detection is gaining traction across different research domains and plays a critical role in developing fair machine learning algorithms. While most validations have been performed on curated datasets in different modalities (e.g., images and texts), there is little work on documents such as academic and legal documents. Thus, this project extends our existing work on quantifying the representation of skin tones (e.g., white vs. dark skin) in dermatology textbooks. To this end, these documents must first be parsed to segment the different entities in a document (e.g., text, tables, and images), which could use IBM's corpus conversion service. Representation evaluation could start with images, where skin images are filtered from other images in the documents. Segmentation of non-disease regions of the skin images is then applied, from which Fitzpatrick skin types are derived via individual typology angle (ITA) values. Preliminary results show poor segmentation results for dark skin images and for white skin images with black backgrounds. Thus, this project could aim at pre-processing, augmentation or post-segmentation techniques to address those challenges. The bias detection framework could also be extended to age and gender, beyond skin tones and dermatology.
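
To illustrate the ITA step, the sketch below converts a (non-disease) skin patch to CIELAB, computes the individual typology angle ITA = arctan((L* - 50)/b*) in degrees, and bins it into commonly used ITA skin-tone categories. The thresholds follow widely cited ITA cut-offs, and the input is assumed to be an RGB patch already segmented to skin only.

```python
# ITA computation and coarse skin-tone categorization from a skin-only RGB patch.
import numpy as np
from skimage.color import rgb2lab

def ita_degrees(skin_rgb_patch: np.ndarray) -> float:
    lab = rgb2lab(skin_rgb_patch)                 # expects float RGB in [0, 1]
    L, b = lab[..., 0].mean(), lab[..., 2].mean()
    return float(np.degrees(np.arctan2(L - 50.0, b)))

def ita_category(ita: float) -> str:
    bins = [(55, "very light"), (41, "light"), (28, "intermediate"),
            (10, "tan"), (-30, "brown")]
    for threshold, label in bins:
        if ita > threshold:
            return label
    return "dark"

# patch = skin_only_region(image)                 # output of the segmentation step
# ita = ita_degrees(patch)
# print(ita, ita_category(ita))
```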

K‑2021‑13

Economics of climate change

Many carbon pricing methods and policies exist. Given existing policies, how should companies operate so that they minimize their carbon costs without affecting their bottom line? Conversely, how should these ‘carbon tax’ policies be designed so that negative economic implications are minimized and companies feel encouraged to actively reduce their emissions?

K‑2021‑14

Supply chain resilience

How must companies operate their supply chains so that they reduce their emissions while also becoming resilient to weather and climate disruptions? Again, many resilience-building techniques exist (supplier selection, inventory redundancy, etc.). How can a unified framework be built that recommends the best ‘resilient’ policy?

K‑2021‑15

Lifecycle-aware supply chains

Supply chains typically focus on ‘here and now’ benefits to determine how to operate optimally. However, this perspective might change if we take a lifecycle perspective. For example, consider transportation of beef (A) vs. raw corn (B): if beef is transported from the processing unit to the store over only 10 km, and the same mass of raw corn is transported over 100 km, the ‘here and now’ outlook would rate B as more polluting than A. However, from a lifecycle perspective – beef being a very carbon-intensive product that has already accrued a large amount of emissions – A is actually far more polluting. What is the best way to introduce these ideas into the current supply chain paradigm?

K‑2021‑16

Emission implications of traditional vs. e-commerce

Several studies exist that compare the emissions of these two channels. However, they tend to be region specific, and their recommendations are limited to that extent. Is a unified approach possible that makes universal recommendations to help decide the best mode in which a company should operate?

Topics at the Europe Lab
in Zurich

Ref. code
Project description
Z-2021-01

Blockchain

Blockchain Core and Application Development: We are looking for highly motivated interns to join our advanced research and development activities in the area of Blockchain Security & Applications. Ideal candidates are familiar with blockchain, security, cryptography and distributed systems technology. Depending on their background, candidates may contribute to extensions of Hyperledger Fabric or work on blockchain applications and on extending trust to the physical world using the concept of crypto anchors.

Requirements

Candidates should have experience with DevOps and standard coding practices.

Z-2021-02

Automated Maintenance of a Security Vulnerability Database

New security vulnerabilities get published on a daily basis. To keep track of the vast amount of information, the Common Vulnerabilities and Exposures (CVE) reporting convention has been developed to assign unique CVE identifiers to newly published security vulnerabilities. CVE has become the standard for classifying vulnerabilities and is supported by the major software vendors.

There are organizations that maintain their own security vulnerability databases, optimized for their special needs. The benefits of maintaining their own databases are speed and completeness. New vulnerability documents can be added very quickly, and one does not have to wait until the vulnerability is assigned a CVE identifier. Furthermore, there are reports stating that a large number of vulnerabilities never get assigned a CVE identifier and hence do not show up in the official CVE database maintained by the MITRE organization.

While maintaining its own vulnerability database offers some obvious advantages, it also creates new challenges. An initial classification of vulnerability documents is needed that allows users to quickly find documents relevant to their environments. Additionally, after a while, CVE identifiers may become available for vulnerabilities that were previously added to the vulnerability database. This means that existing documents may have to be updated with the CVE identifier or be replaced with new documents that already include it. As a consequence, providing the most recent vulnerability information and keeping the database up to date become challenging tasks that currently involve considerable manual work.

The objective of the proposed project is to automate the vulnerability database maintenance tasks as much as possible. This will ensure the timely availability of new vulnerability documents while guaranteeing the proper management of previously added information.

Natural Language Processing technology will be used to analyze and classify documents. Similarity analysis techniques are expected to help identify documents that relate to the same vulnerability. The project will be performed on an existing security vulnerability database which will allow us to measure the impact this project will have on the quality and timeliness of the vulnerability data.
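
A minimal sketch of the similarity step: represent vulnerability documents as TF-IDF vectors and use cosine similarity to propose candidate matches between internal documents and newly published CVE entries. The example texts and the 0.6 threshold are placeholders.

```python
# TF-IDF cosine similarity between internal vulnerability documents and new CVE entries.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

internal_docs = [
    "Buffer overflow in the image parser allows remote code execution",
    "SQL injection in the login form of the admin console",
]
new_cve_entries = [
    "A SQL injection vulnerability in the administration login page allows attackers ...",
]

vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(internal_docs + new_cve_entries)
sims = cosine_similarity(X[len(internal_docs):], X[:len(internal_docs)])

for i, row in enumerate(sims):
    best = row.argmax()
    if row[best] > 0.6:
        print(f"CVE entry {i} likely matches internal document {best} (score {row[best]:.2f})")
```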

Requirements

  • Very good IT security and programming skills
  • Interest in Natural Language Processing (NLP)
  • Motivation to learn and apply new technology
Z-2021-03

Automating AI for Advanced Data-Driven Material Manufacturing

Nowadays, during materials manufacturing processes, a large amount of data is created from a collection of different sources, including processing conditions, quality checks, and measurements of properties. The information contained in the data is often not exploited to its full potential because important correlations are often hidden by the complexity. Machine learning algorithms help to extract knowledge out of complex data, leading to useful insights that would not be easily deduced otherwise. However, depending on the data structure, fine-tuning of the model parameters can be time consuming.

The goal of this project is to design a tool that automates the selection of the hyperparameters describing the machine-learning architecture, enabling faster fine-tuning and a better fit of the model to the data structure. The project will leverage assets built by other team members, which will need to be adapted for this goal.
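
One baseline way to automate this is randomized search with cross-validation, sketched below; the estimator, parameter ranges and data columns are placeholders for whatever models the team's assets use.

```python
# Randomized hyperparameter search as a simple automation baseline.
from scipy.stats import randint, uniform
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import RandomizedSearchCV

search = RandomizedSearchCV(
    GradientBoostingRegressor(),
    param_distributions={
        "n_estimators": randint(50, 500),
        "max_depth": randint(2, 8),
        "learning_rate": uniform(0.01, 0.3),
    },
    n_iter=50, cv=5, scoring="neg_mean_absolute_error", n_jobs=-1,
)
# search.fit(X_process_conditions, y_material_property)
# print(search.best_params_, -search.best_score_)
```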

Requirements

  • Strong background in machine learning and programming
  • Teamwork
  • Basic knowledge in material science
Z-2021-04

ML-enhanced search of transition states

The characterization of transition states is of considerable importance for understanding chemical reactivity. To determine them computationally, a prerequisite is the knowledge of the atom mapping - i.e., knowing where every precursor atom ends up in the products. Due to the combinatorial number of possible ways to map atoms, hardly any existing computational approach is able to determine reliable atom mappings except for small molecular systems, and this remains largely a manual task. It is only recently that machine-learning algorithms have been developed to tackle this task.

The goal of this project is to combine the recently-developed AI models with new approaches to automate the determination of transition states, starting from the mere specification of the chemical precursors and products.

Requirements

  • Strong background in programming
  • Good understanding of physics and chemistry
  • Good understanding of machine learning
Z-2021-05

Advanced Data-Driven Material Manufacturing

Nowadays, in the field of materials, a large amount of data is created from experiments and simulations. The information contained in the data is often not exploited to its full potential because important correlations are often hidden by the complexity. Machine learning algorithms help to extract knowledge out of complex data, leading to useful insights that would not be easily deduced otherwise.

The goal of the project is to explore routes to accelerate materials discovery via data-driven methods. In particular, as a prototypical case, the idea is to design a machine-learning-based tool able to predict properties of complex materials starting from data on simpler materials. The project will make use of public databases.

Requirements

  • Background in machine learning and programming
  • Knowledge in material science
Z-2021-06

Extracting chemical information from the chemical literature

Recently, we designed several machine-learning algorithms to predict the precursors or products of chemical reactions and suggest the necessary steps to execute reactions in the laboratory. The data necessary to train these models was extracted from millions of patents. Articles published in the chemical literature describe millions of additional chemical reactions. Accordingly, they have the potential to improve the performance of the algorithms. However, they are usually reported in a different format, which prevents the direct application of the tools designed to extract information from patents.

The goal of this project is to design new tools to extract chemical information from the text and images of articles published in the chemical literature.

Requirements

  • Strong background in natural language processing or computer vision
  • Basic chemical knowledge
Z-2021-07

AI for Civil Engineering Applications

Aging and deteriorating infrastructure (bridges, tunnels, etc.) is a struggle for companies around the world. With the cost of physical inspections and continued maintenance rising all the time, these companies need a better way to manage their current infrastructure. Indeed, roughly 50 billion dollars and two billion civil-engineering labor hours are spent annually monitoring bridges for defects. Asset managers need to identify elements to be repaired or replaced quickly, minimizing the lifetime cost of maintaining their asset portfolio, without any compromise on safety and regulations. However, correct risk assessment and prioritization become a challenge when inspecting a single bridge takes from days to months.

In this project the candidate will contribute to developing our portfolio of solutions based on machine learning and deep learning methods to accelerate the inspection of civil engineering infrastructure, including bridges, tunnels and buildings, from days to hours. This portfolio includes methods in computer vision (to detect defects, measure them, and assess risk), time series (to analyze sensor data and detect anomalies) as well as text (from maintenance report documents). The results of this work are intended to be integrated into IBM products, such as Maximo, and hence have high impact.

The candidate will work at the IBM Research – Zurich Laboratory, in the AI Automation group, having the opportunity to work in a unique corporate environment, acquire experience in several areas, publish in top international conferences, learn how to patent innovative ideas, as well as deal with clients on real business cases. Our group consists of a highly motivated team of researchers that is willing to lead and help the candidate to successfully complete the challenges of the proposed task. We provide an HPC and Cloud infrastructure equipped with recent variants of GPUs. Developing, maintaining, and optimizing code for scalability becomes a real challenge!

Requirements

  • 3+ years of proven programming experience in C/C++ and/or Python;
  • Practical experience with Machine Learning and/or Deep Learning frameworks (e.g., scikit-learn, TensorFlow, PyTorch) is a plus;
  • Proficient in UNIX/Linux;
  • Outstanding university track record, with background in Computing, Machine Learning, Mathematics, Statistics, or equivalent fields;
  • Ability to speak and write in English fluently;
  • Self-motivated with passion for technology and innovation.
Z-2021-08

Development of Interpretable AI Methods for Computational Biology

Recent advances in deep learning have pushed the boundaries of feasible solutions to large-scale real-world problems such as image classification, object detection, natural language processing, and video classification. The availability of large amounts of annotated data and high-performance computing infrastructure are key factors that fuel this trend. However, there are still common challenges if the latest approaches are to be turned into solutions for clients. Data capturing, annotation, and quality management become essential before applying deep learning techniques.

In this project, we propose to develop an end-to-end prototype of a deep learning system that can look at time series of images in order to classify events or to visualize current evolution and predict future progress. Additionally, the temporal sparsity of the system shall be explored, such that only a few input images are needed to perform predictions.

The candidate will work at the IBM Research – Zurich Laboratory, in the AI Automation group, having the opportunity to work in a unique corporate environment, acquire experience in several areas, publish in top international conferences, learn how to patent innovative ideas, as well as deal with clients on real business cases. Our group consists of a highly motivated team of researchers that is willing to lead and help the candidate to successfully complete the challenges of the proposed task. We provide an HPC and Cloud infrastructure equipped with recent variants of GPUs. Developing, maintaining, and optimizing code for scalability becomes a real challenge!

Requirements

  • 3+ years of proven programming experience in C/C++ and/or Python;
  • Practical experience with Machine Learning and/or Deep Learning frameworks (e.g., scikit-learn, TensorFlow, PyTorch) is a plus;
  • Proficient in UNIX/Linux;
  • Outstanding university track record, with background in Computing, Machine Learning, Mathematics, Statistics, or equivalent fields;
  • Ability to speak and write in English fluently;
  • Self-motivated with passion for technology and innovation.
Z-2021-09

Record Linkage

Record linkage is an essential part of nearly all real-world systems that consume structured and unstructured data coming from different sources, because it allows data to be combined that do not share a common key for connecting the records. We have developed a system that links records to records in a reference database containing more than 150 million records. We use locality-sensitive hashing to identify a set of candidates and subsequently apply a scoring function to return the best matches. The system works in near real time and is being used in production. During development, test cases are available to verify that individual components work correctly.
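
For intuition, the sketch below shows a candidate-generation stage of this kind using MinHash-based locality-sensitive hashing over character trigrams; the datasketch library is used here as an assumed stand-in for the production implementation, and a scoring function would then rank the returned candidates.

```python
# MinHash LSH over character trigrams for record-linkage candidate generation.
from datasketch import MinHash, MinHashLSH

def trigrams(text: str):
    text = text.lower()
    return {text[i:i + 3] for i in range(len(text) - 2)}

def minhash(text: str, num_perm=128) -> MinHash:
    m = MinHash(num_perm=num_perm)
    for g in trigrams(text):
        m.update(g.encode("utf8"))
    return m

reference = {"r1": "International Business Machines Corporation, Armonk NY",
             "r2": "Acme Widgets Limited, Nairobi"}

lsh = MinHashLSH(threshold=0.2, num_perm=128)
for key, value in reference.items():
    lsh.insert(key, minhash(value))

candidates = lsh.query(minhash("International Business Machines Corp, Armonk New York"))
print(candidates)                      # keys of likely matches, to be re-scored
```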

The objective of this project is to develop a monitoring system and to optimize the current system. The monitoring system should be able to identify whether a newly developed version performs as efficiently as the previous version, and to detect discrepancies in the results between the two versions, which, depending on the update, may or may not be desired. Based on the monitoring system, the algorithms should then be optimized. Multiple avenues for optimization are possible:

  • replace the in-memory database with a disk-based representation thereof to reduce memory requirements
  • together with the team, replace the matching algorithm with an algorithm that uses machine learning
The candidate will work at the IBM Research – Zurich Laboratory, in the AI Automation group, having the opportunity to work in a unique corporate environment, acquire experience in several areas, publish in top international conferences, learn how to patent innovative ideas, as well as deal with clients on real business cases. Our group consists of a highly motivated team of researchers that is willing to lead and help the candidate to successfully complete the challenges of the proposed task. We provide an HPC and Cloud infrastructure equipped with recent variants of GPUs. Developing, maintaining, and optimizing code for scalability becomes a real challenge!

Requirements

  • Strong background in computer science, distributed computing, and some knowledge in machine learning
  • C++, Python
  • Experience with cloud infrastructures such as Docker and possibly Kubernetes
  • Interest in solving complex problems and motivation to work on a production system
  • Excellent verbal and written English skills
Z-2021-10

Explainability of Graph Convolutional Networks and Temporal Evolution of Graph Convolutional Network Models

Knowledge graph embedding methods learn embeddings of entities and relations which can be used for downstream machine learning tasks such as link prediction and entity matching. Various Graph Convolutional Network methods have been proposed which use different types of information to learn the features of entities and relations. We have developed an attention-aware relational Graph Neural Network that predicts missing links in a graph. The data sources for our models are enterprise knowledge graphs built from large and rich datasets obtained from various business information systems.

The link prediction information inferred from our Graph Neural Network is used as part of a recommender system to suggest work items to different teams. However, the output from the neural network does not directly indicate why the missing link, and thus the work item, was suggested. This information would be important for multiple reasons: i) to identify how to approach the task, and ii) to demonstrate that the system did not come up with a random suggestion.

Multiple approaches are possible, ranging from perturbation-based methods and Deep Taylor Decomposition to example-based explanations. Essentially, the goal is to identify the relevant information context in the graph, and possibly in the structure of the neural network, that allows human-understandable concepts to be derived and used as part of the explanation.
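
As a sketch of the perturbation-based direction, the code below estimates how much each edge in the local neighbourhood contributes to a predicted link by removing it and measuring the drop in the model's link score. `score_link` stands in for the trained GNN's scoring function and is a placeholder.

```python
# Edge-removal perturbation importance for an explained link prediction.
import networkx as nx

def edge_importances(graph: nx.Graph, score_link, u, v, hops=2):
    base = score_link(graph, u, v)
    neighbourhood = set(nx.ego_graph(graph, u, radius=hops).edges()) | \
                    set(nx.ego_graph(graph, v, radius=hops).edges())
    importances = {}
    for edge in neighbourhood:
        perturbed = graph.copy()
        perturbed.remove_edge(*edge)
        importances[edge] = base - score_link(perturbed, u, v)   # score drop = importance
    return sorted(importances.items(), key=lambda kv: kv[1], reverse=True)

# top_edges = edge_importances(enterprise_kg, gnn_score, "team_42", "work_item_917")
```
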
We also have an older, rule-based implementation of the recommender system that can be leveraged. Based on the recommendations made by that rule-based system, we may use these rules to explain some of the recommendations made by the new Graph Neural Network-based recommender system. Although probably not all recommendations will be explainable by the component to be developed, we expect that at least a significant part will be.

Another aspect is the ingestion of new data and thus the incremental retraining of the graph (based on temporal events). Some prior work has been done in this domain, e.g. EvolveGCN, which uses Recurrent Neural Networks to generate a model that evolves the Graph Convolutional Network parameters. Challenges here are how to model the different temporal states and features of the nodes and edges and how to derive a model that reflects these dynamic characteristics so that it can be used to update the parameters of the Graph Convolutional Network.
The candidate will work at the IBM Research – Zurich Laboratory, in the AI Automation group, having the opportunity to work in a unique corporate environment, acquire experience in several areas, publish in top international conferences, learn how to patent innovative ideas, as well as deal with clients on real business cases. Our group consists of a highly motivated team of researchers that is willing to lead and help the candidate to successfully complete the challenges of the proposed task. We provide an HPC and Cloud infrastructure equipped with recent variants of GPUs. Developing, maintaining, and optimizing code for scalability becomes a real challenge!

Requirements

  • Strong background in computer science and in particular machine learning/deep neural networks
  • Experience with Python and machine learning packages such as PyTorch, Tensorflow, Transformers
  • Interest in solving complex problems and motivation to produce a deployable solution
  • Ability to work proactively and independently
  • Excellent verbal and written English skills
Z-2021-11

Toward ML-based database tuning and optimization

Databases represent the foundation of virtually any large-scale online service. Their performance is, therefore, crucial for the overall performance of the applications that they serve. Unfortunately, getting the best performance out of a database for an application is far from a trivial task. A prominent source of complexity lies in the fact that modern databases expose a plethora of tuning knobs, whose proper setting is fundamental to achieving good performance. Identifying the correct setting for these knobs, however, is a daunting task, typically tackled with a trial-and-error approach that often yields highly sub-optimal configurations. The complexity of the problem is exacerbated in a consolidated cloud environment, where performance also depends on workload collocation, and workloads can vary dramatically over time.

The goal of this project is to advance the state of the art in autonomous database systems by designing and building an ML-based system that self-tunes a database to deliver the best performance. The specific use case for the project is FoundationDB, an open-source distributed database that powers some business-critical IBM Cloud database services. The project has a strong research component, since it poses challenges that have not been addressed by existing approaches. First, FoundationDB exposes hundreds of tuning knobs, which lead to a very large configuration space. Second, the target workload varies in time, both in intensity and in transactional mix. Third, the proposed approach must be robust, i.e., it must provide predictable performance, since the long-term goal is to apply it to the IBM Cloud production environment. To tackle these challenges, we use an ML-based approach.
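
The outer tuning loop can be sketched as follows: sample knob configurations, apply them, run a benchmark workload, and keep the best-performing configuration. `apply_config` and `run_benchmark` are placeholders for FoundationDB-specific plumbing; a real system would replace random sampling with a Bayesian or RL-based optimizer and account for workload drift.

```python
# Random-search baseline for database knob tuning.
import random

KNOB_SPACE = {                                # illustrative knob names and values
    "cache_memory_mb": [512, 1024, 2048, 4096],
    "proxy_count": [2, 4, 8],
    "log_server_count": [2, 4, 8],
}

def sample_config():
    return {knob: random.choice(values) for knob, values in KNOB_SPACE.items()}

def tune(apply_config, run_benchmark, budget=30):
    best_config, best_throughput = None, float("-inf")
    for _ in range(budget):
        config = sample_config()
        apply_config(config)                  # push the knob settings to the database
        throughput = run_benchmark()          # e.g., transactions/s under the target workload
        if throughput > best_throughput:
            best_config, best_throughput = config, throughput
    return best_config, best_throughput
```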

We are inviting applications from students to conduct an internship project at the IBM Research lab in Zurich on this exciting topic. The research focus will be on advancing the state-of-the-art in AI-based database tuning and optimization. It also involves interactions with several researchers focusing on various aspects of the project and with the IBM Cloud data services team. The ideal candidate should have experience in distributed systems, databases, and Machine Learning, and have strong programming skills (C++, Python). Hands-on experience with distributed database systems or ML frameworks is a bonus but not necessary.

Z-2021-12

Elastic Ephemeral Storage Services for Serverless Computing

Serverless computing is a cloud-computing execution model in which the cloud provider dynamically manages the allocation of machine resources. As a cloud service, it is becoming increasingly popular due to its high elasticity and fine-grain billing. Serverless platforms like AWS Lambda, Google Cloud Functions, IBM Cloud Functions, or Azure Functions enable users to quickly launch thousands of light-weight tasks (as opposed to entire virtual machines), while automatically scaling compute, storage and memory according to application demands at millisecond granularity. While serverless platforms were originally developed for web microservices, their elasticity advantages in particular make them appealing for a wider range of applications such as interactive analytics and machine learning.

To enable Serverless as the new paradigm to efficiently serve any type of workload, including complex multi-stage computations, the efficient handling of intermediate (ephemeral) data becomes key. Current solutions rely on key-value stores like Redis, which are typically unable to auto-scale their resource consumption. To overcome this limitation, we aim at extending a given Serverless framework (Knative) with a highly elastic, high performance data store for ephemeral data. Choosing the Apache Crail data store, we work on adding resource elasticity capabilities and on integrating it with Knative.

Our Apache Crail based prototype already supports dynamically adding and removing storage resources according to a serverless application's current demand. To allow scaling, it currently implements its own proprietary resource scaler, which decides whether to add or remove Crail data nodes. As such, in a Knative environment, Apache Crail starts and terminates Kubernetes pods accordingly. In the project proposed here, we would like to explore other techniques for such autoscaling. One possible way is to run Apache Crail directly as a Knative service and define CRDs to monitor memory consumption, letting the autoscaler add and remove data nodes based on these CRDs.

The research focus will be on exploring techniques for efficient autoscaling of ephemeral data store services in a serverless environment. It also involves interactions with several researchers focusing on various aspects of the project. The ideal candidate should be well versed in distributed systems, and have strong programming skills (Java, C++, Python). Hands-on experience with distributed container orchestration systems (Kubernetes) and serverless environments (KNative) would be desirable.

Z-2021-13

Artificial General Intelligence: Lifelong Learning Challenge

We are moving toward a new paradigm: that of general artificial intelligence. Here, we mainly focus on lifelong continuous learning and analogical reasoning. Traditional neural networks require enormous amounts of data to build their complex mappings during a slow training procedure, which hinders their ability to relearn and adapt to new data. Memory-augmented neural networks (MANNs) enhance neural networks with an external and explicit memory to overcome these issues. Access to this external memory, however, occurs via soft read and write operations involving every individual memory entry, resulting in a bottleneck when implemented on the conventional von Neumann computer architecture. To overcome this bottleneck, a promising solution is to employ a computational memory unit as the external memory, performing analog in-memory computation. However, a key challenge associated with in-memory computing is the low computational precision resulting from intrinsic randomness and device variability, which can be addressed by leveraging robust representations and transparent manipulations such as those used in the hyperdimensional computing paradigm. In such a hybrid machine learning model, there are several challenges that need to be overcome at both the algorithmic and hardware levels to realize lifelong continuous learning engines. This mainly includes exploring and developing efficient methods for compressing external memory contents, fast retrieval, and in-memory computation (comparison, decomposition, or reasoning).
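
The hyperdimensional-computing operations referred to above can be sketched with bipolar vectors: binding (element-wise multiply), bundling (majority sum) and nearest-neighbour retrieval from an associative "cleanup" memory. Real MANN and in-memory-computing work would map these operations onto analog hardware; this is a software-only illustration.

```python
# Bind, bundle and cleanup with 10,000-dimensional bipolar hypervectors.
import numpy as np

D = 10_000
rng = np.random.default_rng(0)

def random_hv():
    return rng.choice([-1, 1], size=D)

def bind(a, b):                     # associates two hypervectors (self-inverse)
    return a * b

def bundle(vectors):                # superposes several hypervectors
    return np.sign(np.sum(vectors, axis=0))

def nearest(memory: dict, query):   # cleanup memory: most similar stored vector
    return max(memory, key=lambda k: int(memory[k] @ query))

# Encode key-value pairs into a single composite vector, then query it.
keys = {name: random_hv() for name in ["colour", "shape"]}
values = {name: random_hv() for name in ["red", "round"]}
record = bundle([bind(keys["colour"], values["red"]),
                 bind(keys["shape"], values["round"])])

unbound = bind(record, keys["colour"])          # approximately recovers the bound value
print(nearest(values, unbound))                 # -> "red"
```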

Z-2021-14

In-memory computing for deep learning inference

Computational memory is very appealing for making energy-efficient deep learning inference hardware, where the neural network layers would be encoded in crossbar arrays of memory devices.

We are inviting applications from students to conduct their Master thesis work or an internship project at IBM Research – Zurich on this exciting new topic. The work performed could span low-level hardware experiments on phase-change memory chips comprising more than 1 million devices to high-level algorithmic development in a deep learning framework such as TensorFlow or PyTorch. It also involves interactions with several researchers across IBM research focusing on various aspects of the project.

The ideal candidate should have a multi-disciplinary background, strong mathematical aptitude and programming skills. Prior hands-on experience with implementation in deep learning frameworks is recommended.

Z-2021-15

Deep learning incorporating biologically-inspired neural dynamics

Neural networks are the key technology of artificial intelligence and have led to breakthroughs in many important applications. These were achieved primarily by artificial neural networks (ANNs) that are loosely inspired by the structure of the brain, comprising neurons interconnected by synapses. Meanwhile, the neuroscientific community has developed the Spiking Neural Network (SNN) model, which additionally incorporates biologically realistic temporal dynamics in the neuron structure. Although ANNs achieve impressive results, there is a significant gap in terms of power efficiency and learning capabilities between deep ANNs and biological brains. One promising avenue to reduce this gap is to incorporate biologically inspired dynamics and synaptic plasticity mechanisms into common deep-learning architectures. Recently, the IBM team demonstrated a new type of ANN unit, called the Spiking Neural Unit (SNU), that enables the SNN dynamics to be incorporated directly into deep ANNs. Our results demonstrate competitive performance, surpassing state-of-the-art RNN-, LSTM- and GRU-based networks.
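
To convey the flavour of such units, here is a minimal spiking-style recurrent cell with leaky integrate-and-fire dynamics, in the spirit of (but simplified from) the SNU described above: the membrane state accumulates input, decays, is reset by the previous output spike, and emits a spike when it crosses a threshold.

```python
# Simplified leaky integrate-and-fire recurrent unit (illustrative, not the exact SNU).
import torch
import torch.nn as nn

class SimpleSpikingUnit(nn.Module):
    def __init__(self, in_features, units, decay=0.8, threshold=1.0):
        super().__init__()
        self.w = nn.Linear(in_features, units, bias=False)
        self.decay, self.threshold = decay, threshold

    def forward(self, x_seq):                        # x_seq: (time, batch, in_features)
        t_steps, batch, _ = x_seq.shape
        s = torch.zeros(batch, self.w.out_features)  # membrane potential
        y = torch.zeros_like(s)                      # previous output spikes
        outputs = []
        for t in range(t_steps):
            s = self.w(x_seq[t]) + self.decay * s * (1.0 - y)   # reset where a spike fired
            y = (s > self.threshold).float()                    # hard step; training would
            outputs.append(y)                                   # need surrogate gradients
        return torch.stack(outputs)

unit = SimpleSpikingUnit(in_features=16, units=32)
spikes = unit(torch.randn(50, 4, 16))
print(spikes.shape)                                  # torch.Size([50, 4, 32])
```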

Furthermore, in another recent work on Online Spatio-Temporal Learning (OSTL), we provide a learning framework based on biological insights. OSTL provides an alternative to backpropagation through time (BPTT), enabling a new, efficient approach to deep learning of temporal data without BPTT's requirement for unrolling through time.

In this project, we aim to investigate on-line learning approaches in conjunction with biologically-realistic dynamics in deep networks. Specifically, the focus will be on incorporating SNUs into large-scale deep ANNs for applications such as speech recognition, image understanding or text processing. The main task will be to explore further online learning algorithms. These developments will allow us to assess the impact of biologically realistic aspects on important AI tasks, and indicate how to close the gap between deep learning and biological brains. The IBM team will provide extensive scientific guidance and access to a powerful GPU cluster.

Requirements

  • Experience with TensorFlow or PyTorch machine-learning framework
  • Strong programming skills in Python
  • Strong analytical and problem-solving skills
  • Excellent communication and team skills