Background: Internal displacement and the Sustainable Development Goals (SDGs)
In 2015, UN Member States unanimously adopted the Sustainable Development Goals (SDGs) and, in doing so, pledged that “no one will be left behind.” The more than 40 million men, women, and children displaced within their countries of residence as a result of conflict, disasters, development projects and other causes are among those most likely to be excluded from social and economic opportunities for development. Many face increased vulnerability to further cycles of displacement when durable solutions that reduce the risks they face are not found.
Displacement is commonly addressed as a humanitarian problem, but it is also a sustainable development challenge. It is closely associated with poverty, inequality, insecurity, environmental degradation, exposure to hazards and the vulnerability of populations whose governments are unable or unwilling to protect them. In fact, it is often both a cause and a consequence of such issues. Livelihoods, economic activity, and capacities that strengthen communities’ resilience are seriously compromised when people are forced to flee their homes as a result of a crisis.
Internal displacement poses a large – and growing – problem with more people displaced today than in years past by conflict, violence, and disasters. It will be difficult for the Member States to make progress on the SDGs without addressing the associated challenge of internal displacement. One factor making this challenge more difficult is the incomplete picture of internal displacement due to the inability to identify and account for all displacement events around the world in a systematic manner. This picture of internal displacement can come into clearer focus by leveraging data science, and by using techniques that have already proven effective in addressing similar challenges, such as disease detection and surveillance.
We challenge you to:
Create a tool that will be used to monitor internal displacement resulting from natural-hazard-induced disasters, armed conflicts, generalized violence and development projects. The tool will make the monitoring of internal displacement more efficient and comprehensive. It will also provide the humanitarian community with an easy way to extract and analyze facts from any type of document (news, field reports, social media and any other relevant source).
1. Filtering and tagging
The filtering step should be a binary classification of the URLs contained in the input dataset. It should exclude:
‣ documents not in English
‣ broken URLs
‣ documents not reporting on human mobility (see example below).
Only the information tagged as “Relevant” should be retained for the next analysis steps.
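The three exclusion rules above can be sketched as a single predicate. This is a minimal illustration only: the keyword and stopword lists, the `min_ratio` threshold and the function names are all hypothetical, and a real filter would more likely be learned from the training dataset than hand-written.

```python
import re

# Hypothetical keyword and stopword lists, for illustration only.
MOBILITY_TERMS = {"displaced", "evacuated", "fled", "homeless", "relocated"}
ENGLISH_STOPWORDS = {"the", "and", "of", "to", "in", "a", "is", "that", "for", "were"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def looks_english(tokens, min_ratio=0.1):
    """Crude language check: share of tokens that are common English stopwords."""
    if not tokens:
        return False
    hits = sum(1 for token in tokens if token in ENGLISH_STOPWORDS)
    return hits / len(tokens) >= min_ratio


def is_relevant(text):
    """Return True only if the document passes all three exclusion rules.

    `text` is the fetched document body, or None when the URL was broken.
    """
    if text is None:  # broken URL
        return False
    tokens = tokenize(text)
    if not looks_english(tokens):  # document not in English
        return False
    # Document must mention at least one human-mobility term.
    return any(token in MOBILITY_TERMS for token in tokens)
```

Fetching the URL is deliberately left outside the function so that the filtering logic can be tested without network access.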
Relevant documents should be classified into three categories representing different triggers of displacement:
‣ “Disasters”,
‣ “Conflict and violence”,
‣ “Other”.
The tagging should be based on the training dataset provided. The training dataset is extracted from the Global Internal Displacement Database. It consists of a list of URLs to documents (mainly web pages and pdf documents) already tagged as either “Disasters” or “Conflict and violence” by IDMC’s team of monitoring experts. “Other” should include all the documents not tagged as either “Disasters” or “Conflict and violence” but containing relevant information. Tags are not mutually exclusive as the same document can report on multiple triggers of displacement.
Once integrated into the IDMC’s information system, the tool should be able to learn from new documents as they are added to the Global Internal Displacement Database (through online learning or by updating the training dataset).
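Because tags are not mutually exclusive, one possible design is a separate binary decision per trigger, with “Other” as the fallback for relevant documents that match neither. The sketch below shows only that final resolution step; the `predicted` dictionary and the function name are assumptions, not part of the challenge specification.

```python
TRIGGER_TAGS = ("Disasters", "Conflict and violence")


def resolve_tags(predicted):
    """Map per-trigger predictions to the final tag list for a relevant document.

    `predicted` is a dict such as {"Disasters": True, "Conflict and violence": False},
    for example the output of one binary classifier per trigger (an assumption).
    """
    tags = [tag for tag in TRIGGER_TAGS if predicted.get(tag, False)]
    # Relevant documents matching neither trigger fall into "Other".
    return tags or ["Other"]
```

A document reporting on both a flood and an armed clash would therefore carry both tags, while a relevant report on a development project would be tagged “Other”.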
2. Natural Language Processing analysis
Using Natural Language Processing algorithms, the #IDETECT tool should automatically extract “facts” from the documents. A fact is a displacement figure reported in the document. Each fact should include:
‣ The date of publication of the document the fact is extracted from;
‣ The location where displacement happened;
‣ The reporting term used in the document (see table below);
‣ The reporting unit. The different reporting units used to identify displaced populations should be grouped into two main reporting units: people and households (see table below); and
‣ The displacement figure (i.e. the number of people/households reported displaced).
Note: Multiple facts can originate from the same document.
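The structure of a fact can be sketched as a small record type, together with a deliberately naive pattern-based extractor. The unit and term vocabularies below are hypothetical placeholders for the challenge's own tables, and the location is passed in rather than detected, since place-name recognition is a separate problem. Treat this as a baseline illustration, not as the expected method.

```python
import re
from dataclasses import dataclass
from datetime import date


@dataclass
class Fact:
    publication_date: date
    location: str
    reporting_term: str
    reporting_unit: str  # "people" or "households"
    figure: int


# Hypothetical vocabularies; the challenge's own tables define the real ones.
UNIT_WORDS = {"people": "people", "persons": "people",
              "families": "households", "households": "households"}
TERMS = ("displaced", "evacuated", "homeless")

# Matches "<number> <unit> [up to three words] <term>".
PATTERN = re.compile(
    r"(?P<figure>\d[\d,]*)\s+(?P<unit>%s)\s+(?:\w+\s+){0,3}?(?P<term>%s)"
    % ("|".join(UNIT_WORDS), "|".join(TERMS)),
    re.IGNORECASE,
)


def extract_facts(text, publication_date, location):
    """Return one Fact per pattern match; a document can yield several."""
    facts = []
    for match in PATTERN.finditer(text):
        facts.append(Fact(
            publication_date=publication_date,
            location=location,
            reporting_term=match.group("term").lower(),
            reporting_unit=UNIT_WORDS[match.group("unit").lower()],
            figure=int(match.group("figure").replace(",", "")),
        ))
    return facts


facts = extract_facts(
    "About 5,000 people were displaced and 300 families left homeless.",
    date(2017, 1, 15),
    "PH",
)
```

A pattern like this misses figures written out in words ("five thousand") and cannot tell which location a figure belongs to when a document mentions several, which is where a fuller NLP pipeline would be needed.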
3. Visualization and quantitative analysis of facts
The tool should provide a platform to visualize and analyze facts:
Visualization: The visualization tool should allow analysts to dive into the data, explore the facts extracted using NLP and uncover new knowledge on internal displacement. The visualization tool can include:
‣ An interactive map to help the humanitarian community easily identify displacement “hotspots”. The map should ideally be browsable and should allow information analysts to explore trends as a function of time.
‣ A histogram to analyze trends in a selected region and time range.
In order to make sure that “no one will be left behind”, the #IDETECT tool should provide a platform to analyze, compare and explore the displacement figures contained in the facts. Analysts should be able to select which facts and displacement figures are visualized based on location and time. This quantitative analysis tool will allow information analysts to go from the number of facts to the displacement figures in the facts.
From the displacement figure analysts should be able to go back to the document (URL) reporting that displacement figure and possibly visualize the excerpts of the documents where the information was reported.
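The selection by location and time described above, and the histogram of trends, both reduce to filtering facts and aggregating their figures. A minimal sketch follows, assuming facts are simplified to `(publication_date, country_code, figure)` tuples; the function name and the monthly bucketing are illustrative choices only.

```python
from collections import Counter
from datetime import date


def monthly_totals(facts, country, start, end):
    """Sum displacement figures per month for one country and date range.

    `facts` is an iterable of (publication_date, country_code, figure) tuples,
    a simplified stand-in for the full fact record.
    """
    totals = Counter()
    for published, code, figure in facts:
        if code == country and start <= published <= end:
            totals[published.strftime("%Y-%m")] += figure
    return dict(sorted(totals.items()))


sample_facts = [
    (date(2017, 1, 5), "PH", 5000),
    (date(2017, 1, 20), "PH", 300),
    (date(2017, 2, 1), "PH", 100),
    (date(2017, 1, 7), "NG", 900),
    (date(2016, 12, 31), "PH", 50),
]
totals = monthly_totals(sample_facts, "PH", date(2017, 1, 1), date(2017, 12, 31))
```

Summing figures across documents can double-count an event reported by several sources, so totals of this kind are better read as an indicator of reporting volume than as a displacement estimate.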
Download the graphical representation of the #IDETECT challenge workflow here.
‣ A web link (URL) to a working (live) demo (suggested: GitHub Pages or BitBucket Pages, or similar).
‣ A repository of the original open source code, data files, and other electronic files (include GNU license). This package can be hosted in a public repository and should allow IDMC or UN Agencies to run the tool on local servers. Only original, open source work will be accepted. It is acceptable that your solution uses other existing open source libraries.
‣ A .csv or .xls answer file containing the analysis of the test dataset. One week before the submission deadline, a test dataset will be uploaded to the Unite Ideas web page. It will consist of a list of URLs without tags. The answer file should contain the result of the analysis using the same algorithm developed for the challenge. This file will be used to evaluate the performance of the algorithm. The answers will be:
● the output of the filtering (i.e. whether the document contains relevant information or not)
● the tag(s) assigned to the document.
● the facts extracted from the document, in particular:
◦ the displacement figure;
◦ the reporting unit (i.e. “people” or “households”)
◦ the location (at the country level using the Country Codes - ISO 3166)
‣ A brief document describing the functionalities, such as a user guide.
‣ A document describing the steps to maintain and update the tool with further features, such as an admin guide.
‣ The tool will be integrated into the information system of IDMC and should send data validated by the information analysts to the backend of the Global Internal Displacement Database. No restrictions are imposed on the technology used to tackle this challenge. However, teams should keep in mind that:
● the tool should be independent and run as an external service
● it should send data in a standard format (.json, .csv or others) with all the facts extracted from the documents selected by the information analysts
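The answer file and the standard-format export described in the list above could share one record layout. The sketch below writes the same rows as .csv and as .json. The column names are hypothetical, since the challenge does not prescribe a header, and the two-letter country code is an assumption (the brief only says ISO 3166).

```python
import csv
import io
import json

# Hypothetical column names; the challenge does not prescribe a header.
COLUMNS = ["url", "relevant", "tags", "figure", "reporting_unit", "country_code"]


def to_csv(rows):
    """Render one CSV line per fact, or per document when no fact was found."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS, lineterminator="\n")
    writer.writeheader()
    for row in rows:
        record = dict(row)
        # Tags are not mutually exclusive, so join them into a single cell.
        record["tags"] = ";".join(row.get("tags", []))
        writer.writerow(record)  # missing columns are written as empty cells
    return buffer.getvalue()


def to_json(rows):
    """Render the same rows as a JSON array."""
    return json.dumps(list(rows), indent=2)


rows = [
    {"url": "http://example.org/a", "relevant": True,
     "tags": ["Disasters", "Conflict and violence"],
     "figure": 5000, "reporting_unit": "people", "country_code": "PH"},
    {"url": "http://example.org/b", "relevant": False},
]
answer_csv = to_csv(rows)
answer_json = to_json(rows)
```

A document that yields several facts would appear on several rows with the same URL, which keeps the file flat enough to open in a spreadsheet.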
1. Input dataset (~79 MB)
The input dataset contains a list of URLs (~600,000 articles) in English extracted from the GDELT GKG database and should be used as the input for the analysis. It consists of a .csv file with three fields:
‣ “GKGRECORDID”, a unique identifier of the document,
‣ “DATE”, the publication date-time of the document,
‣ “DocumentIdentifier”, a fully-qualified URL that can be used to access the document on the web.
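Reading the three fields listed above might look like the following sketch. The field names come from the dataset description; the timestamp format is an assumption (GDELT GKG 2.x style, `YYYYMMDDHHMMSS`) and should be checked against the actual file, and the sample row is invented for illustration.

```python
import csv
import io
from datetime import datetime


def read_input(handle, date_format="%Y%m%d%H%M%S"):
    """Yield (record_id, published, url) for each row of the input .csv.

    The date format is an assumption; adjust it to match the actual file.
    """
    for row in csv.DictReader(handle):
        yield (
            row["GKGRECORDID"],
            datetime.strptime(row["DATE"], date_format),
            row["DocumentIdentifier"],
        )


# Invented sample row, standing in for the real file.
sample = io.StringIO(
    "GKGRECORDID,DATE,DocumentIdentifier\n"
    "20170101000000-1,20170101000000,http://example.org/a\n"
)
records = list(read_input(sample))
```

With roughly 600,000 rows, iterating lazily as above avoids loading the whole file into memory before filtering starts.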
2. Training dataset
The training dataset consists of a list of URLs tagged by our monitoring experts. We ask teams to focus mainly on “Conflict and violence” and “Disasters” and to tag all the remaining documents as “Other”. The training dataset may contain documents not in English or URLs to videos, which should be identified and not used for the training.
3. Test dataset
Prizes & Recognition
The winner (or group of winners) and the winning solution will:
‣ receive a letter of recognition from IDMC;
‣ be featured and referenced in the 2017 edition of Global Report on Internal Displacement and on the IDMC website;
‣ be invited to write a blog post on the project on the Unite Ideas website;
‣ be offered the opportunity to have an advisory role in the further development of the submitted code; and
‣ have the possibility to present the solution to partner organizations such as IOM, UNHCR, OHCHR and ICRC.
The judges will review the submitted solutions within three weeks after the closure of the challenge. Qualified submissions will be judged on a combination of the following criteria:
1. Usability - the ease of use and user-friendliness of the submission.
2. Accuracy - the degree to which the tagging and the information extraction of the tool are correct.
3. Insights - the degree to which the tool's results and visualizations are useful for detecting displacement events and the corresponding displacement figures, and are presented in a creative manner.
4. Modularity - the ease of customization enabled by the solution.
5. Elegance - the elegance of the code written and the quality of documentation provided.
6. Documentation - the quality of the documentation provided alongside the code.
‣ Justin Ginnetti, Head of Data and Analysis Dept, IDMC
‣ Leonardo Milano, Senior Data Scientist, IDMC
‣ Sarah Telford, project manager, HDX
‣ Luca Vernaccini, INFORM Index for Disaster Management
‣ Nuno Nunes, Global CCCM Cluster Coordinator, IOM
‣ Andrew Palmer, Coordinator of the Early Warning and Information Support Unit, OHCHR
‣ This challenge is open to the general public. Public, private, and academic organizations are also invited to take part.
‣ Only original, open source work will be accepted. It is acceptable that your solution uses other existing open source libraries.
‣ There are no limitations on the number of submissions per participant/participating team.
‣ The participants are required to agree to the terms and conditions.