title: "Predihood: an open-source tool for describing and predicting neighbourhoods' environment"
tags:
- Python
- MongoDB
- data management
- neighbourhood
- prediction
- machine learning
authors:
- name: Nelly Barret
  orcid: 0000-0002-3469-4149
  affiliation: 1
- name: Fabien Duchateau
  orcid: 0000-0001-6803-917X
  affiliation: 1
- name: Franck Favetta
  orcid: 0000-0003-2039-3481
  affiliation: 1
affiliations:
- name: LIRIS UMR5205, Université Claude Bernard Lyon 1, Lyon, France
  index: 1
date: 16 September 2020
bibliography: paper.bib
Introduction
Finding a place to live in a new city is a real challenge. We often arrive in a city we do not know, and finding the perfect living area becomes complex. Nearby public transportation on one hand, rural landscape on the other hand, a lively neighbourhood for some, far from urban hustle and bustle for others: there are many criteria for choosing a future neighbourhood. Our tool, Predihood, makes it possible to describe neighbourhoods with a set of indicators and to predict their environment using supervised learning.
Statement of need
Several projects focus on qualifying neighbourhoods using social networks. For instance, the Livehoods project defines and computes dynamics of neighbourhoods [@cranshaw2012livehoods] while the Hoodsquare project detects similar areas based on Foursquare check-ins [@zhang2013hoodsquare]. Crowd-based systems are interesting but may be biased. DataFrance is an interface that integrates data from several sources, such as indicators provided by the National Institute of Statistics (INSEE), geographical information from the National Geographic Institute (IGN) and surveys from newspapers for prices (L'Express). DataFrance enables the visualization of hundreds of indicators, but this makes it difficult to judge the environment of a neighbourhood: no simple description of a neighbourhood's environment is available.
Methodology
To describe the environment of a neighbourhood as accurately as possible, social science researchers have defined six environment variables, each with a limited number of values [@barretpredicting]. These six variables are the building type, the building usage, the landscape, the social class, the morphological position and the geographical position. As an example, the landscape can be evaluated as urban, green areas, forest or countryside, while the social class has values ranging from lower to upper. These variables are commonly accepted and easily understandable.
Predihood provides the following functionalities:
- adding new neighbourhoods and indicators to describe them;
- predicting the environment of a neighbourhood by configuring and using predefined algorithms;
- adding new predictive algorithms.
Adding new data
Neighbourhoods are represented as GeoJSON objects and include:
- a geometry (multi-polygons), which describes the shape of the neighbourhood;
- properties, with descriptive information (e.g., name, city postcode) and indicators which quantify the environment (e.g., number of restaurants, number of bakeries, average income, unemployment rate, number of houses with a surface area above 250 m^2). These hundreds of indicators are used for predicting the values of environment variables.
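As an illustration, a neighbourhood with this structure could be written as the Python dictionary below and serialized to GeoJSON. This is only a minimal sketch: the property names (name, city_postcode, indicators, nb_restaurants, ...), the coordinates and all values are hypothetical and are not taken from the actual Predihood or mongiris schema. The helper is_valid_neighbourhood is likewise an assumption made for the example.

```python
import json

# Hypothetical neighbourhood: property names and values are illustrative only.
neighbourhood = {
    "type": "Feature",
    "geometry": {
        "type": "MultiPolygon",
        # one polygon made of one closed ring of [longitude, latitude] positions
        "coordinates": [[[
            [4.85, 45.75], [4.86, 45.75], [4.86, 45.76], [4.85, 45.76], [4.85, 45.75],
        ]]],
    },
    "properties": {
        "name": "Example neighbourhood",
        "city_postcode": "69000",
        "indicators": {"nb_restaurants": 12, "nb_bakeries": 3, "unemployment_rate": 0.08},
    },
}


def is_valid_neighbourhood(feature):
    """Check the minimal structure described above: a MultiPolygon geometry and properties."""
    geometry = feature.get("geometry", {})
    if feature.get("type") != "Feature" or geometry.get("type") != "MultiPolygon":
        return False
    if not isinstance(feature.get("properties"), dict):
        return False
    for polygon in geometry.get("coordinates", []):
        for ring in polygon:
            # a GeoJSON linear ring has at least 4 positions and is closed
            if len(ring) < 4 or ring[0] != ring[-1]:
                return False
    return True


geojson_text = json.dumps(neighbourhood)
```

The resulting string can then be stored in a file or a document database, since it is plain GeoJSON.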
To add new neighbourhoods, it is necessary to store them as GeoJSON and make them accessible to Predihood. Besides, some neighbourhoods have to be manually annotated (i.e., given a value for each of the six environment variables).
The current version of Predihood is bundled with data from France using the mongiris project (in which unit divisions named IRIS stand for neighbourhoods). It includes about 50,000 neighbourhoods with 640 indicators, and 270 neighbourhoods were annotated by social science researchers (one to two hours per neighbourhood to investigate pictures of buildings and streets, parked cars, facilities and green areas from services such as Google Street View).
Predicting environment
Machine learning algorithms need a dataset, as illustrated by Figure 1. In Predihood, a dataset is composed of the identifier of the neighbourhood (grey column), its indicators (yellow columns, showing only a subset) that have been normalized by density of population (green column) and optionally the assessment of social science researchers for the six environment variables (blue columns). The objective of Predihood is to automatically fill in the question marks for neighbourhoods that are not yet assessed.
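The layout of such a dataset can be sketched in a few lines of Python. The sketch assumes that normalizing by density of population means dividing each raw indicator by the population density, which may differ from the exact formula used in Predihood. The function names, column names and values are hypothetical.

```python
# The six environment variables named in the Methodology section (column names are hypothetical).
ENVIRONMENT_VARIABLES = [
    "building_type", "building_usage", "landscape",
    "social_class", "morphological_position", "geographical_position",
]


def build_row(identifier, indicators, density, assessment=None):
    """Build one dataset row: identifier, density, normalized indicators, optional assessment."""
    if density <= 0:
        raise ValueError("population density must be positive")
    row = {"id": identifier, "density": density}
    for name, value in indicators.items():
        # assumption: normalization is a plain division by the population density
        row[name] = value / density
    for variable in ENVIRONMENT_VARIABLES:
        # None plays the role of the question marks for unassessed neighbourhoods
        row[variable] = (assessment or {}).get(variable)
    return row


def unassessed(rows):
    """Return the identifiers of rows with at least one missing environment variable."""
    return [row["id"] for row in rows
            if any(row[variable] is None for variable in ENVIRONMENT_VARIABLES)]


rows = [
    build_row("N1", {"nb_restaurants": 12}, 4000.0,
              assessment={variable: "example value" for variable in ENVIRONMENT_VARIABLES}),
    build_row("N2", {"nb_restaurants": 2}, 500.0),
]
```

Here the second row has no assessment, so it is one of the neighbourhoods whose environment variables remain to be predicted.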
To perform prediction, a selection process first selects subsets of relevant indicators. These subsets, called lists, contain from 10 to 100 indicators. Predihood provides a cartographic web interface based on Leaflet and OpenStreetMap, as shown in Figure 2. It lets users search for a neighbourhood and predict its environment by selecting an algorithm. The current version of Predihood includes 8 predictive algorithms from scikit-learn (e.g., Random Forest).
Adding new algorithms
Because the prediction of these environment variables is a complex task, testing different algorithms and comparing their results may help increase the overall quality. To facilitate this task, Predihood offers a generic and easy-to-use programming structure for machine learning algorithms, based on Scikit-learn algorithms. Thus, experts can implement custom algorithms and run experiments in Predihood. Adding a new algorithm only requires 4 steps:
- Create a new class that represents your algorithm, e.g. MyOwnClassifier, and inherits from Classifier.
- Implement the core of your algorithm by coding the fit() and predict() functions. The fit function aims at fitting your classifier on assessed neighbourhoods while the predict function aims at predicting environment variables for a given neighbourhood.
- Add get_params() to be compatible with the Scikit-learn framework.
- Comment your classifier with the Numpy style in order to be able to tune it in the interface.
Below is a very simple example to illustrate the aforementioned steps. Note that your algorithm is automatically loaded in Predihood.
# file ./algorithms/MyOwnClassifier.py
from predihood.classes.Classifier import Classifier


class MyOwnClassifier(Classifier):
    """Description of the classifier.

    Parameters
    ----------
    a : float, default=0.01
        Description of a.
    b : int, default=10
        Description of b.
    """

    def __init__(self, a=0.01, b=10):
        self.a = a
        self.b = b

    def fit(self, X, y):
        # fit the classifier on assessed neighbourhoods here
        return self

    def predict(self, df):
        # predict the environment variables for the neighbourhoods in df here
        raise NotImplementedError("predict() is left to the implementer")

    def get_params(self, deep=True):
        # suppose this estimator has parameters "a" and "b"
        return {"a": self.a, "b": self.b}
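To make the four steps more concrete, here is a runnable toy version with actual fit and predict bodies. It is only a sketch under several assumptions: the empty Classifier class is a stand-in for predihood.classes.Classifier so that the example runs without Predihood installed, inputs are plain lists rather than whatever structure Predihood passes to predict, and the nearest-neighbour vote is an arbitrary choice made for illustration, not one of the bundled algorithms. The class name, parameter k and sample data are hypothetical.

```python
from collections import Counter


class Classifier:
    """Stand-in for predihood.classes.Classifier (assumption: used only to keep the sketch runnable)."""


class NearestNeighboursClassifier(Classifier):
    """Toy k-nearest-neighbours classifier.

    Parameters
    ----------
    k : int, default=3
        Number of closest assessed neighbourhoods that vote for the predicted value.
    """

    def __init__(self, k=3):
        self.k = k

    def fit(self, X, y):
        # memorize the assessed neighbourhoods (indicator vectors and their labels)
        self.X_ = [list(row) for row in X]
        self.y_ = list(y)
        return self

    def predict(self, X):
        predictions = []
        for row in X:
            # squared Euclidean distance to every memorized neighbourhood
            distances = sorted(
                (sum((a - b) ** 2 for a, b in zip(row, known)), index)
                for index, known in enumerate(self.X_)
            )
            votes = Counter(self.y_[index] for _, index in distances[: self.k])
            predictions.append(votes.most_common(1)[0][0])
        return predictions

    def get_params(self, deep=True):
        return {"k": self.k}


# Hypothetical normalized indicators and landscape values for six assessed neighbourhoods.
X_train = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [1.0, 1.0], [0.9, 1.0], [1.0, 0.9]]
y_train = ["urban", "urban", "urban", "countryside", "countryside", "countryside"]

classifier = NearestNeighboursClassifier(k=3).fit(X_train, y_train)
predicted = classifier.predict([[0.05, 0.05], [0.95, 0.95]])
```

Returning self from fit follows the usual Scikit-learn convention and allows the chained call used in the last lines.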
In addition, Predihood provides an interface for easily tuning and testing algorithms on a dataset, as shown in Figure 3. The left panel lets users select an algorithm and tune its parameters and hyperparameters, such as training and test sizes. On the right, the table shows the accuracies obtained for each list of indicators (generated during the selection process) and each environment variable. Results can be exported to CSV.
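The kind of table described above, with one row per list of indicators and one column per environment variable, can be sketched with the standard library alone. This is not Predihood's export code: the function names, the list name list_10, the column layout and the sample values are assumptions made for the example.

```python
import csv
import io


def accuracy(predicted, expected):
    """Fraction of neighbourhoods whose predicted value equals the expert assessment."""
    if not expected or len(predicted) != len(expected):
        raise ValueError("predicted and expected must be non-empty and of equal length")
    correct = sum(p == e for p, e in zip(predicted, expected))
    return correct / len(expected)


def results_to_csv(results, variables):
    """Serialize {list_name: {variable: accuracy}} as CSV text, one row per list of indicators."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["list"] + variables)
    for list_name, scores in results.items():
        writer.writerow([list_name] + [f"{scores[variable]:.2f}" for variable in variables])
    return buffer.getvalue()


# Hypothetical predictions and assessments for one list of indicators.
landscape_accuracy = accuracy(
    ["urban", "forest", "urban", "urban"],
    ["urban", "urban", "urban", "forest"],
)
csv_text = results_to_csv(
    {"list_10": {"landscape": landscape_accuracy, "social_class": 0.75}},
    ["landscape", "social_class"],
)
```

Writing to an in-memory buffer keeps the sketch free of file paths; the same writer works with a file opened in text mode.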
Mentions of Predihood
Predihood was presented at the DATA conference [@barretpredicting]. Prediction results using 6 algorithms from Scikit-learn range from 30% to 65% depending on the environment variable, and designing new algorithms could help improve these results.
The project is available here: https://gitlab.liris.cnrs.fr/fduchate/predihood.
Acknowledgements
This work has been partially funded by LABEX IMU (ANR-10-LABX-0088) from Université de Lyon, in the context of the program "Investissements d'Avenir" (ANR-11-IDEX-0007) from the French Research Agency (ANR).