[M] paper JOSS

Commit e60cd236 authored 4 years ago by Nelly Barret
--- a/paper.md
+++ b/paper.md
@@ -4,7 +4,7 @@ tags:
  - Python
  - MongoDB
  - data management
-  - neighborhood
+  - neighbourhood
  - prediction
  - machine learning
 authors:
@@ -27,235 +27,74 @@ bibliography: paper.bib
 ---
 # Statement of need
- context
+Finding a place to live in a new city is still a challenge. We often arrive in a city we do not know, which makes finding the ideal place to live difficult. Public transport nearby for some, a rural landscape for others; a lively neighbourhood for some, a quiet one far from the urban hustle and bustle for others: there are many criteria for choosing a future neighbourhood.
-Finding a real-estate in a new city is still a challenge. We often arrive in a city we don't know, thus finding the perfect living place becomes complex. Nearby public transport on one hand, a rural landscape on the other hand, an animated neighbourhood for some, far from urban hustle and bustle for others: there are many criteria for choosing your future neighbourhood.
- some related work (how others qualify a neighbourhood, and say that it is limited to a few cities, partially biased, etc.)
- say that sociologists have defined EVs (environment variables; no table, just the list of the 6 + the values of social class). Say that these variables are "universal" (commonly accepted?)
-but that the problem is to qualify each neighbourhood with these EVs
- so the objective of Predihood is to predict these EVs
+Several projects have focused on qualifying neighbourhoods, such as Livehoods [@cranshaw2012livehoods] and Hoodsquare [@zhang2013hoodsquare]. The Livehoods project defines and computes the dynamics of neighbourhoods from social network data, while the Hoodsquare project detects similar areas based on Foursquare check-ins. Our contribution differs from this body of work on several points. Many works are limited to a few cities, others introduce bias by relying on social networks, and most focus on quality of life. In contrast, our approach covers a whole country (namely France), is based on reliable and frequently updated sources and on a social study, and focuses on the environment of neighbourhoods. Moreover, each neighbourhood can be described by thousands of indicators, which cannot be exploited manually and do not give a global view of the main characteristics of the neighbourhood.
+To describe the environment of a neighbourhood as accurately as possible, social science researchers have defined six environment variables, each with a limited number of values. These six variables are the _building type_, the _building usage_, the _landscape_, the _social class_, the _morphological position_ and the _geographical position_. For example, the _landscape_ can be evaluated as _urban_, _green areas_, _forest_ or _countryside_, while the _social class_ ranges from _lower_ to _upper_. These variables are commonly accepted and easy to understand and use. Describing every neighbourhood of a whole country with these six variables remains a challenge. To tackle it, our objective is to predict the environment variables of any neighbourhood by supervised learning.
 # Methodology
+To predict the environment of neighbourhoods, we first have to gather data about them. There are mainly two types of data: the geometry, which describes the shape of the neighbourhood, and indicators, which quantify its environment. Examples of indicators are the number of restaurants, the average income or the number of houses over 250 $$m^2$$. Predihood integrates such data for France by using [mongiris](https://gitlab.liris.cnrs.fr/fduchate/mongiris), an interface for querying French administrative areas. Only French areas are covered for now, but the approach can be extended to other countries.
- say that prediction requires data about neighbourhoods (geometry + indicators). Say that Predihood integrates data about France via mongiris
+After gathering data, the next step is to assess some neighbourhoods, because the supervised learning approach requires labelled examples. This manual assessment was carried out by social science researchers, who examined Google Street View (pictures of buildings and streets, parked cars, facilities and green areas); it requires one to two hours per neighbourhood. A total of 300 IRIS have been annotated and are used as training data.
-(url) but that this can be adapted for other countries;
- talk about the manual assessment (300 IRIS assessed by experts)
- then put the content of the Features section: first say that Predihood allows reusing Scikit-learn algorithms / same implementation, examples of algorithms,
-# Mentions of Predihhod
- add the DATA reference and screenshots (tuning screen, possibly the map screen to illustrate the prediction).
+To unify the view of assessed neighbourhoods and their indicators, datasets have been constructed. As shown in Figure 1, each row is composed of the INSEE code of the neighbourhood, its indicators normalized by population density, and the assessment made by social science researchers for the six environment variables.
+![Screenshot of Predihood](predihood-indicators.png)
-# Introduction
+It is now possible to predict the environment of any neighbourhood in France using our unified dataset. Because neighbourhoods are represented by hundreds of indicators, a selection process extracts subsets of relevant indicators. These subsets are called _lists_ and contain from 10 to 100 indicators. They are used in the Predihood interface to predict the environment.
+Predihood provides a generic interface that makes tuning algorithms easier. This interface is based on [Scikit-learn](https://scikit-learn.org/stable/) algorithms but can also handle custom ones. To implement your own algorithm and test it on our dataset, follow these steps:
-Getting to a new city after a job transfer is always a challenge! We often arrive in a city that we don't know, thus finding the perfect living place becomes complex. Nearby public transport on one hand, a rural landscape on the other hand, an animated neighbourhood for some, far from urban hustle and bustle for others: there are many criteria for choosing your future neighbourhood. We present, Predihood, a new tool which facilitates the search and the choice of a neighbourhood.
+1. Create a new class that represents your algorithm and inherits from `Classifier`.
+2. Then, implement the core of your algorithm by coding the `fit()` and `predict()` functions. The `fit()` function fits your classifier on assessed neighbourhoods, while the `predict()` function predicts environment variables for a given neighbourhood.
+3. Next, add `get_params()` and `set_params()` to be compatible with the Scikit-learn framework.
+4. Add your classifier to the `AVAILABLE_CLASSIFIERS` variable (stored in `./classifiers_list.py`).
+5. Finally, do not forget to document your classifier with a NumPy-style docstring if you want to tune it.
-Keywords: Machine learning, Supervised learning by classification, Data cleaning, Information visualization, Neighbourhood study
+Below is a very simple example that illustrates the aforementioned steps.
-We have identified mainly three challenges:
-1. The definition of a neighbourhood can vary a lot. We have to define quite precisely what is a neighbourhood.
-2. Each neighbourhood can be described by thousands of indicators. Besides, it is impossible to manually exploit these indicators and we don't have a global view of the principal characteristics of the neighbourhood. We have to define a limited number of features to describe neighbourhoods.
-3. How to predict the environment of any neighbourhood for a whole country (in this case France)? We have to propose a supervised approach implemented into an interface
-# Methodology
-To tackle these challenges, social science researchers have defined six environment variables which aim at describing the environment of a neighbourhood. This six variables are represented in the table below:
-| Building type | Building usage | Landscape | Social class | Morphological position | Geographical position |
-|---------------|----------------|-----------|--------------|------------------------|-----------------------|
-| Towers | Housing | Urban | Lower | Central | Centre |
-| Housing subdivisions | Shopping | Green areas | Lower middle | Urban | North |
-| Housing estates | Other activities | Forest | Middle | Peri-urban | North-East |
-| Houses | | Countryside | Upper middle | Rural | East |
-| Mixed | | | Upper | | ... |
-These variables facilitate the comparison of neighbourhoods.
-In order to predict the environment of any neighbourhood in France, social science researchers have manually assessed 300 neighbourhoods (note that assessment requires 1 to 2 hours per neighbourhood). These assessed neighbourhood are used as training data in supervised algorithms.
-# Features
-If you want to implement your own algorithm and test it on our dataset, it is possible! You will have to perform the following steps:
-1. Create a new class that represents your algorithm. This class must extends the class `Method`, i.e. `class NewAlgorithm(Method):`.
-2. Then, your algorithm must implemented `fit()` and `predict()` functions.
-3. You can now use your algorithm on our dataset by create a new instance of Data and another of Dataset. Do not forget to init them by calling the function (`init_all_in_one()`).
-Below, you will find a very simple example to illustrate the aforementioned steps.
 ```python
 # file ./classes/NewAlgorithm.py
-class NewAlgorithm(Method):
+from predihood.classes.Classifier import Classifier
-    def __init__(self, _bar=1):
-        self.foo = 0
-        self.bar = _bar
-    def fit(self):
-        # do stuff here
-    def predict(self, iris):
-        # do stuff here
-if name == "__main__":
-    data = new Data(normalization="density", filtering=True)
-    data.init_all_in_one()
-    dataset = new Dataset(data, "", "supervised")
-    dataset.init_all_in_one()
-    new_algorithm = new NewAlgorithm(bar=2)
-    new_algorithm.fit()
-    new_algorithm.predict("012345678")
-```
-# Use cases
+class NewAlgorithm(Classifier):
+  def __init__(self, alpha=1, beta="2"):
+    Classifier.__init__(self)
+    self.alpha = alpha
+    self.beta = beta
-## Alice, a newcomer in Lyon
+  def fit(self, X, y):
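+    # fit the classifier on the assessed neighbourhoods here
+    return self  # as in Scikit-learn, fit() returns the fitted estimator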
-Alice is an IT commercial, therefore she often moves across the whole country. She is recruited for a mission in Lyon for 6 months before going back to Paris. Alice would like to find a neighbourhood which is urban, near from shops and, if possible, near from a gym. She knows from her friends that the Part-Dieu neighbourhood is in the CBD (Central Business District) but she prefers to compare it with others before having her final decision. With Predihood, Alice writes the query "Lyon" in the search bar.  Then she compare several neighbourhoods using the environment variables and selects a few that she might like. So she compares in detail two neighbourhoods: "Part-Dieu" and "Danton Bir-Akeim". The first one has a lot of shops and services, illustrated by three grouped indicators: "indicateurs service-divers-prive" (i.e. private services such as banks, driving schools or travel agencies), "service-divers-public" (i.e. public services such as post offices or direction of public finances) and "animation-commerce-nonalimentaire" (i.e. shops such as clothing stores, hairdressers or appliance stores). The second one has a gym as shown by the indicators "salles multisports" (i.e. multisport hall) and "salles de remise en forme" (i.e. fitness room). Alice prefers to be near from shops and will go to the gym by bike. Finally, she selects the "Part-Dieu" neighbourhood for finding an apartment. Moreover, our tool provides a confidence score, which corresponds to the number of lists of relevant indicators which have predict the selected value. For example, all lists have predict the neighbourhood "Part-Dieu" as a "towers" neighbourhood (score at 7/7).
-![Screenshot of Predihood](predihood-predictions.png)
-## Bob, an IT professor
-Bob is an associate professor in artifical intelligence at the University of Lyon. He works on a new supervised learning algorithm and on improving some existing supervised learning algorithms. First, he would like to test the impact of some parameters on the performances. To achieve this, he adds his proposals in our Predihood tool and test a lot of different configurations. He can easily tune his algorithms with the tuning panel. For example he changes with several values the number of neighbourhoods for his improved version of the KNN algorithm. Then we would like to run his algorithms on another dataset than his to test his robustness. To achieve this, he runs his new algorithm on our daatset composed of assessed neighbourhood. Finally, Bob teaches a machine learning course and he plans to give a practical course on Scikit-learn algorithms. With Predihood, Bob's students use the interface to learn basic parameters and their influence on results. Moreover, they can export their results as Excel in order to have a detailed experiments section.
-![Screenshot of Predihood](predihood-accuracies.png)
-# Acknowledgements
-This work has been partially funded by LABEX IMU (ANR-10-LABX-0088) from Université de Lyon, in the context of the program "Investissements d'Avenir" (ANR-11-IDEX-0007) from the French Research Agency (ANR).
-# References
+  def predict(self, indicators_values):
+    # compute and return the predicted values here
+    raise NotImplementedError("implement your own predict function")
+  def get_params(self, deep=True):
+    # suppose this estimator has parameters "alpha" and "beta"
+    return {"alpha": self.alpha, "beta": self.beta}
+  def set_params(self, **parameters):
+    for parameter, value in parameters.items():
+      setattr(self, parameter, value)
+    return self
+```
+After that, your algorithm is ready to be used in Predihood. 
+# Mentions of Predihood
+Our approach Predihood has been presented during the DATA conference [@barretpredicting].
+This first screenshot shows the generic interface of Predihood for tuning algorithms. The left panel lets you tune parameters and hyperparameters, such as training and test sizes. On the right, the table shows the accuracies obtained for each list (generated during the selection process) and each environment variable. You can export these results by clicking on the download icon.
-# Predihood: an open-source tool for describing and predicting neighbourhoods' environment
-tags: Python, MongoDB, data management, neighborhood, prediction, machine learning
-# Statement of need
-Finding a real-estate in a new city is still a challenge. We often arrive in a city we don't know, thus finding the perfect living place becomes complex. Nearby public transport on one hand, a rural landscape on the other hand, an animated neighbourhood for some, far from urban hustle and bustle for others: there are many criteria for choosing your future neighbourhood.
-Some projects have been focused on qualify neighbourhoods, such as Livehoods [REF] and Hoodsquare [REF]. The Livehoods project aims at defining and computing dynamics of neighbourhoods based on social network data while the Hoodsquare project detect similar areas based on Foursquare check-ins. Regarding a lot of papers about this challenges, our contribution differs on several points. A number of works are limited to a few cities, some others intrtoduce biais by using social networks and finally, the majority of works are focusing on life quality. Contrary to exitsing works, our approach works for a whole country (namely in France), is based on __reliable__ sources and a social study, and is focused on the environment of neighbourhood. Moreover, aach neighbourhood can be described by thousands of indicators. Besides, it is impossible to manually exploit these indicators and we don't have a global view of the principal characteristics of the neighbourhood. 
-In order to describe in the most accurate way the environment of a neighbourhood, social science researchers have defined six environment variables with a limited number of values for each one. These six variables are the _building type_, the _building usage_, the _landscape_, the _social class_, the _morphological position_ and the _geographical position_. As an example, the _landscape_ can be evaluated as _urban_, _green areas_, _forest_ or _countryside_ while the _social class_ have values from _lower_ to _upper_. These variables are commonly accepted and easy to understand and use. There is still a challenge about describing each neighbourhood in a whole country with these six variables. To tackle this challenge, our objective is to predict by supervised learning the environment variable whatever the neighbourhood.
-# Methodology
- say that prediction requires data about neighbourhoods (geometry + indicators). Say that Predihood integrates data about France via mongiris
-(url) but that this can be adapted for other countries;
-In order to predict the environment of neighbourhoods, we have to gather data about neighbourhoods. There are mainly two type of data: the geometry which describe the shape of the neighbourhood and indicators that quantify the environment. For example, there are the number of restaurants, the average income or even the number of houses over 250 $$m^2$$. Predihood integrate such data for France by using [mongiris](https://gitlab.liris.cnrs.fr/fduchate/mongiris), an interface for querying French administrative areas. There are only data about French areas but this can be extended for other countries.
- talk about the manual assessment (300 IRIS assessed by experts)
-After gathering data, the next step is to assessed some neighbourhoods **because of** the supervised learning approach. This manual assessment have been realized by social science researchers. This have been done by investigating Google Street View (building and streets pictures, parked cars, facilities and greens areas) and requires between one to two hours for a single neighbourhood. A total of 300 IRIS have been annotated.
-> These assessed neighbourhood are used as training data in supervised algorithms.
- then put the content of the Features section: first say that Predihood allows reusing Scikit-learn algorithms / same implementation, examples of algorithms,
-Predihood proposes a generic interface for tuning algorithms more easilly. This interface is based on [Scikit-learn](https://scikit-learn.org/stable/) algorithms but can handle hand-made ones. 
-> 
-> If you want to implement your own algorithm and test it on our dataset, it is possible! You will have to perform the following steps:
-1. Create a new class that represents your algorithm. This class must extends the class `Method`, i.e. `class NewAlgorithm(Method):`.
-2. Then, your algorithm must implement `fit()` and `predict()` functions.
-3. You can now use your algorithm on our dataset by create a new instance of Data and another of Dataset. Do not forget to init them by calling the function (`init_all_in_one()`).
-Below, you will find a very simple example to illustrate the aforementioned steps.
-```python
-# file ./classes/NewAlgorithm.py
-class NewAlgorithm(Method):
-    def __init__(self, _bar=1):
-        self.foo = 0
-        self.bar = _bar
-    def fit(self):
-        # do stuff here
-    def predict(self, iris):
-        # do stuff here
-if name == "__main__":
-    data = new Data(normalization="density", filtering=True)
-    data.init_all_in_one()
-    dataset = new Dataset(data, "", "supervised")
-    dataset.init_all_in_one()
-    new_algorithm = new NewAlgorithm(bar=2)
-    new_algorithm.fit()
-    new_algorithm.predict("012345678")
-```
-# Mentions of Predihhod
- add the DATA reference and screenshots (tuning screen, possibly the map screen to illustrate the prediction).
 ![Screenshot of Predihood](predihood-accuracies.png)
+This screenshot shows the cartographic interface of Predihood, used mostly by people looking for a new place to live. After searching for an area with the inputs on the left and clicking on a neighbourhood, you can choose an algorithm to predict the environment variables of that neighbourhood. For beginners, the `Random Forest` classifier is recommended. For example, Alice works in IT sales and has been recruited for a six-month mission in Lyon before going back to Paris. She easily compares many neighbourhoods in the CBD (Central Business District) of Lyon and chooses the "Part-Dieu" neighbourhood.
 ![Screenshot of Predihood](predihood-predictions.png)
 # Acknowledgements
@@ -263,3 +102,26 @@ if name == "__main__":
 This work has been partially funded by LABEX IMU (ANR-10-LABX-0088) from Université de Lyon, in the context of the program "Investissements d'Avenir" (ANR-11-IDEX-0007) from the French Research Agency (ANR).
 # References
+@inproceedings{cranshaw2012livehoods,
+  title={The livehoods project: Utilizing social media to understand the dynamics of a city},
+  author={Cranshaw, Justin and Schwartz, Raz and Hong, Jason I and Sadeh, Norman},
+  booktitle={International AAAI Conference on Weblogs and Social Media},
+  pages={58},
+  year={2012}
+}
+@inproceedings{zhang2013hoodsquare,
+  title={Hoodsquare: Modeling and recommending neighborhoods in location-based social networks},
+  author={Zhang, Amy X and Noulas, Anastasios and Scellato, Salvatore and Mascolo, Cecilia},
+  booktitle={2013 International Conference on Social Computing},
+  pages={69--74},
+  year={2013},
+  organization={IEEE}
+}
+@article{barretpredicting,
+  title={Predicting the environment of a neighbourhood: a use case for France},
+  author={Barret, Nelly and Duchateau, Fabien and Favetta, Franck and Bonneval, Loic}
+}  
--- a/predihood-indicators.png
+++ b/predihood-indicators.png
--- a/predihood/classes/Classifier.py
+++ b/predihood/classes/Classifier.py
@@ -2,7 +2,7 @@ class Classifier:
    # def __init__(self):
    def fit(self, X, y):
-        return None
+        raise NotImplementedError("You must implement your own fit function.")
    def predict(self, df):
-        return ["default_value"]
+        raise NotImplementedError("You must implement your own predict function.")
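
Because the base class now raises instead of returning a placeholder, a subclass has to override both methods before it can be used. The following is a minimal sketch of such a subclass, not Predihood code: `Classifier` here is a stand-in base class with the same two methods, and `MajorityClassifier` with its majority-vote logic is a hypothetical example chosen only because it is short.

```python
from collections import Counter


class Classifier:
    """Stand-in base class (assumption: same two methods as Predihood's Classifier)."""

    def fit(self, X, y):
        raise NotImplementedError("You must implement your own fit function.")

    def predict(self, df):
        raise NotImplementedError("You must implement your own predict function.")


class MajorityClassifier(Classifier):
    """Hypothetical classifier that always predicts the most frequent training label.

    Parameters
    ----------
    default_label : object, default=None
        Label returned when fit() received no training labels.
    """

    def __init__(self, default_label=None):
        self.default_label = default_label
        self.majority_ = None

    def fit(self, X, y):
        labels = list(y)
        # keep the most frequent label, or the default when there is no data
        self.majority_ = Counter(labels).most_common(1)[0][0] if labels else self.default_label
        return self  # Scikit-learn convention: fit() returns the estimator

    def predict(self, df):
        # one prediction per input row
        return [self.majority_ for _ in df]

    def get_params(self, deep=True):
        return {"default_label": self.default_label}

    def set_params(self, **parameters):
        for name, value in parameters.items():
            setattr(self, name, value)
        return self
```

Calling `MajorityClassifier().fit(X, y).predict(rows)` works because `fit()` returns the instance itself, which is also what the `return self` change in `NewAlgorithm.py` enables.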
--- a/predihood/classes/MethodPrediction.py
+++ b/predihood/classes/MethodPrediction.py
@@ -32,6 +32,8 @@ class MethodPrediction(Method):
        """
        Compute performance metrics, i.e. accuracy.
        """
+        print(self.dataset.X)
+        print(self.dataset.Y)
        scores = cross_val_score(self.classifier, self.dataset.X, self.dataset.Y, cv=5, scoring="accuracy")
        self.accuracy = scores.mean() * 100
        print(self.accuracy)
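
The accuracy stored above is the mean of the five per-fold scores, scaled to a percentage. A rough pure-Python sketch of that computation follows, assuming plain contiguous folds; Scikit-learn's `cross_val_score` uses stratified folds for classifiers, so its numbers can differ. `k_fold_accuracy` and `AlwaysFirstLabel` are hypothetical names used only for illustration.

```python
def k_fold_accuracy(make_classifier, X, y, k=5):
    """Return the mean accuracy (in percent) over k contiguous folds.

    Assumes X and y are lists of equal length, with at least k elements.
    """
    n = len(X)
    # fold i covers indices [bounds[i], bounds[i + 1])
    bounds = [i * n // k for i in range(k + 1)]
    scores = []
    for i in range(k):
        start, stop = bounds[i], bounds[i + 1]
        X_test, y_test = X[start:stop], y[start:stop]
        X_train, y_train = X[:start] + X[stop:], y[:start] + y[stop:]
        # train a fresh classifier on everything outside the current fold
        classifier = make_classifier().fit(X_train, y_train)
        predictions = classifier.predict(X_test)
        correct = sum(p == t for p, t in zip(predictions, y_test))
        scores.append(correct / len(y_test))
    return sum(scores) / k * 100


class AlwaysFirstLabel:
    """Toy classifier: predicts the first label seen during fit."""

    def fit(self, X, y):
        self.label_ = y[0]
        return self

    def predict(self, X):
        return [self.label_] * len(X)
```

Passing a factory (`make_classifier`) rather than an instance keeps the folds independent, since each fold trains a new, unfitted classifier.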

--- a/predihood/classes/NewAlgorithm.py
+++ b/predihood/classes/NewAlgorithm.py
@@ -2,14 +2,24 @@ from predihood.classes.Classifier import Classifier
 class NewAlgorithm(Classifier):
+    """
+    A new algorithm.
+    Parameters
+    ----------
+    alpha : int, default=1
+        The first parameter.
+    beta : str, default="2"
+        The second parameter.
+    """
    def __init__(self, alpha=1, beta="2"):
        Classifier.__init__(self)
        self.alpha = alpha
        self.beta = beta
    def fit(self, X, y):
        print("here")
-        return None
+        return self
    def predict(self, df):
        print("there")

--- a/predihood/classifiers_list.py
+++ b/predihood/classifiers_list.py
@@ -7,12 +7,14 @@ from sklearn.svm import SVC
 from sklearn.tree import DecisionTreeClassifier
+# to add a new classifier, add its display name as the key and the class that implements it as the value.
+# do not forget to import your classifier
 AVAILABLE_CLASSIFIERS = {
-    "RandomForestClassifier": RandomForestClassifier,
+    "Random Forest Classifier": RandomForestClassifier,
-    "KNeighborsClassifier": KNeighborsClassifier,
+    "KNeighbors Classifier": KNeighborsClassifier,
-    "DecisionTreeClassifier": DecisionTreeClassifier,
+    "Decision Tree Classifier": DecisionTreeClassifier,
    "SVC": SVC,
-    "AdaBoostClassifier": AdaBoostClassifier,
+    "AdaBoost Classifier": AdaBoostClassifier,
-    "MLPClassifier": MLPClassifier,
+    "MLP Classifier": MLPClassifier,
-    "NewAlgorithm": NewAlgorithm
+    "New Algorithm": NewAlgorithm
 }
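
Keying the dictionary by display name lets the interface show readable labels while the code resolves them to classes without `eval`. A minimal sketch of such a lookup is given below, using stand-in classes; the body of `get_classifier` shown here is a guess at the idea, not Predihood's actual implementation.

```python
class RandomForestStandIn:
    """Placeholder for an imported classifier class."""


class NewAlgorithmStandIn:
    """Placeholder for a user-defined classifier class."""


# display name shown in the interface -> class implementing the classifier
AVAILABLE_CLASSIFIERS = {
    "Random Forest Classifier": RandomForestStandIn,
    "New Algorithm": NewAlgorithmStandIn,
}


def get_classifier(name):
    """Return the class registered under `name`, or raise a descriptive error."""
    try:
        return AVAILABLE_CLASSIFIERS[name]
    except KeyError:
        known = ", ".join(sorted(AVAILABLE_CLASSIFIERS))
        raise ValueError(f"Unknown classifier {name!r}; available: {known}") from None
```

One consequence of renaming the keys, as this commit does, is that any caller still passing an old key such as `"RandomForestClassifier"` no longer finds a match, so an error message that lists the valid names helps spot stale keys.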
--- a/predihood/utility_functions.py
+++ b/predihood/utility_functions.py
@@ -221,13 +221,16 @@ def signature(chosen_algorithm):
    try:
        # model = eval(_chosen_algorithm) # never use eval on untrusted strings
        model = get_classifier(chosen_algorithm)
+        print(model)
        doc = model.__doc__  # TODO: specify case when there is no doc (user-implemented algorithm)
+        print(doc)
        param_section = "Parameters"
        dashes = "-" * len(param_section)  # -------
        number_spaces = doc.find(dashes) - (doc.find(param_section) + len(param_section))
        attribute_section = "Attributes\n"
        # sub_doc is the param section of the docs (i.e. without attributes and some text)
        sub_doc = doc[doc.find(param_section) + len(param_section) + number_spaces + len(dashes) + len("\n"):doc.find(attribute_section)]
+        print(sub_doc)
    except:
        raise Exception("This algorithm does not exist for the moment...")
    params = inspect.getfullargspec(model.__init__).args[1:]  # get parameter' names -- [1:] to remove self parameter
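
The TODO above concerns classifiers without a docstring, for which `doc.find` would fail because `doc` is `None`. One way to handle that case is sketched here, under the assumption that returning an empty string is acceptable for undocumented classes; `parameters_section` is a hypothetical helper, not a function of Predihood.

```python
import inspect


def parameters_section(cls):
    """Return the body of the NumPy-style "Parameters" section of cls's docstring.

    Returns an empty string when the class has no docstring or no such section.
    """
    doc = cls.__doc__  # None for classes defined without a docstring
    if not doc:
        return ""
    lines = inspect.cleandoc(doc).splitlines()  # cleandoc removes the common indentation
    for i in range(len(lines) - 1):
        # a section header is a title line followed by a line made only of dashes
        if lines[i].strip() == "Parameters" and set(lines[i + 1].strip()) == {"-"}:
            body = []
            for line in lines[i + 2:]:
                if line.strip() == "Attributes":
                    break  # stop at the next section
                body.append(line)
            return "\n".join(body).strip()
    return ""


class Documented:
    """
    A documented classifier.

    Parameters
    ----------
    alpha : int, default=1
        The first parameter.

    Attributes
    ----------
    fitted_ : bool
    """


class Undocumented:
    pass
```

Working line by line avoids the offset arithmetic on `doc.find(...)` results, which silently produces wrong slices when a section is missing because `find` returns `-1`.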