diff --git a/paper.md b/paper.md index 780d7951f2de4bc1c042ce5dcd174bdca4f2f6d2..91dd6996fa29f115fa3723ff0a025902aac3774a 100644 --- a/paper.md +++ b/paper.md @@ -4,7 +4,7 @@ tags: - Python - MongoDB - data management - - neighborhood + - neighbourhood - prediction - machine learning authors: @@ -27,235 +27,74 @@ bibliography: paper.bib --- - # Statement of need -- contexte - -Finding a real-estate in a new city is still a challenge. We often arrive in a city we don't know, thus finding the perfect living place becomes complex. Nearby public transport on one hand, a rural landscape on the other hand, an animated neighbourhood for some, far from urban hustle and bustle for others: there are many criteria for choosing your future neighbourhood. - - -- un peu de related (comment les autres font pour qualifier un quartier, et dire que c'est limité à qques villes, partiellement biaisé, etc.) - - - -- dire que des socios ont défini des VE (pas de tableau, juste liste des 6 + valeurs de social). Dire que ces vars sont "universelles" (commmonly accepted ?) -mais que le pb est de qualifier chaque quartier avec ces VE - - - -- donc objectif de predihood est de prédire ces VE +Finding real estate in a new city is still a challenge. We often arrive in a city we don't know, so finding the perfect place to live becomes complex. Nearby public transport on the one hand, a rural landscape on the other, a lively neighbourhood for some, far from the urban hustle and bustle for others: there are many criteria for choosing your future neighbourhood. +Some projects have focused on characterising neighbourhoods, such as Livehoods [@cranshaw2012livehoods] and Hoodsquare [@zhang2013hoodsquare]. The Livehoods project aims at defining and computing the dynamics of neighbourhoods based on data gathered from social networks, while the Hoodsquare project detects similar areas based on Foursquare check-ins. 
Compared with the many papers addressing these challenges, our contribution differs on several points. Many works are limited to a few cities, others introduce bias by relying on social networks, and most focus on quality of life. In contrast, our approach covers a whole country (namely France), is based on reliable and frequently updated sources and on a social study, and focuses on the environment of neighbourhoods. Moreover, each neighbourhood can be described by thousands of indicators, which cannot be exploited manually and do not give a global view of the main characteristics of the neighbourhood. +In order to describe the environment of a neighbourhood as accurately as possible, social science researchers have defined six environment variables, each with a limited number of values. These six variables are the _building type_, the _building usage_, the _landscape_, the _social class_, the _morphological position_ and the _geographical position_. For example, the _landscape_ can be evaluated as _urban_, _green areas_, _forest_ or _countryside_, while the _social class_ has values from _lower_ to _upper_. These variables are commonly accepted and easy to understand and use. The remaining challenge is to describe every neighbourhood of a whole country with these six variables. To tackle it, our objective is to predict the environment variables of any neighbourhood by supervised learning. # Methodology +In order to predict the environment of neighbourhoods, we have to gather data about them. There are mainly two types of data: the geometry, which describes the shape of the neighbourhood, and indicators that quantify the environment, for example the number of restaurants, the average income or even the number of houses over 250 $m^2$. 
Predihood integrates such data for France by using [mongiris](https://gitlab.liris.cnrs.fr/fduchate/mongiris), an interface for querying French administrative areas. Only data about French areas are currently available, but this can be extended to other countries. -- dire que pour la prédiction, il faut des données sur les quartiers (géometrie + indicateurs). Dire que predihood intègre des données sur la FR via mongiris -(url) mais que c'est modifiable pour d'autres pays; - - - -- parler de l'évaluation manuelle (300 iris expertisés) - - - - - -- mettre le ccontenu de la section features ensuite : dire d'abord que predihood permet de réutiliser les algos de scikit-learn / même implémentation, ex d'algos, - - - - - -# Mentions of Predihhod -- mettre la ref DATA et des captures (écran de parametrage, éventuellement écran avec carte pour illustrer la prédiction). - - - - +After gathering data, the next step is to assess some neighbourhoods, because the supervised learning approach needs labelled examples. This manual assessment has been carried out by social science researchers by investigating Google Street View (pictures of buildings and streets, parked cars, facilities and green areas) and requires between one and two hours for a single neighbourhood. A total of 300 IRIS have been annotated and are used as training data. +In order to unify the view between assessed neighbourhoods and their indicators, datasets have been constructed. As illustrated in Figure 1, they are composed of the INSEE code of the neighbourhood, its indicators normalized by population density, and the assessment made by social science researchers for the six environment variables. +![Example of dataset indicators.](predihood-indicators.png) -# Introduction +It is now possible to predict the environment of any neighbourhood in France using our unified dataset. Because neighbourhoods are represented by hundreds of indicators, a selection process extracts subsets of relevant indicators. These subsets are called _lists_ and contain from 10 to 100 indicators. 
They are used in the Predihood interface to predict the environment. +Predihood provides a generic interface that makes tuning algorithms easier. This interface is based on [Scikit-learn](https://scikit-learn.org/stable/) algorithms but can also handle custom ones. To implement your own algorithm and test it on our dataset, follow these steps: -Getting to a new city after a job transfer is always a challenge! We often arrive in a city that we don't know, thus finding the perfect living place becomes complex. Nearby public transport on one hand, a rural landscape on the other hand, an animated neighbourhood for some, far from urban hustle and bustle for others: there are many criteria for choosing your future neighbourhood. We present, Predihood, a new tool which facilitates the search and the choice of a neighbourhood. +1. Create a new class that represents your algorithm and inherits from `Classifier`. +2. Then, implement the core of your algorithm by coding the `fit()` and `predict()` functions. The `fit` function fits your classifier on assessed neighbourhoods, while the `predict` function predicts environment variables for a given neighbourhood. +3. Next, add `get_params()` and `set_params()` to be compatible with the Scikit-learn framework. +4. Add your classifier to the `AVAILABLE_CLASSIFIERS` variable (stored in `./classifiers_list.py`). +5. Finally, do not forget to document your classifier with a NumPy-style docstring if you want to tune it. -Keywords: Machine learning, Supervised learning by classification, Data cleaning, Information visualization, Neighbourhood study - -We have identified mainly three challenges: - -1. The definition of a neighbourhood can vary a lot. We have to define quite precisely what is a neighbourhood. -2. Each neighbourhood can be described by thousands of indicators. Besides, it is impossible to manually exploit these indicators and we don't have a global view of the principal characteristics of the neighbourhood. 
We have to define a limited number of features to describe neighbourhoods. -3. How to predict the environment of any neighbourhood for a whole country (in this case France)? We have to propose a supervised approach implemented into an interface - -# Methodology - -To tackle these challenges, social science researchers have defined six environment variables which aim at describing the environment of a neighbourhood. This six variables are represented in the table below: - -| Building type | Building usage | Landscape | Social class | Morphological position | Geographical position | -|---------------|----------------|-----------|--------------|------------------------|-----------------------| -| Towers | Housing | Urban | Lower | Central | Centre | -| Housing subdivisions | Shopping | Green areas | Lower middle | Urban | North | -| Housing estates | Other activities | Forest | Middle | Peri-urban | North-East | -| Houses | | Countryside | Upper middle | Rural | East | -| Mixed | | | Upper | | ... | - -These variables facilitate the comparison of neighbourhoods. - -In order to predict the environment of any neighbourhood in France, social science researchers have manually assessed 300 neighbourhoods (note that assessment requires 1 to 2 hours per neighbourhood). These assessed neighbourhood are used as training data in supervised algorithms. - -# Features - -If you want to implement your own algorithm and test it on our dataset, it is possible! You will have to perform the following steps: - -1. Create a new class that represents your algorithm. This class must extends the class `Method`, i.e. `class NewAlgorithm(Method):`. -2. Then, your algorithm must implemented `fit()` and `predict()` functions. -3. You can now use your algorithm on our dataset by create a new instance of Data and another of Dataset. Do not forget to init them by calling the function (`init_all_in_one()`). - -Below, you will find a very simple example to illustrate the aforementioned steps. 
+Below, there is a very simple example to illustrate the aforementioned steps. ```python # file ./classes/NewAlgorithm.py -class NewAlgorithm(Method): - def __init__(self, _bar=1): - self.foo = 0 - self.bar = _bar - - def fit(self): - # do stuff here - - def predict(self, iris): - # do stuff here - -if name == "__main__": - data = new Data(normalization="density", filtering=True) - data.init_all_in_one() - dataset = new Dataset(data, "", "supervised") - dataset.init_all_in_one() - new_algorithm = new NewAlgorithm(bar=2) - new_algorithm.fit() - new_algorithm.predict("012345678") -``` +from predihood.classes.Classifier import Classifier -# Use cases +class NewAlgorithm(Classifier): + def __init__(self, alpha=1, beta="2"): + Classifier.__init__(self) + self.alpha = alpha + self.beta = beta -## Alice, a newcomer in Lyon - -Alice is an IT commercial, therefore she often moves across the whole country. She is recruited for a mission in Lyon for 6 months before going back to Paris. Alice would like to find a neighbourhood which is urban, near from shops and, if possible, near from a gym. She knows from her friends that the Part-Dieu neighbourhood is in the CBD (Central Business District) but she prefers to compare it with others before having her final decision. With Predihood, Alice writes the query "Lyon" in the search bar. Then she compare several neighbourhoods using the environment variables and selects a few that she might like. So she compares in detail two neighbourhoods: "Part-Dieu" and "Danton Bir-Akeim". The first one has a lot of shops and services, illustrated by three grouped indicators: "indicateurs service-divers-prive" (i.e. private services such as banks, driving schools or travel agencies), "service-divers-public" (i.e. public services such as post offices or direction of public finances) and "animation-commerce-nonalimentaire" (i.e. shops such as clothing stores, hairdressers or appliance stores). 
The second one has a gym as shown by the indicators "salles multisports" (i.e. multisport hall) and "salles de remise en forme" (i.e. fitness room). Alice prefers to be near from shops and will go to the gym by bike. Finally, she selects the "Part-Dieu" neighbourhood for finding an apartment. Moreover, our tool provides a confidence score, which corresponds to the number of lists of relevant indicators which have predict the selected value. For example, all lists have predict the neighbourhood "Part-Dieu" as a "towers" neighbourhood (score at 7/7). - - - -## Bob, an IT professor - -Bob is an associate professor in artifical intelligence at the University of Lyon. He works on a new supervised learning algorithm and on improving some existing supervised learning algorithms. First, he would like to test the impact of some parameters on the performances. To achieve this, he adds his proposals in our Predihood tool and test a lot of different configurations. He can easily tune his algorithms with the tuning panel. For example he changes with several values the number of neighbourhoods for his improved version of the KNN algorithm. Then we would like to run his algorithms on another dataset than his to test his robustness. To achieve this, he runs his new algorithm on our daatset composed of assessed neighbourhood. Finally, Bob teaches a machine learning course and he plans to give a practical course on Scikit-learn algorithms. With Predihood, Bob's students use the interface to learn basic parameters and their influence on results. Moreover, they can export their results as Excel in order to have a detailed experiments section. - - - -# Acknowledgements - -This work has been partially funded by LABEX IMU (ANR-10-LABX-0088) from Université de Lyon, in the context of the program "Investissements d'Avenir" (ANR-11-IDEX-0007) from the French Research Agency (ANR). 
- -# References + def fit(self, X, y): + # do stuff here + def predict(self, indicators_values): + # do stuff here + def get_params(self, deep=True): + # suppose this estimator has parameters "alpha" and "beta" + return {"alpha": self.alpha, "beta": self.beta} + def set_params(self, **parameters): + for parameter, value in parameters.items(): + setattr(self, parameter, value) + return self +``` +After that, your algorithm is ready to be used in Predihood. +# Mentions of Predihood +Our approach Predihood has been presented during the DATA conference [@barretpredicting]. - - - - - - - - - - - - - - - - - - - -# Predihood: an open-source tool for describing and predicting neighbourhoods' environment - -tags: Python, MongoDB, data management, neighborhood, prediction, machine learning - - -# Statement of need - -Finding a real-estate in a new city is still a challenge. We often arrive in a city we don't know, thus finding the perfect living place becomes complex. Nearby public transport on one hand, a rural landscape on the other hand, an animated neighbourhood for some, far from urban hustle and bustle for others: there are many criteria for choosing your future neighbourhood. - -Some projects have been focused on qualify neighbourhoods, such as Livehoods [REF] and Hoodsquare [REF]. The Livehoods project aims at defining and computing dynamics of neighbourhoods based on social network data while the Hoodsquare project detect similar areas based on Foursquare check-ins. Regarding a lot of papers about this challenges, our contribution differs on several points. A number of works are limited to a few cities, some others intrtoduce biais by using social networks and finally, the majority of works are focusing on life quality. Contrary to exitsing works, our approach works for a whole country (namely in France), is based on __reliable__ sources and a social study, and is focused on the environment of neighbourhood. 
Moreover, aach neighbourhood can be described by thousands of indicators. Besides, it is impossible to manually exploit these indicators and we don't have a global view of the principal characteristics of the neighbourhood. - -In order to describe in the most accurate way the environment of a neighbourhood, social science researchers have defined six environment variables with a limited number of values for each one. These six variables are the _building type_, the _building usage_, the _landscape_, the _social class_, the _morphological position_ and the _geographical position_. As an example, the _landscape_ can be evaluated as _urban_, _green areas_, _forest_ or _countryside_ while the _social class_ have values from _lower_ to _upper_. These variables are commonly accepted and easy to understand and use. There is still a challenge about describing each neighbourhood in a whole country with these six variables. To tackle this challenge, our objective is to predict by supervised learning the environment variable whatever the neighbourhood. - - -# Methodology - - -- dire que pour la prédiction, il faut des données sur les quartiers (géometrie + indicateurs). Dire que predihood intègre des données sur la FR via mongiris -(url) mais que c'est modifiable pour d'autres pays; - -In order to predict the environment of neighbourhoods, we have to gather data about neighbourhoods. There are mainly two type of data: the geometry which describe the shape of the neighbourhood and indicators that quantify the environment. For example, there are the number of restaurants, the average income or even the number of houses over 250 $$m^2$$. Predihood integrate such data for France by using [mongiris](https://gitlab.liris.cnrs.fr/fduchate/mongiris), an interface for querying French administrative areas. There are only data about French areas but this can be extended for other countries. 
- -- parler de l'évaluation manuelle (300 iris expertisés) - -After gathering data, the next step is to assessed some neighbourhoods **because of** the supervised learning approach. This manual assessment have been realized by social science researchers. This have been done by investigating Google Street View (building and streets pictures, parked cars, facilities and greens areas) and requires between one to two hours for a single neighbourhood. A total of 300 IRIS have been annotated. - -> These assessed neighbourhood are used as training data in supervised algorithms. - -- mettre le contenu de la section features ensuite : dire d'abord que predihood permet de réutiliser les algos de scikit-learn / même implémentation, ex d'algos, - -Predihood proposes a generic interface for tuning algorithms more easilly. This interface is based on [Scikit-learn](https://scikit-learn.org/stable/) algorithms but can handle hand-made ones. - - -> - -> If you want to implement your own algorithm and test it on our dataset, it is possible! You will have to perform the following steps: - -1. Create a new class that represents your algorithm. This class must extends the class `Method`, i.e. `class NewAlgorithm(Method):`. -2. Then, your algorithm must implement `fit()` and `predict()` functions. -3. You can now use your algorithm on our dataset by create a new instance of Data and another of Dataset. Do not forget to init them by calling the function (`init_all_in_one()`). - -Below, you will find a very simple example to illustrate the aforementioned steps. 
- -```python -# file ./classes/NewAlgorithm.py -class NewAlgorithm(Method): - def __init__(self, _bar=1): - self.foo = 0 - self.bar = _bar - - def fit(self): - # do stuff here - - def predict(self, iris): - # do stuff here - -if name == "__main__": - data = new Data(normalization="density", filtering=True) - data.init_all_in_one() - dataset = new Dataset(data, "", "supervised") - dataset.init_all_in_one() - new_algorithm = new NewAlgorithm(bar=2) - new_algorithm.fit() - new_algorithm.predict("012345678") -``` - - -# Mentions of Predihhod -- mettre la ref DATA et des captures (écran de parametrage, éventuellement écran avec carte pour illustrer la prédiction). +This first screenshot shows the generic interface of Predihood for tuning algorithms. The left panel lets you tune parameters and hyperparameters, such as training and test sizes. On the right, the table shows the accuracies obtained for each list (generated during the selection process) and each environment variable. You can export these results by clicking on the download icon. ![Screenshot of the algorithmic interface of Predihood](predihood-predictions.png) +This screenshot shows the cartographic interface of Predihood, used mostly by people looking for a new place to live. By searching for an area with the inputs on the left and then clicking on neighbourhoods, you can choose an algorithm to predict the environment variables of the chosen neighbourhood. For beginners, the `Random Forest` classifier is recommended. For example, Alice is an IT sales representative and has been recruited for a mission in Lyon for 6 months before going back to Paris. She easily compares many neighbourhoods in the CBD (Central Business District) of Lyon and chooses the "Part-Dieu" neighbourhood. + ![Screenshot of the cartographic interface of Predihood](predihood-cartography.png) # Acknowledgements @@ -263,3 +102,26 @@ if name == "__main__":
# References + +@inproceedings{cranshaw2012livehoods, + title={The livehoods project: Utilizing social media to understand the dynamics of a city}, + author={Cranshaw, Justin and Schwartz, Raz and Hong, Jason I and Sadeh, Norman}, + booktitle={International AAAI Conference on Weblogs and Social Media}, + pages={58}, + year={2012} +} + +@inproceedings{zhang2013hoodsquare, + title={Hoodsquare: Modeling and recommending neighborhoods in location-based social networks}, + author={Zhang, Amy X and Noulas, Anastasios and Scellato, Salvatore and Mascolo, Cecilia}, + booktitle={2013 International Conference on Social Computing}, + pages={69--74}, + year={2013}, + organization={IEEE} +} + +@article{barretpredicting, + title={Predicting the enviornment of a neighbourhood: a use case for France}, + author={Barret, Nelly and Duchateau, Fabien and Favetta, Franck and Bonneval, Loic} +} + diff --git a/predihood-indicators.png b/predihood-indicators.png new file mode 100644 index 0000000000000000000000000000000000000000..1d7677eb2600e4ac76e3cd8ac395825ce951ce3d Binary files /dev/null and b/predihood-indicators.png differ diff --git a/predihood/classes/Classifier.py b/predihood/classes/Classifier.py index 8b51856b62a96d1b22ec64c4424ab79562eefb3d..beb962f28442d04b2b18c87173e0c1d5fc5607ec 100644 --- a/predihood/classes/Classifier.py +++ b/predihood/classes/Classifier.py @@ -2,7 +2,7 @@ class Classifier: # def __init__(self): def fit(self, X, y): - return None + raise Exception("You must implement your own fit function.") def predict(self, df): - return ["default_value"] + raise Exception("You must implement your own predict function.") diff --git a/predihood/classes/MethodPrediction.py b/predihood/classes/MethodPrediction.py index 99556f98fe9d5cedc2ce38a80f9292d2f584f3c1..37840859c666f1a863597c50767e5ccae9e3acb9 100644 --- a/predihood/classes/MethodPrediction.py +++ b/predihood/classes/MethodPrediction.py @@ -32,6 +32,8 @@ class MethodPrediction(Method): """ Compute performance 
metrics, i.e. accuracy. """ + print(self.dataset.X) + print(self.dataset.Y) scores = cross_val_score(self.classifier, self.dataset.X, self.dataset.Y, cv=5, scoring="accuracy") self.accuracy = scores.mean() * 100 print(self.accuracy) diff --git a/predihood/classes/NewAlgorithm.py b/predihood/classes/NewAlgorithm.py index e31533f76db9bb7204af4fafefdcff3f5c8ef18b..b72694467af94807d17485476c1364b3d26ea87a 100644 --- a/predihood/classes/NewAlgorithm.py +++ b/predihood/classes/NewAlgorithm.py @@ -2,14 +2,24 @@ from predihood.classes.Classifier import Classifier class NewAlgorithm(Classifier): + """ + A new algorithm. + Parameters + ---------- + alpha : int, default=1 + The first parameter. + beta : str, default="2" + The second parameter. + """ def __init__(self, alpha=1, beta="2"): + Classifier.__init__(self) self.alpha = alpha self.beta = beta def fit(self, X, y): print("here") - return None + return self def predict(self, df): print("there") diff --git a/predihood/classifiers_list.py b/predihood/classifiers_list.py index c7c1ff88991e329c2de9d53518dd3d3909bda841..f069d37bfff5d66926f7c50ebeb42349f29a5ccb 100644 --- a/predihood/classifiers_list.py +++ b/predihood/classifiers_list.py @@ -7,12 +7,14 @@ from sklearn.svm import SVC from sklearn.tree import DecisionTreeClassifier +# to add a new classifier, use the display name of your classifier as key and the class that implements it as value. 
+# do not forget to import your classifier AVAILABLE_CLASSIFIERS = { - "RandomForestClassifier": RandomForestClassifier, - "KNeighborsClassifier": KNeighborsClassifier, - "DecisionTreeClassifier": DecisionTreeClassifier, + "Random Forest Classifier": RandomForestClassifier, + "KNeighbors Classifier": KNeighborsClassifier, + "Decision Tree Classifier": DecisionTreeClassifier, "SVC": SVC, - "AdaBoostClassifier": AdaBoostClassifier, - "MLPClassifier": MLPClassifier, - "NewAlgorithm": NewAlgorithm + "AdaBoost Classifier": AdaBoostClassifier, + "MLP Classifier": MLPClassifier, + "New Algorithm": NewAlgorithm } diff --git a/predihood/utility_functions.py b/predihood/utility_functions.py index 7b0cfe974ce4b0df84f15dc6aaee6c8a44738add..e60b835e4a1a1a93b0494cdb65aec6d549c0aa79 100644 --- a/predihood/utility_functions.py +++ b/predihood/utility_functions.py @@ -221,13 +221,16 @@ def signature(chosen_algorithm): try: # model = eval(_chosen_algorithm) # never use eval on untrusted strings model = get_classifier(chosen_algorithm) + print(model) doc = model.__doc__ # TODO: specify case when there is no doc (user-implemented algorithm) + print(doc) param_section = "Parameters" dashes = "-" * len(param_section) # ------- number_spaces = doc.find(dashes) - (doc.find(param_section) + len(param_section)) attribute_section = "Attributes\n" # sub_doc is the param section of the docs (i.e. without attributes and some text) sub_doc = doc[doc.find(param_section) + len(param_section) + number_spaces + len(dashes) + len("\n"):doc.find(attribute_section)] + print(sub_doc) except: raise Exception("This algorithm does not exist for the moment...") params = inspect.getfullargspec(model.__init__).args[1:] # get parameter' names -- [1:] to remove self parameter
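The `signature()` hunk above slices the docstring between the `Parameters` header and the `Attributes` header, and its TODO notes that classes without a docstring are not handled yet. As a hedged illustration only (not the project's implementation), the extraction could be sketched as below. `parameters_section`, `Documented` and `Undocumented` are hypothetical names introduced for this example, and the sketch assumes NumPy-style section headers underlined with dashes.

```python
import inspect


def _is_underline(text):
    """True for a non-empty line made only of dashes (NumPy-style header underline)."""
    stripped = text.strip()
    return bool(stripped) and set(stripped) == {"-"}


def parameters_section(cls):
    """Return the body of the "Parameters" section of cls's docstring, or "" if absent.

    Hypothetical helper: it only illustrates the slicing idea.
    """
    doc = cls.__doc__  # None when the class defines no docstring
    if not doc:
        return ""
    lines = inspect.cleandoc(doc).splitlines()  # normalise indentation first

    # Find the "Parameters" header followed by its dashed underline.
    start = None
    for i in range(len(lines) - 1):
        if lines[i].strip() == "Parameters" and _is_underline(lines[i + 1]):
            start = i + 2
            break
    if start is None:
        return ""

    # The section stops at the next underlined header (e.g. "Attributes"),
    # or at the end of the docstring when no other section follows.
    end = len(lines)
    for j in range(start, len(lines) - 1):
        if lines[j].strip() and _is_underline(lines[j + 1]):
            end = j
            break
    return "\n".join(lines[start:end]).strip()


class Documented:
    """
    A toy classifier (illustration only).

    Parameters
    ----------
    alpha : int, default=1
        The first parameter.
    beta : str, default="2"
        The second parameter.

    Attributes
    ----------
    fitted_ : bool
        Whether fit() was called.
    """


class Undocumented:
    pass


print(parameters_section(Documented))
print(repr(parameters_section(Undocumented)))
```

Returning an empty string when the docstring or the section is missing keeps user-implemented classifiers usable, instead of reporting them as non-existent algorithms.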