## Federate data-related "tools" (ETL)
Before being able to work on any given raw data, one first needs to:
- be able to read/parse the format (be it a file to open or a stream) in one's own target language (if the researcher's pipeline is written e.g. in Python, then one needs to read the data from Python). Even when the data is not encoded, e.g. for XML files, its structure needs to be understood and some syntactic sugar might be appreciated,
- possibly anonymize the data (in order to respect some legal constraints),
- sanitize the data (remove degenerate data, or degenerate/ill-formed data fields),
- re-sample the data when there are temporally missing captures,
- qualify the data: some data might be too redundant to contain "enough" information.

All such tasks might require dedicated tools and a specific know-how that is of low interest to the researcher, yet can be time consuming due to its technicality (a sketch of some of these steps is given below).
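
The following is a minimal sketch (in Python, assuming pandas is available and that the raw data is an XML file of time-stamped records with hypothetical `time` and `value` fields) of a few of the preparation steps listed above: parsing, sanitizing, re-sampling and a crude qualification indicator.

```python
import xml.etree.ElementTree as ET
import pandas as pd

def prepare(xml_path):
    # Read/parse: load the XML records into a DataFrame (hypothetical schema).
    root = ET.parse(xml_path).getroot()
    records = [
        {"time": rec.findtext("time"), "value": rec.findtext("value")}
        for rec in root.iter("record")
    ]
    df = pd.DataFrame(records)

    # Sanitize: drop ill-formed records (missing or non-numeric fields).
    df["time"] = pd.to_datetime(df["time"], errors="coerce")
    df["value"] = pd.to_numeric(df["value"], errors="coerce")
    df = df.dropna().set_index("time").sort_index()

    # Re-sample: fill temporally missing captures on a regular (here hourly) grid.
    df = df.resample("1h").mean().interpolate()

    # Qualify: a crude redundancy indicator (share of unchanged consecutive values).
    redundancy = (df["value"].diff() == 0).mean()
    return df, {"redundancy": float(redundancy)}
```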
Proposition: gather such tools, libraries, recipes, code snippets in order to ease the burden of researchers.
Notes:
- Gathering such tools might start by pointing to already existing ad-hoc websites...
- There already exist ETL frameworks as well as specialized ETLs (like HALE or FME for spatial data). Using such frameworks (as opposed to general purpose scripting languages boosted with wrapped ad-hoc libraries) to express one's recipes might prove to be a big time saver. A scripted recipe of the latter kind is sketched below.
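
The following is a minimal sketch (assuming the geopandas library and a hypothetical `roads.shp` input file) of an ETL recipe expressed in a general purpose scripting language: extract a shapefile, transform it (re-project), and load it as GeoJSON. Dedicated frameworks such as FME or HALE let one express the same recipe without writing such code.

```python
import geopandas as gpd

def shapefile_to_geojson(src="roads.shp", dst="roads.geojson"):
    layer = gpd.read_file(src)            # Extract: read the spatial data
    layer = layer.to_crs(epsg=4326)       # Transform: re-project to WGS84
    layer.to_file(dst, driver="GeoJSON")  # Load: write the target format
```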
## Display a team's know-how on some given data
When New York City opened up its data, it took two students a few weeks before being able to concretely show some 3D rendering of the city geometry based on that data. Extracting advanced information (juicing the data) might prove to require real know-how. For example, computing the road network load out of a geometrical description of a road network (dedicated to 3D rendering) will first require some topological fixes of the data: two endpoints of road segments might be geometrically superimposed yet not topologically connected (see the sketch below). Blending the geometry/topology of such a network with the traffic-light schedules might not be science per se, yet might prove to be a non-trivial technical task.
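
A minimal sketch (plain Python, hypothetical coordinates) of the kind of topological fix mentioned above: road segments whose endpoints are geometrically superimposed within a tolerance, but not topologically connected, are snapped onto shared nodes so that a usable network graph can be built.

```python
def snap(point, tolerance=0.01):
    # Quantize coordinates so nearly coincident endpoints map to the same node key.
    return (round(point[0] / tolerance) * tolerance,
            round(point[1] / tolerance) * tolerance)

def build_network(segments, tolerance=0.01):
    # segments: list of ((x1, y1), (x2, y2)) pairs from the geometric description.
    adjacency = {}
    for start, end in segments:
        a, b = snap(start, tolerance), snap(end, tolerance)
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    return adjacency

# Two segments that touch geometrically but were stored with distinct endpoints:
network = build_network([((0.0, 0.0), (1.0, 0.0)),
                         ((1.000001, 0.0), (2.0, 0.0))])
```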
A team wishing to illustrate such know-how might need to "offer" some resulting data samples and, beyond that, present such pre-treatment algorithms (e.g. a service offering on-line clean-up of client-uploaded data, as sketched below). Offering such a service will create technical needs for the team...
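
A minimal sketch (assuming the Flask library and a hypothetical `clean_up()` function standing in for the team's actual pre-treatment pipeline) of such an on-line clean-up service: clients upload their raw data and receive the cleaned version back.

```python
from flask import Flask, request

app = Flask(__name__)

def clean_up(raw_bytes):
    # Placeholder for the team's actual pre-treatment algorithms.
    return raw_bytes

@app.route("/clean", methods=["POST"])
def clean():
    # Read the client-uploaded file, clean it, and return the result.
    uploaded = request.files["data"].read()
    cleaned = clean_up(uploaded)
    return cleaned, 200, {"Content-Type": "application/octet-stream"}

if __name__ == "__main__":
    app.run()
```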
## Meta-data
When given some raw data, automatically (or semi-automatically) extracted/generated associated meta-data can be valuable and a result per se. Such meta-data might range from simple information, like boundary values, number of samplings or content access limits, to more advanced quality indicators, be they qualitative (poor, medium, high) or quantified. When valuable, such meta-data might need to be stored, retrieved, mined... (a minimal extraction sketch is given below).
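
A minimal sketch (plain Python, assuming the raw data reduces to a list of numeric samples and hypothetical quality thresholds) of such (semi-)automatic meta-data extraction: boundary values, number of samplings and a crude qualitative indicator, ready to be stored alongside the data.

```python
def extract_metadata(samples, quality_thresholds=(100, 10000)):
    # Boundary values and number of samplings.
    count = len(samples)
    # Crude qualitative indicator based on the number of samplings alone.
    low, high = quality_thresholds
    quality = "poor" if count < low else "medium" if count < high else "high"
    return {
        "min": min(samples),
        "max": max(samples),
        "count": count,
        "quality": quality,
    }

# Example: metadata = extract_metadata([3.2, 7.5, 1.1, 9.8])
```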