OpenGeoHub Summer School Siegburg 2022

Summer School 2022 / KISTE project workshop:
“Open Source solutions for Earth system data
(R, OSGeo, Python)”

Dates: 28 August 2022 – 03 September 2022
Location: Friendly City Hotel Oktopus, Siegburg (Germany)

The OpenGeoHub Summer School is an annual event that has been running at various locations in Europe, Canada, and Australia since 2010. Every year, we invite researchers and specialists whom we consider especially active (and successful) in developing open source software and open data, and in helping other researchers improve their analysis and modeling frameworks. The 2022 event will be held in hybrid form, with physical lectures and hackathons, with the support of the KISTE project. The Summer School will include:
  • Live presentations and demos by leading R and OSGeo developers,
  • 5 days of Earth System Analysis training sessions and R / OSGeo tutorials
    • Including 1 day of R-Spatial workshops,
    • Including 1 focus day “extreme events”
  • Discussion panels and break-out rooms.

This Summer School is a not-for-profit event and can be followed online.


Julia is a programming language that is simple to write and scriptable like Python and R, but fast like C or C++. At 10 years old, it is a young language, so its ecosystem is not yet as large and mature as one might wish. Maarten Pronk was an early adopter of the language in his research at Deltares, a Dutch research institute. In this lecture he introduced Julia, his motivation for using it, and his open source journey. The first half of the lecture was non-spatial, while the latter half focused on the JuliaGeo ecosystem and showcased some of the possibilities of the Julia language.

All materials are available here:

This session introduced the machine-learning framework {mlr3} in R. {mlr3} is a framework similar to {tidymodels} but built on different backends: it relies heavily on {data.table} and {future} and uses the R6 object-oriented programming system. Essential building blocks and basic concepts were presented to participants and then applied in a hands-on tutorial.
All materials are available here:

In the first part, participants saw how to visualise large cloud-optimised data sets in R. They used the R packages {leaflet} and {leafem}, and specifically the development version of {leafem}. They saw how to render large vector and raster data sets that are hosted remotely (in this case, an AWS S3 bucket) and learned how to control styling and appearance. In the second part, they learned how to prepare their own data for hosting in the cloud and how to visualise it using the tools from the first session. This part was mostly about using the command line to prepare data, rather than R. It was important because the future of data analysis lies increasingly in the cloud, and it gave participants the means to share their large data sets with the world (or with whomever they want to share them).

All materials are available at:

Stan Openshaw wrote in 2000 that “GeoComputation is about using the various different types of geodata and about developing relevant geo-tools within the overall context of a ‘scientific’ approach”. In subsequent years, the idea of geocomputation gained much traction, with many new spatial data models, spatial data sources, geocomputation methods, and spatial visualizations. When this definition was made, it was unrealistic to expect people to reproduce or replicate code examples automatically. Fortunately, alongside these geocomputation developments, we have seen a growing interest in reproducibility in many fields, including geocomputation. Reproducibility has many advantages: it promotes best practices, improves transparency and reusability, and allows code and workflows to be shared. The goals of this tutorial were to: a) provide an introduction to the basic concepts of geocomputation, geocomputation with R, and reproducibility, and b) show various approaches and tools that allow for reproducibility and replicability, including R scripts, RStudio projects, {reprex}, {renv}, Git, Docker, and more. The tutorial mixed theory and practice: each concept was first described and explained, and then the attendees had a chance to solve problems related to geocomputation with R and the reproducibility of geospatial analysis. All materials:
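As a minimal, language-agnostic illustration of one reproducibility building block from the tutorial (which itself centres on R tools such as {reprex} and {renv}), the Python sketch below fixes a random seed so that a stochastic analysis yields identical results on every run:

```python
# Minimal sketch of one reproducibility building block: fixing random
# seeds so a stochastic analysis gives identical results on re-runs.
# (Illustrative only; the tutorial itself uses R tooling.)
import random

def run_analysis(seed: int, n: int = 5) -> list[float]:
    """Draw n pseudo-random values from a locally seeded generator."""
    rng = random.Random(seed)  # local generator: no hidden global state
    return [rng.random() for _ in range(n)]

first = run_analysis(seed=42)
second = run_analysis(seed=42)
assert first == second  # same seed -> identical results on every run
```

Recording package versions (as {renv} or a Docker image does) plays the same role for the software environment that the seed plays for the random numbers.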

Edzer Pebesma went through chapters 1–3 and illustrated the material with the sf package (see ch. 7). This introduced spatial data, vector data, tessellations, geometries, simple features, geometric predicates, and geometric transformations.

All materials are available:

In this second part, Edzer Pebesma went through chapters 4–6 and discussed spherical geometry, spatiotemporal information, irregular spatiotemporal data in the sftime package, spatiotemporal raster and vector data cubes in the stars package, and chapter 5 on the support of spatial data.
All materials are available:

Open Reproducible Research (ORR) is a crucial topic in Open Science and combines several open practices, such as Open Data, Open Code, and Open Source Infrastructure. It can help us understand and tackle wicked global challenges, such as climate change, natural hazards, urbanisation, digitisation, and pandemics. But what does ORR actually mean? Is it common practice? (A little spoiler: It’s not!). What do you need to consider and are there any tools that can help you? You will find answers to these questions in this lecture and learn what is needed to achieve ORR, which obstacles make it difficult, and how ORR can contribute to increasing the transparency, verifiability, and reusability of research results.

All materials are available:


In this contribution, we systematically walk through stochastic processes, starting from random variables in one, two, and more dimensions. We show statistical tests and ways to create statistically relevant test data based on requirements or assumptions about the models to be tested. The emphasis is on testing strategies using standard Python libraries.
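A minimal sketch of this workflow, assuming numpy and scipy are available: draw test data from a known model, then check that a sample is statistically consistent with that model using a Kolmogorov–Smirnov test.

```python
# Sketch: generate test data under a known model, then verify a sample
# statistically matches that model with a Kolmogorov-Smirnov test.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sample = rng.normal(loc=0.0, scale=1.0, size=1000)  # 1-D Gaussian test data

# Two-sided KS test of the sample against the standard normal CDF.
result = stats.kstest(sample, "norm")
print(f"KS statistic={result.statistic:.3f}, p={result.pvalue:.3f}")
```

A large p-value means no evidence against the assumed model; the same pattern extends to multivariate checks with other tests from scipy.stats.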
All materials available:

In this training session, you will learn about the main concepts of raster data cubes, cloud-optimized GeoTIFF (COG), and the SpatioTemporal Asset Catalog (STAC), working through a practical example in Python. Using the eumap library and the training samples provided in the hackathon, you will perform a complete workflow for spatial predictive mapping, including: a space-time overlay (through STAC + COG), training a Random Forest classifier (with hyper-parameter optimization), and producing a classification output (also through STAC + COG). All the steps were executed in Google Colab, and all the data (points and rasters) were accessed directly from the cloud.
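The model-training step alone can be sketched as follows, assuming scikit-learn is available; the synthetic features below stand in for the real STAC/COG overlay values used in the session (eumap itself is not required here):

```python
# Sketch of the model-training step only: a Random Forest classifier with
# hyper-parameter optimization via cross-validated grid search.
# Synthetic features stand in for overlaid STAC/COG raster values.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))               # 200 samples, 4 covariates
y = (X[:, 0] + X[:, 1] > 0).astype(int)     # toy land-cover labels

grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [50, 100], "max_depth": [3, None]},
    cv=3,
)
grid.fit(X, y)
print(grid.best_params_)                    # the tuned hyper-parameters
```

In the real workflow, X and y come from the space-time overlay, and the fitted model is applied tile by tile to produce the classification output.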

All materials are available:

The ecosystem of packages for spatial data handling and analysis in Python is extensive and covers both vector and raster analytics, from small to large distributed data. This talk covers only a small part, focusing on vector data processing with GeoPandas at its core. First, it covers what GeoPandas is, how it relates to other packages, and how it combines them into a user-friendly API. Then it looks at what GeoPandas enables, with a light introduction to PySAL (the Python Spatial Analysis Library) for spatial statistics and modelling. The final part touches on scalability and introduces how to handle parallel computation and big data using GeoPandas and Dask. All of this is illustrated with hands-on tasks on real-world data.
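As a rough, standard-library-only illustration of the partitioning idea behind the Dask part (GeoPandas and Dask themselves are not assumed installed here): split a table into chunks, process the chunks independently, and combine the partial results.

```python
# Stdlib-only sketch of chunked parallelism, the idea behind Dask:
# partition a large table, compute per-partition results in parallel,
# then combine them. Real dask-geopandas does this over GeoDataFrames.
from concurrent.futures import ThreadPoolExecutor

rows = [{"id": i, "area": float(i % 7)} for i in range(1000)]

def total_area(chunk):                      # per-partition computation
    return sum(r["area"] for r in chunk)

chunks = [rows[i:i + 250] for i in range(0, len(rows), 250)]
with ThreadPoolExecutor() as pool:
    partials = list(pool.map(total_area, chunks))

print(sum(partials) == total_area(rows))    # -> True
```

The combine step here is a plain sum; in Dask, the framework builds and schedules this split/compute/combine graph automatically.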
All materials are available here:

Apache Arrow and Apache Parquet are upcoming formats (standards, with software libraries) for cloud-native data in columnar storage; their geospatial counterparts, GeoArrow and GeoParquet, do the same for simple feature (vector) spatial data. In this lecture, the lecturers explained what these terms mean (“cloud-native”, “columnar storage”, “chunking”), how Arrow and Parquet relate and differ, why and when you would (or would not) want to use them, and demonstrated how to use them from R and/or Python.
All materials available here:

Machine learning approaches, particularly deep neural networks, are showing tremendous success in finding patterns and relationships in large datasets for prediction and classification, which are typically too complex for humans to grasp directly. In many cases, models are learned for automation as soon as manual analysis and interpretation of the data are too costly. Explainable machine learning, which analyzes the decision-making process of machine learning methods in more detail, is used whenever an explanation of the result is needed in addition to the result. This can be for various reasons, such as increasing trust in the result or deriving new scientific knowledge that can be inferred from patterns in the decision-making process of the machine learning model. In this lecture, we will consider the basics of explainable machine learning, reasons to seek explanations, and some applications and methods from close-range and satellite-based remote sensing. Applications include whale detection, ozone value estimation, and discovery of wilderness characteristics.
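One widely used technique in this area, permutation importance, can be sketched with plain numpy: shuffle one feature at a time and measure how much the model's error grows; important features degrade predictions the most. The “model” below is a hypothetical stand-in, not one from the lecture.

```python
# Sketch of permutation importance, a common explainability technique:
# shuffle one feature at a time and measure how much the model's error
# grows. Important features degrade predictions the most.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = 2.0 * X[:, 0] + 0.1 * X[:, 2]          # feature 1 is irrelevant

def model(X):                              # stand-in "trained" model
    return 2.0 * X[:, 0] + 0.1 * X[:, 2]

base_mse = np.mean((model(X) - y) ** 2)
importance = []
for j in range(X.shape[1]):
    Xp = X.copy()
    Xp[:, j] = rng.permutation(Xp[:, j])   # break feature j's relationship
    importance.append(np.mean((model(Xp) - y) ** 2) - base_mse)

print(np.argmax(importance))               # -> 0, the dominant feature
```

The same shuffle-and-remeasure idea applies unchanged to deep networks on remote sensing imagery, where it is one of several model-agnostic explanation methods.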

All materials available here:

In this session the lecturer discussed current challenges of using machine learning in the context of environmental monitoring. More specifically, the suitability of different cross-validation strategies for assessing prediction performance was discussed, including novel developments such as the “nearest neighbor distance matching” method. Attendees further learned how suitable cross-validation strategies can be used to improve prediction models by applying them during hyperparameter tuning and variable selection. Finally, she taught them about the “area of applicability” of spatial prediction models, which limits predictions to the area where the model was able to learn about relationships. In the first part of the session, she guided the attendees through the motivation and conceptual ideas behind these methods; in the second part, she showed how to apply them in practice. The newly suggested methods are implemented in the R package CAST, which was used together with the R package caret, but the attendees also learned how to use the methods with mlr3.
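The core idea of leave-location-out cross-validation can be sketched in a few lines of Python (the session itself used R's CAST, caret, and mlr3): every sample from one spatial cluster is held out together, so a test fold is never a near-duplicate neighbour of the training data.

```python
# Sketch of spatial ("leave-location-out") cross-validation: all samples
# from one spatial cluster are held out together, instead of random
# splits that leave near-identical neighbours in both train and test.
import numpy as np

cluster = np.repeat([0, 1, 2, 3], 3)       # 4 locations, 3 samples each

folds = []
for held_out in np.unique(cluster):
    test_idx = np.where(cluster == held_out)[0]
    train_idx = np.where(cluster != held_out)[0]
    # ... fit the model on train_idx, evaluate on test_idx ...
    folds.append((train_idx, test_idx))
```

Random splits over clustered spatial samples leak information between folds and inflate apparent accuracy; holding out whole locations gives a more honest estimate of performance at unsampled places.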

All materials are available:

This lecture provides an introduction to deep machine learning in the domain of weather, climate, and air quality research. The atmosphere is a highly complex, dynamical system in which many physical, chemical, and biological processes interact on a wide range of spatial and temporal scales. As a consequence, atmospheric data have some properties that differ from other machine learning applications, and established ML methods may not always work well when applied to them. The lecture is structured into four parts. First, we discuss some general properties of the atmosphere and atmospheric data. Next, a brief summary of atmospheric statistics and evaluation metrics is given. In the third part, we learn some machine learning fundamentals and compare machine learning models with numerical (weather) models. The fourth part provides several examples of machine learning applications in the weather and climate domain, with a focus on weather forecasting.
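A small numpy sketch of how forecasts in this domain are commonly evaluated: skill relative to a trivial reference forecast such as persistence (“tomorrow equals today”). The numbers below are made up for illustration.

```python
# Sketch of a standard evaluation metric in weather forecasting: RMSE
# skill relative to a persistence baseline ("tomorrow equals today").
# The observations and forecast values here are invented for the demo.
import numpy as np

obs = np.array([10.0, 12.0, 11.0, 13.0, 12.5])     # observed temperature
forecast = np.array([10.5, 11.5, 11.2, 12.8, 12.4])

persistence = obs[:-1]               # yesterday's value as the forecast

def rmse(a, b):
    return np.sqrt(np.mean((a - b) ** 2))

rmse_model = rmse(forecast[1:], obs[1:])
rmse_ref = rmse(persistence, obs[1:])
skill = 1.0 - rmse_model / rmse_ref  # > 0 means better than persistence
print(round(skill, 2))
```

A model is only considered useful once its skill against such simple references (persistence, climatology) is clearly positive; raw RMSE alone is hard to interpret.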

Useful readings are mentioned in the presentation below:

This workshop compared the most popular packages for raster and vector data processing in R and Python, examined the differences between them, and ran a benchmark to see which performs best.

All materials are available:

This talk addressed some of the main modern challenges of geospatial data science, grouped around five main aspects: (i) training data issues — these include mismatches in standards when recording measurements, legal and technical issues that limit data use, and the high costs of producing new observational data; (ii) modeling issues — these include point clustering and extrapolation problems, model overfitting, artifacts in input data, lack of historical data, lack of consistent global monitoring stations (Kulmala, 2018), and limited availability and/or quality of global covariate layers; (iii) data distribution issues — these include unusable file formats, high data volumes, and incomplete or inaccurate metadata; (iv) usability issues — these include incompleteness of data, unawareness of user communities and/or data limitations, and data being irrelevant for decision making; and (v) governance issues — these include closed data policies of (non-)governmental organizations, datasets not used by organizations, lack of baseline estimates, and absence of strategies for updating data products. OpenGeoHub, together with 21 partners, has launched an ambitious new European Commission-funded Horizon Europe project called “Open-Earth-Monitor”, which aims to tackle some of the bottlenecks in the uptake and effective use of environmental data, both ground observations and measurements and EO data. They plan to build upon existing Open Source and Open Data projects and to create a number of tools, good practice guidelines, and datasets that can help cope with some of these modern challenges.

Even though extremes seem to occur more often in recent times, they are by definition rare events. This poses challenges for the modeling and prediction of these processes, as we have only a few samples. This class is meant as a gentle introduction to extreme value theory in the uni- and multivariate cases. Different examples and approaches for fitting models to heavy-tailed distributions were shown, along with extensions for flexibly fitting multivariate distributions. Models and their implications in spatial and spatio-temporal contexts were discussed. The second part of the class was devoted to an R-based hands-on session in which the taught approaches were applied to participants' own data sets or demo data.
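The block-maxima approach at the heart of univariate extreme value theory can be sketched in Python, assuming numpy and scipy are available; the data here are simulated, not from the class.

```python
# Sketch of the block-maxima approach from extreme value theory: take
# the maximum of each block of observations and fit a Generalized
# Extreme Value (GEV) distribution to those maxima. Data are simulated.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
daily = rng.gumbel(loc=20.0, scale=5.0, size=(50, 365))  # 50 "years"
annual_max = daily.max(axis=1)                           # block maxima

shape, loc, scale = stats.genextreme.fit(annual_max)
# 100-year return level: the value exceeded with probability 1/100 per year
return_level = stats.genextreme.ppf(0.99, shape, loc=loc, scale=scale)
print(f"100-year return level ~ {return_level:.1f}")
```

The fitted tail, rather than the empirical histogram, is what lets you extrapolate beyond the observed record — which is exactly why model choice and its spatial extensions matter so much here.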


Topics of interest

Topics of interest are focused around, but not limited to (unsorted):

  • AI for Earth System Sciences 
  • Modelling Extreme Events (with ML/AI and statistics) (~1 day)
  • Classifying remote sensing time series data,
  • Design-based methods for assessing map accuracy,
  • Point pattern analysis in R,
  • Analysing spatiotemporal disease/health data,
  • Reproducible research,
  • Machine Learning and deep learning methods for spatial data,
  • Analyzing massive amounts of EO data in the cloud with R, gdalcubes, and STAC,
  • Running a Research Data infrastructure with OS Tools
  • WebGIS,
  • Geocomputation and spatial segmentation in R,
  • Static and interactive visualization of spatial data


Materials in terms of R-markdown tutorials, screen-recordings and similar will be provided to all participants before the start of the Summer School via our official Mattermost channel.

Social events & hackathon

Recipe: Pre-sort the group by broader interest into subgroups. Split each subgroup equally: one half stays stationary while the other half moves “clockwise” through the room. Pairs of people sit down for a given amount of time (~7 min) and present their research topics to each other.

Recipe: Interested participants sign up for a 3-minute madness, pitching their research in 3 minutes “on stage” in front of all participants. Elect a jury (from among the lecturers) that will ask quick questions and finally vote for the “best pitch”. This could also be flavored as a “science slam”.

Participants can nominate their latest research for a reproducibility check. Share your paper and the associated resources with the group and let’s explore how far we can reproduce your study and whether everyone achieves the same result.

Why should you attend this event?

  1. Find out about the latest trends in spatial analysis and modeling from leading open source developers.
  2. Follow online demos / tutorials, ask questions, share your experiences, and help us resolve important issues.
  3. Connect to similar participants and network. Find co-authors and co-creators that match your style of work and ambitions.
  4. Contribute to the Open Source community and global good.

Target communities

  • Open source development communities on GitHub and GitLab,
  • R-sig-geo mailing list,
  • package users,
  • OSGeo community,
  • PhD and MSc level students in the field of environmental sciences, GIS, civil engineering, spatial modelling, GeoAI, Earth Sciences and similar,


We have made a reservation at the Friendly City Hotel Oktopus in Siegburg, close to Bonn, for the period 28.08.2022 – 03.09.2022. The hotel offers large and well-equipped conference rooms, and the venue would also allow us to host parallel sessions as in previous years.

The offer includes:

  • Air-conditioned conference room (max. capacity 80 people) with balcony, including standard equipment (beamer, screen, 3 flipcharts, 3 moderator boards, paper & pencils);
  • Free internet access throughout the hotel (600 Mbit);
  • Coffee & tea all day long from the specialty machine (in front of the conference room); soft drinks will be charged according to consumption;
  • Candy bar with sweet and sour delicacies (in front of the conference room);
  • Active break in the morning: specialty coffee from the machine & various teas, fruit juices & sandwiches;
  • Lunch at noon: finger food in front of the room or buffet in the restaurant, including “Oktopus” water on the table;
  • Active break in the afternoon: specialty coffee from the machine & various teas, freshly cut fruit and pastries;

Accommodation costs (fixed price):

  • 105 EUR bed & breakfast (single);
  • 140 EUR bed & breakfast (double occupancy);

The pricing is at the upper boundary of what appears reasonable. The rooms and facilities appear modern and of good quality. The hotel is rated 3-star and has an 8.1 rating on

Technology in use

Virtual machines / software installation

For consistency, we recommend that all participants use a prepared virtual machine (Docker containers), ideally via:

  • Linux users: DistroBox;
  • Windows users: VirtualBox / alternative: WSL;

This ensures that everyone uses exactly the same settings, package versions, etc.

The use of virtual containers / a virtual OS comes at the cost of RAM, so we need to provide all participants with instructions on how to prepare, and emphasize that their laptops should have a minimum configuration of 16 GB RAM or similar.

Video recordings

All lectures will be video-recorded using Zoom webinar functionality in HD quality. Subject to the approval of the presenters, the videos will be uploaded and a DOI will be assigned to each talk. Lecturers will be asked to accept the general recording conditions and sign a document allowing the videos to be shared. Copyright of the videos will be assigned to the authors / presenters, as with standard Open Access materials.

Code of conduct

Please read the Summer School Code of Conduct carefully before participating.

Spread the love