Following up on the MLOps series, let me introduce AQIMadrid, a real use case.
Over the last few years, most capital cities in Europe have started to provide free APIs for fetching real-time data from sensors distributed across different areas.
However, this data is often not tightly integrated and contains errors, so its complex management makes it hard for analysts and data scientists to work with.
A particular case of this problem is the Madrid City Council. Its data portal (1) provides many APIs covering pollutants, weather, traffic, parking, population, road accidents, etc.
Yet all the effort put into providing those APIs has done little to help people understand how data shapes our lives.
There are two main official data sources, which are published on different schedules.
First, the Madrid City Council pollution data, available in different formats (.json, .csv, ...) through a free API. This data source is usually, but not always, published between the 20th and 30th minute of every hour (2). There have been days when it was not published, and some of the published data is unverified. At first it included data from 24 stations across Madrid; lately it publishes 23 stations. The same data portal also provides real-time weather data, but most of its weather stations don't match the pollutant stations, since they are located in other areas of Madrid.
Second, real-time weather data from a weather portal. This source is much more robust and reliable, though it lacks real-time pollutant data. After analyzing different portals, such as AEMET and Openweather, Accuweather became the weather portal of choice because it performed best and matched the pollutant stations. The main caveats are that its API is restricted to subscribers and that it cannot be fetched with scraping and crawling tools such as Scrapy or BeautifulSoup. On the other hand, it includes extended forecast data.
This can cause problems when automating data fetching.
Next, real-time data needs to be constantly inserted or updated (upserted) in a database so that it can be analyzed, cleaned, and transformed (ETL) as input to a machine learning model that computes real-time weather and pollution forecasts for the given stations.
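As a rough illustration of the upsert step, here is a minimal sketch using SQLite from the Python standard library. The table layout, the column names, and the `create_schema` and `upsert_measurement` helpers are assumptions made for this example and are not the actual AQIMadrid schema. The `ON CONFLICT ... DO UPDATE` clause needs SQLite 3.24 or newer.

```python
import sqlite3


def create_schema(conn: sqlite3.Connection) -> None:
    # Hypothetical layout: the composite key identifies one reading.
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS measurements (
            station_id  TEXT NOT NULL,
            measured_at TEXT NOT NULL,  -- ISO-8601 timestamp
            pollutant   TEXT NOT NULL,
            value       REAL,
            PRIMARY KEY (station_id, measured_at, pollutant)
        )
        """
    )


def upsert_measurement(conn, station_id, measured_at, pollutant, value) -> None:
    # Insert a new reading, or overwrite the value if the key already exists.
    conn.execute(
        """
        INSERT INTO measurements (station_id, measured_at, pollutant, value)
        VALUES (?, ?, ?, ?)
        ON CONFLICT (station_id, measured_at, pollutant)
        DO UPDATE SET value = excluded.value
        """,
        (station_id, measured_at, pollutant, value),
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
create_schema(conn)
# The same reading arrives twice, e.g. first unverified and later corrected.
upsert_measurement(conn, "station-1", "2021-03-01 10:00:00", "NO2", 41.0)
upsert_measurement(conn, "station-1", "2021-03-01 10:00:00", "NO2", 38.5)
rows = conn.execute(
    "SELECT station_id, pollutant, value FROM measurements"
).fetchall()
print(rows)
```

Keying on station, timestamp, and pollutant means a re-published reading replaces the earlier one instead of creating a duplicate row.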
Ideally, a tracking system for predicted data can be set up, so that if the model's forecasts lose accuracy over time, a new model is trained and put into production.
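One possible way to track forecast accuracy is to keep a rolling window of absolute errors and flag the model for retraining once the mean error crosses a threshold. The sketch below assumes this approach; the `ForecastMonitor` class, the window size, and the threshold are all hypothetical values chosen for illustration.

```python
from collections import deque


class ForecastMonitor:
    """Rolling mean absolute error (MAE) over the most recent forecasts."""

    def __init__(self, window: int = 24, max_mae: float = 10.0) -> None:
        # deque with maxlen drops the oldest error once the window is full.
        self.errors = deque(maxlen=window)
        self.max_mae = max_mae

    def record(self, predicted: float, observed: float) -> None:
        self.errors.append(abs(predicted - observed))

    def mae(self) -> float:
        return sum(self.errors) / len(self.errors) if self.errors else 0.0

    def needs_retraining(self) -> bool:
        # Only decide once the window is full, to avoid reacting to a few
        # noisy readings right after deployment.
        window_full = len(self.errors) == self.errors.maxlen
        return window_full and self.mae() > self.max_mae


monitor = ForecastMonitor(window=3, max_mae=5.0)
for predicted, observed in [(10.0, 12.0), (10.0, 11.0), (10.0, 13.0)]:
    monitor.record(predicted, observed)
print(monitor.mae(), monitor.needs_retraining())
```

In a workflow, a positive `needs_retraining()` result could be the condition that triggers the training task.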
Finally, predicted data could be displayed attractively to end users and easily integrated with other systems, for instance a booking backend app.
One way to solve data ingestion is by triggering workflows, where all the data extraction, cleaning, and transformation is performed in an automated fashion.
The city council data can be fetched with a REST client. The weather data has to be fetched through a web browser, as the Accuweather APIs are limited, so one solution is to use a browser automation tool such as Selenium with a browser driver and extract the data from the HTML document using CSS or XPath selectors.
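For the REST side, a minimal sketch with the Python standard library could look like the following. Because the city council data is not always published at the same minute, the sketch retries a few times before giving up. The URL is a placeholder, and the `http_get` and `fetch_json` helpers, the retry count, and the wait time are assumptions for this example, not the real portal endpoint or the project's client.

```python
import json
import time
import urllib.request

# Placeholder URL: replace with the real endpoint from the data portal.
POLLUTION_URL = "https://example.org/madrid/air-quality.json"


def http_get(url: str, timeout: float = 10.0) -> bytes:
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def fetch_json(url, get=http_get, retries=3, wait_seconds=60.0, sleep=time.sleep):
    """Fetch and decode JSON, retrying when the data is not available yet."""
    last_error = None
    for attempt in range(retries):
        try:
            return json.loads(get(url))
        except (OSError, ValueError) as error:
            # OSError covers network failures, ValueError covers invalid JSON.
            last_error = error
            if attempt < retries - 1:
                sleep(wait_seconds)
    raise RuntimeError(f"could not fetch {url}") from last_error
```

A scheduled workflow task would then call `fetch_json(POLLUTION_URL)` shortly after the expected publication time. Passing `get` and `sleep` as parameters keeps the function easy to test without network access.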
Next, the timing mismatch between data sources could be solved with SQL views that match timestamps across our data tables. However, since this use case also needs to gather weather forecast data, and the data gets upserted, a more suitable way to handle all this matching and transformation is probably a distributed broker with persistence such as Apache Kafka, using topics where data can be re-consumed many times through consumer group offsets. This also makes Apache Kafka suitable as an ephemeral data store for publishing predicted data.
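To make the SQL view option concrete, here is a minimal sketch with SQLite that joins pollution and weather rows falling within the same hour at the same station. The table names, the columns, the `build_hourly_view` helper, and the assumption that both sources share a `station_id` are all hypothetical and only serve the example.

```python
import sqlite3


def build_hourly_view(conn: sqlite3.Connection) -> None:
    # Hypothetical tables: one row per reading, timestamps as ISO-8601 text.
    conn.executescript(
        """
        CREATE TABLE pollution (station_id TEXT, measured_at TEXT, no2 REAL);
        CREATE TABLE weather (station_id TEXT, observed_at TEXT, temperature REAL);

        -- Match rows from both sources on station and on the hour they fall in.
        CREATE VIEW hourly_observations AS
        SELECT p.station_id,
               strftime('%Y-%m-%d %H:00', p.measured_at) AS hour,
               p.no2,
               w.temperature
        FROM pollution AS p
        JOIN weather AS w
          ON w.station_id = p.station_id
         AND strftime('%Y-%m-%d %H', w.observed_at)
             = strftime('%Y-%m-%d %H', p.measured_at);
        """
    )


conn = sqlite3.connect(":memory:")
build_hourly_view(conn)
# Pollution arrives around minute 25, weather at minute 5 of the same hour.
conn.execute("INSERT INTO pollution VALUES ('s1', '2021-03-01 10:25:00', 40.0)")
conn.execute("INSERT INTO weather VALUES ('s1', '2021-03-01 10:05:00', 12.5)")
conn.execute("INSERT INTO weather VALUES ('s1', '2021-03-01 11:05:00', 13.0)")
matched = conn.execute("SELECT * FROM hourly_observations").fetchall()
print(matched)
```

Truncating both timestamps to the hour is the simplest matching rule; it assumes at most one reading per source, station, and hour.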
Later, the workflow can also trigger training and predictions on a machine learning model server such as MLFlow or Kubeflow, where the model artifacts, hyperparameters, and metrics are stored and served, and which includes API management.
Finally, a full-stack app that consumes predicted data can be developed to handle authentication/authorization against our backend services and display the results. To stay tightly coupled with the frontend in the browser while being responsive and able to target mobile platforms (iOS, Android, ...), a popular framework of choice is ReactNative with a backend in NodeJS, such as NextJS, so that it can be deployed on any PaaS or mobile app store. You can preview the live app here