Commit 0222364

Merge pull request #237 from valor-software/development
feat(blog): medusa app and github
2 parents 3f1279e + 913bbdd commit 0222364

7 files changed

Lines changed: 97 additions & 1 deletion

File tree

assets/articles/0055-rendering-nativescript-angular-templates-and-components-into-images/rendering-nativescript-angular-templates-and-components-into-images.adoc

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -140,7 +140,7 @@ class ContentViewDummy extends ContentView {
140140
}
141141
----
142142
143-
Now we just need to make sure that its visibility is set to collapse and use a very convenient method from the AppCompatActivity (https://developer.android.com/reference/androidx/appcompat/app/AppCompatActivity#addContentView(android.view.View,android.view.ViewGroup.LayoutParams)[addContentView, window=_blank]) to add the view to the root of the activity, essentially adding it to the window but completely invisible.
143+
Now we just need to make sure that its visibility is set to collapse and use a very convenient method from the AppCompatActivity (https://developer.android.com/reference/androidx/appcompat/app/AppCompatActivity[addContentView, window=_blank]) to add the view to the root of the activity, essentially adding it to the window but completely invisible.
144144
145145
[, js]
146146
----
Lines changed: 84 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,84 @@
1+
== The Problem
2+
3+
Here at Valor Software, we faced the challenge of analyzing developer productivity metrics. So we started asking ourselves: what are the core daily activities of a developer? On a macro level, we could say it is delivering clean and reliable code on top of a consistent base, but at what commit frequency? How do code deployments correlate with bugs and other issues? Is this related to teams, to specific projects, or even to the technology stack in use?
4+
5+
In Data Science, we usually start our investigations with the scientific method (and the https://www.datascience-pm.com/crisp-dm-2/[CRISP-DM approach, window=_blank]) to better understand the target situation and its surroundings, and above all to ask the questions that will lead us to the root cause of our problems.
6+
7+
OK, that is a lot of fancy stuff. What does all of this have to do with the Valor Software Medusa project? We realized that GitHub is a good source of data on developer productivity, as we can track information about repositories, commits, tests, and much more. This drove us to develop a data pipeline to extract and process data from it.
8+
9+
=== The Solution
10+
11+
We designed a business framework covering the core daily activities of a developer and, as far as GitHub is concerned, split it into four pillars:
12+
13+
* `*__Version Control Management__*`
14+
* `*__Compatibility__*`
15+
* `*__Infrastructure__*`
16+
* `*__Team__*`
17+
18+
Organizing the problem is a very important step in Data Science/Engineering projects, as it tells us which data sources we should consume and which metrics we want to build. It makes no sense to build a flashy dashboard full of metrics that are unrelated to the operation.
19+
20+
=== How it was built
21+
22+
The idea is to hit the GitHub API, collect the necessary metrics from a range of endpoints, save the result in an intermediary layer, and then load it into the database.
23+
24+
`*Why do we use an intermediary layer rather than saving the data directly to the database?*`
25+
26+
We usually follow this approach to make the pipeline more resilient. Imagine we spend hours iterating over API pagination, and then an error occurs. In some cases we can suffer data loss and have to start all over again. Saving the data in an intermediary layer such as AWS S3 or Google Cloud Storage lets the pipeline execute in steps, and also allows us to process the data later, use it in Data Science experiments, and so on.
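
To make the resilience argument concrete, here is a minimal sketch of page-by-page staging with resume. It is not the Medusa code: a local directory stands in for the bucket, and `stage_pages`, `make_fake_api`, and the `page_NNNNN.json` naming are hypothetical names chosen for illustration.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable, List, Optional


def stage_pages(
    fetch_page: Callable[[int], Optional[List[dict]]],
    stage_dir: Path,
) -> int:
    """Fetch pages one by one, writing each page to the staging directory.

    Pages already on disk are skipped, so a rerun after a failure resumes
    where the previous run stopped. Returns the pages fetched in this run.
    """
    stage_dir.mkdir(parents=True, exist_ok=True)
    page, fetched = 1, 0
    while True:
        target = stage_dir / f"page_{page:05d}.json"
        if not target.exists():
            records = fetch_page(page)
            if not records:  # an empty page means nothing is left
                break
            target.write_text(json.dumps(records))
            fetched += 1
        page += 1
    return fetched


def make_fake_api(fail_on: Optional[int] = None):
    """Hypothetical stand-in for a paginated API with three pages."""
    data = {1: [{"id": 1}], 2: [{"id": 2}], 3: [{"id": 3}]}

    def fetch(page: int) -> Optional[List[dict]]:
        if page == fail_on:
            raise ConnectionError("simulated network failure")
        return data.get(page)

    return fetch


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        stage = Path(tmp)
        try:
            stage_pages(make_fake_api(fail_on=3), stage)  # fails on page 3
        except ConnectionError:
            pass
        # The rerun only has to fetch the page that was missing.
        print(stage_pages(make_fake_api(), stage))
```

Because every page is persisted as soon as it arrives, the failed first run still leaves pages 1 and 2 in the staging directory, and the second run fetches only page 3.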
27+
28+
The application design is based on OOP and contains the following mechanisms:
29+
30+
* `*__App Deployment__*`
31+
* `*__Pipeline orchestration__*`
32+
* `*__Storage layer__*`
33+
* `*__Visualization layer__*`
34+
* `*__Infrastructure management__*`
35+
* `*__Data pipeline source code__*`
36+
37+
==== App deployment
38+
39+
The application is deployed using Docker, and the containers needed to run Airflow and its services are all described in a docker-compose file.
40+
41+
==== Pipeline Orchestration
42+
43+
The data ingestion and processing jobs can be triggered through the Airflow UI, which uses DAGs to manage all the working code. (DAG stands for Directed Acyclic Graph; DAGs are responsible for managing the tasks of the data pipeline.)
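
To illustrate what "directed acyclic graph" means for task ordering, here is a small sketch that uses only Python's standard `graphlib` module. It is not Airflow code, and the task names are hypothetical: each task lists the tasks that must finish before it may start, and a topological sort yields a valid run order.

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks that must finish before it can start.
# The task names are hypothetical and only illustrate the idea.
PIPELINE = {
    "download_raw_data": set(),
    "filter_data": {"download_raw_data"},
    "calculate_metrics": {"filter_data"},
    "load_to_warehouse": {"calculate_metrics"},
    "upload_logs": {"download_raw_data"},
}


def execution_order(dag: dict) -> list:
    """Return one valid run order.

    Raises graphlib.CycleError if the dependencies contain a cycle,
    which is exactly what the "acyclic" requirement rules out.
    """
    return list(TopologicalSorter(dag).static_order())


if __name__ == "__main__":
    print(execution_order(PIPELINE))
```

The order is not unique: `upload_logs` only depends on the download step, so it may run at any point after it. An orchestrator can use that freedom to run independent tasks in parallel.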
44+
45+
==== Storage Layers
46+
There are two of them in this project. One is the intermediary layer, which stores the raw data from the API calls, organized by the year/month/day of the request. The other is the data warehouse, a PostgreSQL database that stores the tables containing the processed information.
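
The year/month/day organization can be sketched as a function that builds the object key for a raw file. The exact layout (`source/endpoint/YYYY/MM/DD/data.json`) and the name `raw_object_key` are assumptions for illustration; the article only states that raw data is organized by the date of the request.

```python
from datetime import date


def raw_object_key(source: str, endpoint: str, requested_on: date) -> str:
    """Build a date-partitioned key for one raw API response.

    The path layout is a hypothetical example, not the project's schema.
    """
    return f"{source}/{endpoint}/{requested_on:%Y/%m/%d}/data.json"


if __name__ == "__main__":
    print(raw_object_key("github", "commits", date(2023, 1, 13)))
    # github/commits/2023/01/13/data.json
```

Zero-padded date parts keep the keys sortable as plain strings, which makes it easy to list or reprocess a single day, month, or year of raw data.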
47+
48+
==== Visualization layer
49+
The app chosen for visualizing the data in our data warehouse is https://superset.apache.org/[Apache Superset, window=_blank]. Considering it is free, Superset is an incredible tool. In my experience, it has most of the features found in the well-known, paid Power BI. In addition, Superset is ready for streaming needs and scales across a cluster.
50+
51+
==== Infrastructure management
52+
The infrastructure is deployed on Google Cloud, and the necessary resources are created and managed with https://www.terraform.io/[Terraform, window=_blank].
53+
54+
==== Data pipeline source code
55+
The code design is based on two core objects that interact with each other to ingest, process, and write data from different data sources to different destinations, all driven by JSON configuration files.
56+
57+
`*- Hook:*` Responsible for interfacing with external services, such as the GitHub API, cloud storage (GCP), and the data warehouse, and for holding their credentials and authentication methods.
58+
59+
`*- Operator:*` Responsible for calling the different methods that operate on the data and for triggering functions such as:
60+
61+
* `*__Download data__*`
62+
* `*__Filter data__*`
63+
* `*__Calculate metrics__*`
64+
* `*__Upload logs__*`
65+
66+
It also holds configurations, restrictions, and other information about the data object being ingested.
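
The division of labor between the two objects can be sketched as follows. This is an illustrative outline under stated assumptions, not the project's classes: `InMemoryHook` replaces real service hooks with a dictionary, and the config keys `resource` and `target` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class InMemoryHook:
    """Stand-in for a service hook.

    A real hook would wrap an external service (API, bucket, database)
    and hold its credentials; here a dictionary plays the service.
    """
    token: str
    store: Dict[str, List[dict]] = field(default_factory=dict)

    def read(self, resource: str) -> List[dict]:
        return list(self.store.get(resource, []))

    def write(self, resource: str, records: List[dict]) -> None:
        self.store[resource] = list(records)


@dataclass
class Operator:
    """Moves data between two hooks, driven by a configuration dict."""
    source: InMemoryHook
    destination: InMemoryHook
    config: dict

    def run(self) -> int:
        records = self.source.read(self.config["resource"])
        self.destination.write(self.config["target"], records)
        return len(records)


if __name__ == "__main__":
    api = InMemoryHook(token="hypothetical-token",
                       store={"commits": [{"sha": "a1"}, {"sha": "b2"}]})
    bucket = InMemoryHook(token="hypothetical-token")
    job = Operator(api, bucket, {"resource": "commits", "target": "raw/commits"})
    print(job.run())
```

Keeping connection details in the hooks and the "what to do" in the operator means the same operator can move data between any pair of services that expose `read` and `write`.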
67+
68+
For the specific case of GitHub, authentication is done using the account token. The account can be a User or an Organization, which calls for a flexible object, as Users and Organizations have different API calls to retrieve similar categories of data, as in the representation below:
69+
70+
[.img]
71+
image::imag1.png[]
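
As one concrete case of this difference, the GitHub REST API lists repositories under `/users/{username}/repos` for a user and under `/orgs/{org}/repos` for an organization. The sketch below builds the request without sending it; the `GitHubAccount` class and `build_request` helper are hypothetical names, and the token is a placeholder.

```python
from dataclasses import dataclass
from urllib.request import Request

API_ROOT = "https://api.github.com"


@dataclass(frozen=True)
class GitHubAccount:
    """Hypothetical account object covering both account types."""
    name: str
    kind: str  # "user" or "org"

    def repos_url(self) -> str:
        # Users and organizations expose the same category of data
        # under different URL prefixes.
        prefix = {"user": "users", "org": "orgs"}[self.kind]
        return f"{API_ROOT}/{prefix}/{self.name}/repos"


def build_request(account: GitHubAccount, token: str) -> Request:
    """Prepare (but do not send) an authenticated request."""
    return Request(
        account.repos_url(),
        headers={
            "Authorization": f"Bearer {token}",
            "Accept": "application/vnd.github+json",
        },
    )


if __name__ == "__main__":
    org = GitHubAccount(name="valor-software", kind="org")
    print(build_request(org, "placeholder-token").full_url)
```

With the account type captured in one object, the rest of the pipeline can ask for "the repositories of this account" without caring which endpoint is behind it.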
72+
73+
Once the data is requested from the API, based on the configuration file, it is stored in Google Cloud Storage.
74+
Once the data is properly downloaded to this intermediary layer, the Operator reads the configuration file to filter the relevant information from the raw data, opening the way to the next steps: data transformation and metrics calculation.
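
Config-driven filtering can be sketched like this. The JSON structure and the `keep_fields` key are assumptions made for the example; the real configuration files may look different.

```python
import json
from typing import List

# Hypothetical configuration: which fields to keep for one endpoint.
CONFIG_JSON = """
{
  "endpoint": "commits",
  "keep_fields": ["sha", "author", "date"]
}
"""


def filter_records(raw_records: List[dict], config: dict) -> List[dict]:
    """Keep only the configured fields of each raw record.

    Fields missing from a record are kept as None so that every
    filtered record has the same shape.
    """
    keep = config["keep_fields"]
    return [{key: record.get(key) for key in keep} for record in raw_records]


if __name__ == "__main__":
    config = json.loads(CONFIG_JSON)
    raw = [
        {"sha": "a1", "author": "ana", "date": "2023-01-13", "url": "ignored"},
        {"sha": "b2", "author": "bob"},
    ]
    print(filter_records(raw, config))
```

Since the field list lives in the configuration, supporting a new endpoint or dropping a field is a change to a JSON file rather than to the pipeline code.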
75+
76+
=== Conclusion
77+
78+
Creating data pipelines can become really complex if we do not pay attention to details such as writing generic functions and using configuration files, which let us reuse code and make the processing more flexible.
79+
80+
So, in the end, we have a data warehouse that makes data available to the Medusa App and to the dashboard tool. This way, managers, product managers, or POs can create their own views, test their hypotheses, or even find answers with the power of data.
81+
82+
View of the data pipeline architecture:
83+
[.img]
84+
image::imag2.jpg[]
Lines changed: 12 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,12 @@
1+
{
2+
"title": "Valor Software Medusa app and GitHub",
3+
"order": 58,
4+
"domains": ["devops_cloud"],
5+
"authorImg": "assets/articles/valor-software-medusa-app-and-github/Robson_Muller.jpeg",
6+
"language": "en",
7+
"bgImg": "assets/articles/valor-software-medusa-app-and-github/valor-software-medusa-app-and-github.png",
8+
"author": "Robson Müller",
9+
"position": "Data Engineer ",
10+
"date": "Wed Jan 13 2023 10:45:55 GMT+0000 (Coordinated Universal Time)",
11+
"seoDescription": "The challenge of analyzing developer productivity metrics"
12+
}