No More Confusion of Upstream and Downstream
I often find myself confused by two words in the context of software development: "upstream" and "downstream". They bother me so much that I avoid using them in my own writing, and I have to pause whenever I see them. In this post, I'll show a simple rule that helps me remember the difference: downstream adds value to the output of upstream.
Downstream adds value to the output of upstream.
Let's take a break from software development for a moment and look at the oil industry. The sequence of steps typically involves oil drilling, oil refining, and fertilizer production. Two points are crystal clear. First, oil drilling is upstream and the fertilizer plant is downstream. Second, the fertilizer plant adds value to the crude oil. Thus, a simple rule emerges to distinguish upstream from downstream: downstream processes add value to the output of upstream processes. Now let's return to software and apply this rule to three examples:
- Service B collecting metrics generated by Service A: Is Service B upstream or downstream? Service B is downstream because it adds value (aggregation) to the output (metrics) of Service A.
- HTTP server querying records from the database: Is the HTTP server upstream or downstream? The HTTP server is downstream because it adds value (transforming database records into HTTP responses) to the output (database records) of the database server.
- Load balancer distributing requests across a group of servers: Is the load balancer upstream or downstream? The load balancer is downstream because it adds value (balancing) to the output (responses) of the servers. Notably, Nginx uses a directive named "upstream" to define groups of backend servers.
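To make the rule concrete, here is a minimal sketch that records the three examples above as data. The Relationship class and its field names are hypothetical, invented only for illustration; the point is that once you name whose output is being worked on and who adds value to it, the upstream and downstream labels follow mechanically.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Relationship:
    """Hypothetical record of one producer/value-adder pair."""
    producer: str      # the party whose output is being worked on
    value_adder: str   # the party that adds value to that output
    output: str        # what the producer emits
    value_added: str   # what the value adder contributes

    @property
    def upstream(self) -> str:
        # The rule: whoever produces the output is upstream.
        return self.producer

    @property
    def downstream(self) -> str:
        # The rule: whoever adds value to that output is downstream.
        return self.value_adder


# The three examples from the list above.
EXAMPLES = [
    Relationship("Service A", "Service B", "metrics", "aggregation"),
    Relationship("database", "HTTP server", "database records",
                 "transforming records into HTTP responses"),
    Relationship("backend servers", "load balancer", "responses", "balancing"),
]

if __name__ == "__main__":
    for r in EXAMPLES:
        print(f"{r.downstream} is downstream of {r.upstream}: "
              f"it adds {r.value_added} to the {r.output}.")
```

Note that nothing in the record says who calls whom or who starts first; the rule only needs the output and the value added to it.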
More examples of correct usage
Our logging pipeline is a critical service at Cloudflare. Any potential delays or missing data can cause downstream effects that may hinder or even prevent the resolving of customer facing incidents. - An overview of Cloudflare's logging pipeline
It is not easy to setup a true Anycasted network. It requires that you own your own hardware, build direct relationships with your upstream carriers, and tune your networking routes to ensure traffic doesn't "flap" between multiple locations. - A Brief Primer on Anycast
These metrics are very helpful while setting up monitoring and alerting systems to know immediately when an ETL pipeline is lagging behind its upstream source. - Setting Uber’s Transactional Data Lake in Motion with Incremental ETL Using Apache Hudi
Some confusing bad rules
Before arriving at the good rule that downstream adds value, I tried several rules that failed to provide consistent identification of upstream and downstream. Inconsistency breeds confusion, making these rules ineffective. Here are some examples.
Bad rule #1: Upstream calls downstream. This rule seems straightforward: if two services are related, one must call the other. However, this perspective oversimplifies the caller-callee relationship, especially in scenarios like the push-vs-pull model. For instance, if Service B collects metrics generated by Service A, the caller flips depending on whether Service A pushes metrics to Service B or Service B pulls metrics from Service A, so the rule gives two different answers for the same pair of services.
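The push-vs-pull problem can be sketched in a few lines. The classes and method names below (ServiceA, ServiceB, push_to, pull_from) are hypothetical stand-ins, not a real metrics API. The caller changes between the two modes, yet in both cases Service B aggregates the metrics produced by Service A, so "who adds value" stays stable while "who calls whom" does not.

```python
class ServiceA:
    """Hypothetical metrics producer."""

    def __init__(self, samples):
        self.samples = list(samples)

    def read_metrics(self):
        # Pull mode: the collector calls this method.
        return list(self.samples)

    def push_to(self, collector):
        # Push mode: Service A is the caller.
        collector.receive(self.samples)


class ServiceB:
    """Hypothetical metrics collector that adds value by aggregating."""

    def __init__(self):
        self.total = 0

    def receive(self, samples):
        # The value added: aggregation of raw samples into a total.
        self.total += sum(samples)

    def pull_from(self, source):
        # Pull mode: Service B is the caller.
        self.receive(source.read_metrics())


def run_push(samples):
    a, b = ServiceA(samples), ServiceB()
    a.push_to(b)       # A calls B
    return b.total


def run_pull(samples):
    a, b = ServiceA(samples), ServiceB()
    b.pull_from(a)     # B calls A
    return b.total


if __name__ == "__main__":
    # Same aggregate either way; only the call direction differs.
    print(run_push([1, 2, 3]), run_pull([1, 2, 3]))
```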
Bad rule #2: Upstream provides, downstream consumes. While this rule distinguishes the push-vs-pull example above clearly, it becomes ambiguous when the object being provided and consumed is unclear. For example, in the case of an HTTP server querying a database, determining which service provides and which consumes depends on the perspective. If the object is the query, the HTTP server provides and the database consumes; if the object is a database record, the database provides and the HTTP server consumes.
Bad rule #3: Upstream happens before downstream. This rule introduces a temporal aspect to upstream-downstream relationships, assuming that the upstream process always precedes the downstream. However, this perspective raises questions about when the timer starts. Does the HTTP server happen before the database server if we start timing when the HTTP server receives requests, or does the database server happen before the HTTP server if we start timing when it receives SQL queries?