// Technical Writing

const blog = {

Insights on backend development, system architecture, performance optimization, and lessons learned building scalable applications.

August 6, 2025 10 min read

How I used Zero-Copy To Achieve Blazingly Fast File Transfers

Zero copy. Sounds fancy, right? But the idea is quite simple. Just as the name suggests, it moves data with (nearly) zero CPU overhead and zero detours through your application's memory, going directly from one kernel buffer to another to achieve those blazingly fast transfer speeds.

This is what file transfers usually look like in Go:

io.Copy(w, file)

Looks neat, looks clean, gets the job done. But wait, what is it actually doing?

Behind the scenes, io.Copy() is quietly shuttling data between kernel space and userspace. It uses read() and write() syscalls to pull data from the file cache into an application buffer and then push it from the application buffer to the socket buffer. That's two syscalls per chunk, multiple context switches, and extra work for the CPU. It hardly matters for small files, but when we're dealing with gigabytes of data, this slowly turns into a silent bottleneck.

Fig: The inefficiencies of io.Copy()

So, what do we do? I went searching for a better alternative, and it turns out Linux already has a syscall, sendfile(), that copies data directly from the file cache to the socket buffer inside kernel space.

Well, this solves the major problem, but I was flooded with internal monologue.

How do you implement it? Does it actually work? Are there any significant differences? Are there any tradeoffs?

The only way to find out was to get my hands dirty with some code, so here we go.

Before I talk about sendfile(), this is how I implemented the same thing using io.Copy():

func handleWithIOCopy(w http.ResponseWriter, _ *http.Request) {
    fmt.Println("Handling with io.Copy")

    file, err := os.Open(videoPath)
    if err != nil {
        http.Error(w, "File not found", http.StatusNotFound)
        fmt.Println("Error opening file:", err)
        return
    }
    defer file.Close()

    stat, err := file.Stat()
    if err != nil {
        http.Error(w, "Failed to stat file", http.StatusInternalServerError)
        return
    }

    size := stat.Size()
    w.Header().Set("Content-Type", "video/mp4")
    w.Header().Set("Content-Length", strconv.FormatInt(size, 10))

    io.Copy(w, file)
    fmt.Println("Video streamed with io.Copy")
}

Now, let’s dump the comfy io.Copy() abstraction and dive deeper into the world of raw syscalls. But wait, what are syscalls actually?

Syscalls (also known as system calls) are the gateways, or interfaces, through which a userspace application communicates with the kernel. They are how applications ask for things like reading from a file, writing to a socket, allocating memory, and more. Applications cannot access the kernel or hardware directly, so the only way to communicate is via syscalls.

Now here’s the catch: syscalls are slow.

Why?

  • Multiple mode switches for the CPU
  • Blocking behavior
  • Inefficiency at scale

More on syscalls in another blog. For now, let's look at how we achieve zero-copy using sendfile().

sendfile()

sendfile() is a syscall provided by Linux that cuts out that extra userspace layer entirely. No need to ping-pong data between kernel space and user space; all the transfers happen directly in kernel space without ever involving the application.

Fig: io.Copy() vs syscall.Sendfile()

Implementing sendfile() in Go

Now that we've discussed what sendfile() is and why it matters, I wanted to implement it in a Go server. But the standard Go HTTP stack doesn't expose sendfile() directly, so we need the syscall package. I'll refer to syscall.Sendfile() and sendfile() interchangeably in this article; both mean the same thing.

But syscall.Sendfile() works with file descriptors, and http.ResponseWriter does not expose them directly. To work around this, we need to hijack the underlying TCP connection.

Let us see how we can do this:

// Try to cast the ResponseWriter to http.Hijacker to get access to
// the underlying TCP connection
hijacker, ok := w.(http.Hijacker)
if !ok {
    http.Error(w, "Hijacking not supported", http.StatusInternalServerError)
    return
}

conn, _, err := hijacker.Hijack()
if err != nil {
    http.Error(w, "Failed to hijack connection", http.StatusInternalServerError)
    return
}
defer conn.Close()

Here, we have hijacked the connection, bypassing Go's response life cycle. We are now completely responsible for managing the connection, which includes writing headers manually, streaming the file ourselves, and properly closing the socket after the transfer.

But wait, hijacker.Hijack() gives us the generic net.Conn interface; to get at file descriptors, we need to assert it into a *net.TCPConn.

tcpConn, ok := conn.(*net.TCPConn)
if !ok {
    fmt.Println("Connection is not TCP")
    return
}
defer tcpConn.Close()

Now, sendfile() needs two file descriptors: one for the socket we're writing to and one for the file we're reading from. We grab the socket's fd from the TCP connection, then open the file and extract its fd:

outFile, err := tcpConn.File()
if err != nil {
    fmt.Println("Failed to get file from TCPConn:", err)
    return
}
defer outFile.Close()
outFD := int(outFile.Fd())

file, err := os.Open(videoPath)
if err != nil {
    fmt.Fprintf(conn, "HTTP/1.1 404 Not Found\r\n\r\n")
    fmt.Println("File open error:", err)
    return
}
defer file.Close()

The next step is to call file.Stat() to get information about the file and extract its size. This is important because sendfile() doesn't know how much data to send unless we explicitly tell it. We also set the headers manually using fmt.Fprintf, since Go no longer handles them for us.

stat, err := file.Stat()
if err != nil {
    fmt.Fprintf(conn, "HTTP/1.1 500 Internal Server Error\r\n\r\n")
    fmt.Println("Stat error:", err)
    return
}

size := stat.Size()
inFD := int(file.Fd())

// Write headers manually
fmt.Fprintf(conn, "HTTP/1.1 200 OK\r\n")
fmt.Fprintf(conn, "Content-Type: video/mp4\r\n")
fmt.Fprintf(conn, "Content-Length: %d\r\n", size)
fmt.Fprintf(conn, "Connection: close\r\n\r\n")

Now we finally send the file using syscall.Sendfile():

var offset int64 = 0
for offset < size {
    // Sendfile advances offset by the number of bytes sent,
    // so we loop until the whole file has gone out.
    sent, err := syscall.Sendfile(outFD, inFD, &offset, int(size-offset))
    if err != nil {
        fmt.Println("Sendfile error:", err)
        return
    }
    if sent == 0 {
        break
    }
}

We have successfully implemented zero-copy in Go using syscall.Sendfile().

What do the numbers say?

Now the actual question is: “Is it worth it?” We wouldn't want to do all these extra steps for negligible gains, right? So let's look at some of the benchmarks I ran locally.

Specifications used

  • CPU: AMD Ryzen 5 4600H (6 cores, 12 threads)
  • Max Clock Speed: 3.0 GHz
  • Architecture: x86_64
  • RAM: 16 GB DDR4
  • OS: Ubuntu 24.04.2 LTS
  • Virtualization: AMD-V
  • L3 Cache: 8 MiB

Simulation environment

CPU Usage

Fig: CPU usage comparison: io.Copy() vs sendfile()

As expected, CPU usage is significantly lower with sendfile() because it eliminates the need to move data back and forth between user space and kernel space. As seen in the graph, io.Copy() consistently consumed around 480–500% CPU (across multiple cores), while sendfile() remained steady between 110–130%. The difference comes from sendfile() operating entirely within the kernel, reducing context switches and memory-copying overhead. In larger-scale data operations this creates a massive difference, and using sendfile() helps increase the overall responsiveness of the system.

Memory Usage

Fig: Memory Usage Comparison: io.Copy vs sendfile()

The memory numbers suggest another reason to pick sendfile() over io.Copy(). There is a higher number of page faults initially with io.Copy(), indicating frequent memory access and user-kernel transactions, while sendfile() shows fewer and more stable page faults. Memory usage starts higher with sendfile() but drops quickly and stays consistent at around 13–14 MB for the entire run, whereas io.Copy() starts low but grows rapidly in comparison. This is extremely important in resource-constrained environments like embedded and edge devices, where every megabyte counts.

Latency

Fig: Latency comparison: io.Copy() vs sendfile()

The difference is just as clear for latency, where sendfile() boasts 2.6x lower latency than io.Copy(), making it superior for latency-sensitive systems.

Request throughput performance

Fig: Request throughput comparison: sendfile() vs io.Copy()

The gap shows up in request throughput as well: sendfile() handled 32 requests/s whereas io.Copy() managed only 19 requests/s. This gives sendfile() a clear advantage, especially in systems that need to handle high-concurrency, high-volume workloads such as media servers or content delivery platforms.

Transfer Rate

Fig: Transfer rate comparison: sendfile() vs io.Copy()

The difference is quite visible when it comes to data transfer rates where sendfile() was able to hit 9.54 GB/s while io.Copy() capped out at 5.86 GB/s. That’s around 1.6x better, and again, it’s all because sendfile() skips the userspace detour. No need to move bytes into your app just to throw them back into a socket. This direct kernel-to-kernel transfer ends up making a big difference especially if you're dealing with large files or streaming workloads where raw transfer speeds actually matter.

wrk raw output (for the geeks)

# For io.Copy()
Running 2m test @ http://localhost:8080/video?method=io
  6 threads and 30 connections
  Thread Stats   Avg       Stdev     Max    +/- Stdev
    Latency      1.48s   224.64ms    2.00s    65.26%
    Req/Sec      4.53      4.04     38.00     85.19%
  2326 requests in 2.00m, 704.39GB read
  Socket errors: connect 0, read 0, write 0, timeout 81
Requests/sec:  19.35
Transfer/sec:   5.86GB

# For sendfile()
Running 2m test @ http://localhost:8080/video?method=sendfile
  6 threads and 30 connections
  Thread Stats   Avg       Stdev     Max    +/- Stdev
    Latency    567.85ms  273.85ms    1.82s    59.98%
    Req/Sec      4.95      0.66     12.00     83.52%
  3821 requests in 2.01m, 1.12TB read
Requests/sec:  31.67
Transfer/sec:   9.54GB

When not to use sendfile()

After going through all this you might be like, “Screw userspace buffers, I'm using zero copy for everything.” I'm sorry, my friend, it has its own limitations, and you should not use sendfile() especially when:

  • You refuse to use Linux: I wouldn't say “you don't have Linux,” because Linux is openly available to everyone. But if you decide to use something else like macOS, sendfile() doesn't work the way it does on Linux and has been reported to be slower than io.Copy(). And for the Windows users, I'm sorry, I don't have any words for you.
  • You need to modify data before transmission: Since sendfile() skips userspace, you cannot do things like compression, encryption, etc.
  • You are not dealing with file descriptors: Hear me out, previously sendfile() was not allowed to take a socket as its input, but with recent Linux versions (5.12+) it can. I just listed this as trivia.
  • You are working with HTTPS: You cannot use sendfile() with HTTPS because you cannot sign or encrypt data in kernel space; TLS expects to handle encryption in userspace, which sendfile() bypasses entirely. BUT.. you can offload this to the kernel using kTLS. However, Go's official TLS library doesn't natively support kTLS (a proposal has been submitted and accepted, so we might see it in the future, but it still has a lot of issues and is too immature to roll out soon).

Conclusion

So yeah, zero-copy with sendfile() isn't just some theoretical optimization; it's a real, practical performance boost that can seriously level up your file transfer speeds. Sure, it has its limitations, but if you're serving large static files, running a media server, or just want to push bytes fast without melting your CPU, it's definitely worth getting your hands a little dirty.

Your CPU deserves a break. Let sendfile() take over.

Here’s the github repo with the implementation: https://github.com/probablysamir/zero-copy-go.git

Read on Medium
December 17, 2023 8 min read
web-development · scaling

Vertical vs Horizontal Scaling

Surprise! The side project you’ve been working on just exploded in popularity, leading to a traffic surge beyond imagination. Suddenly, it dawned on you: the humble server you initially chose is no longer up to the task.

The sudden increase in traffic

You have two options now: test the patience of your users, or adjust your product to the increasing traffic. If you chose the latter option, that's where you encounter scalability.

Scalability

Scalability is simply how well your servers can handle a surge in traffic. When there's a sudden increase in traffic, the system requires more resources like bandwidth, server capacity, and processing power. If the server is not scaled well, users may experience slow loading times, unresponsiveness, and even server crashes, leading to a poor user experience.

Users after encountering frequent server crashes

There again are two options: Getting a powerful server (vertical scaling) or increasing the number of servers (horizontal scaling). Now let’s dive deeper to get better insights.

Vertical Scaling

Vertical scaling, also known as scaling up, is accomplished by upgrading the hardware and/or network throughput of a single server or system to handle a higher volume of workload or demand. In short, you get one powerful machine to handle all your needs. So how can you scale up vertically?

Vertical scaling

Adding storage

The data on your server is continuously growing. You need to add sufficient storage to the server to hold the increasing data.

Upgrading storage

A traditional hard disk means slower read/write operations. Replacing hard disks with SSDs will significantly speed up storing and retrieving data from the disk.

Increasing RAM

Less RAM means less working memory and cache memory for the system. Increasing the RAM increases the amount of working memory and cache memory, making the server more efficient.

Upgrading NICs

The network throughput of your server is equally important, even more so if you are continuously streaming media from it. Upgrading or adding NICs will increase its overall network throughput.

Upgrading CPU

Sometimes the problem lies with the CPU itself. It's time to toss the decade-old CPUs and replace them with new ones. You can also add more processors or virtual cores (on cloud instances or VMs) to get a faster server.

Advantages of Vertical Scaling

Easier to implement

Vertical scaling is relatively easy because we do not need to redesign the code to implement it.

Cheaper Initial Cost

The initial cost of vertical scaling is relatively low because we can upgrade the server at an affordable cost.

Disadvantages of Vertical Scaling

Limit in scaling

While vertically scaling a system, there comes a point where we cannot upgrade it further because of hardware and software constraints.

Higher cost at later stages

There exists a point after which the cost of hardware upgrade is extremely high as compared to the initial upgrades.

Lack of fault tolerance

If we vertically scale a single machine to handle the whole workload, the entire service goes down if there is a problem with that one system.

Peak Demand Charges

We need to pay for the upgraded machine even when the workload is low, which can lead to unnecessary running costs.

Horizontal Scaling

Horizontal scaling, also known as scaling out, is a method of increasing system capacity by adding more machines or nodes to your network. It’s like running each component on a separate server and being able to add more servers whenever necessary. In short, we are using many small machines instead of one big one. So, let’s see how we can achieve this.

Horizontal scaling

Load Balancing

Load balancing is a technique used to distribute incoming traffic across different servers, making sure that no single server is overloaded and all the machines share the load. When a request comes in, the load balancer uses a suitable algorithm to decide where to send it. This helps the system scale horizontally, as we can add more servers and they will all share the load.

Load balancing working diagram
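As a toy illustration of the idea (the backend addresses here are made up), a round-robin picker, one of the simplest load-balancing algorithms, can be as small as this:

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// roundRobin hands out backends in rotation so each server
// receives an equal share of incoming requests.
type roundRobin struct {
	backends []string
	next     uint64
}

func (rr *roundRobin) pick() string {
	// atomic counter so concurrent requests don't race
	n := atomic.AddUint64(&rr.next, 1)
	return rr.backends[(n-1)%uint64(len(rr.backends))]
}

func main() {
	rr := &roundRobin{backends: []string{"10.0.0.1", "10.0.0.2", "10.0.0.3"}}
	for i := 0; i < 4; i++ {
		fmt.Println(rr.pick()) // cycles 1, 2, 3, then back to 1
	}
}
```

Real load balancers add health checks, weighting, and connection draining on top, but the distribution step is this simple at its core.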

Clustering

Clustering is a technique of grouping multiple servers to work collaboratively as a single system. The servers in a cluster are aware of one another. It may seem similar to plain load balancing, but in load balancing the servers work independently, while the servers in a cluster work together to achieve the objective. Clusters are also more redundant and scalable than plain load balancing.

Simple diagram of an active-active cluster

Replication

Replication is a technique of creating copies of a database or database nodes to provide better fault tolerance and availability. The copies are synchronized with each other to ensure consistency. Requests can go to different nodes, decreasing the load on any single node. Different types of databases use different replication techniques. In the master-slave replication used by most relational databases, the master handles write operations and the data is asynchronously propagated to the slaves, while read operations can be performed on any node. If the master node fails, one of the slaves is promoted to master.

Design of a working replication mechanism
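The routing rule described above can be sketched in a few lines (node names here are hypothetical): writes always go to the master, while reads rotate across replicas:

```go
package main

import "fmt"

// cluster routes writes to the primary (master) and spreads reads
// across replicas, mirroring the master-slave setup described above.
type cluster struct {
	primary  string
	replicas []string
	next     int
}

func (c *cluster) routeWrite() string { return c.primary }

func (c *cluster) routeRead() string {
	// simple rotation across replicas; fall back to the
	// primary if no replicas are configured
	if len(c.replicas) == 0 {
		return c.primary
	}
	node := c.replicas[c.next%len(c.replicas)]
	c.next++
	return node
}

func main() {
	c := &cluster{primary: "db-primary", replicas: []string{"db-r1", "db-r2"}}
	fmt.Println(c.routeWrite()) // db-primary
	fmt.Println(c.routeRead())  // db-r1
	fmt.Println(c.routeRead())  // db-r2
}
```

A production setup would also handle failover (promoting a replica when the primary dies) and replication lag, which this sketch leaves out.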

Sharding

Sharding is a technique of dividing or partitioning a large database, vertically or horizontally, into smaller pieces called shards. Unlike replication, each shard doesn't hold a copy of all the data but rather a subset of the large database. It helps improve performance, scalability, and parallelism by distributing the workload across multiple servers or nodes.

Sharding
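One common way to decide which shard a record belongs to is hash-based partitioning. A minimal sketch (key format and shard count are arbitrary): hashing the key means the same key always lands on the same shard, while different keys spread across the cluster:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// shardFor maps a key to one of n shards by hashing it. The mapping
// is deterministic: the same key always resolves to the same shard.
func shardFor(key string, n uint32) uint32 {
	h := fnv.New32a()
	h.Write([]byte(key)) // Hash.Write never returns an error
	return h.Sum32() % n
}

func main() {
	for _, key := range []string{"user:1", "user:2", "user:3"} {
		fmt.Printf("%s -> shard %d\n", key, shardFor(key, 4))
	}
}
```

Note that plain modulo sharding reshuffles most keys when the shard count changes; consistent hashing is the usual fix for that, but it's beyond this sketch.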

Microservices Architecture

Microservices architecture is a software architecture in which a complex application is divided into smaller independent services that can be developed, tested, and scaled independently. Each service, called a microservice, has its own role and communicates with other services through well-defined APIs. Since microservices are loosely coupled, we can deploy them separately, which helps with horizontal scalability.

Containerization and Orchestration

Containerization is the process of encapsulating applications and their dependencies into lightweight, portable units called containers. This ensures consistency across different environments and allows for rapid deployment. Orchestration is the process of automating the management, deployment, and scaling of these containers. When the workload increases, containers can be added to the system seamlessly. Together, the two make it possible to scale smoothly and handle varying workloads.

Content Delivery Network (CDN)

A CDN is a distributed network of servers placed at multiple locations that caches website content. It offloads a significant portion of traffic from the origin server by serving cached content directly. It also improves availability: if one edge server experiences issues, requests are automatically rerouted to alternative servers hosting copies of the same content.

Single server vs CDN

Advantages of Horizontal Scaling

Higher Scalability

You can scale a system horizontally many times further than you can with vertical scaling.

Redundancy

The system becomes more redundant and there’s a backup even if a server fails.

Cheaper Final Cost

Even if the initial cost is higher, the cost at the later stages is cheaper in comparison to vertical scaling.

No Peak Demand Charges

Since you can scale according to demand, there's no need to pay for peak capacity when it isn't needed.

Disadvantages of Horizontal Scaling

Complex implementation

It is harder to implement because it takes significant effort to change an application's architecture after the fact, so it is wise to plan for it before the application is even built.

Higher Initial Cost

The initial cost of horizontal scaling is higher than vertical scaling up to a certain point.

Complex management

Managing many servers is harder than managing a single one: there are more potential points of failure, and they require more effort to monitor and maintain.

Horizontal or vertical scaling. What should I use?

After delving into the depths of this extensive article, the bottom line is: What’s the ultimate choice for you? Let’s cut to the chase and discover the perfect fit!

There's no hard answer; it depends on your use case. First, we need to consider the running costs and manpower. If your service has a low-to-moderate workload and steady, predictable growth that can be accommodated within the limits of a single server, then it is advisable to go for vertical scaling.

Similarly, if your workload is high or unpredictable and requires high availability and fault tolerance then it is advisable to go for horizontal scaling.

If you are doing a large-scale deployment, it might be wiser to go for a hybrid approach. A hybrid approach gives you the best of both worlds: you can vertically scale specific components to improve their performance and horizontally scale when you need additional capacity.

Read on Medium

}; // End of Blog

Want to discuss these topics or collaborate?

> get_in_touch()

© 2026 Samir Kattel // Built with SvelteKit + TailwindCSS