November 01, 2024

Let PostgreSQL Generate Your CSV in Rails

A typical Rails CSV export looks something like this:

require "csv"

class ExportTicketsController < ApplicationController
  def index
    # Wrap the whole action in MemoryProfiler so we can log allocations below.
    report = MemoryProfiler.report do
      @tickets = Ticket.limit(100_000).includes(:location, :user)

      respond_to do |format|
        format.csv do
          csv_data = CSV.generate(headers: true) do |csv|
            csv << ["Ticket#", "Date", "Location", "User", "Ticket Type", "Latitude", "Longitude"]

            # find_each loads rows in batches, but each row still becomes a
            # full ActiveRecord object before it's written to the CSV.
            @tickets.find_each do |ticket|
              csv << [ticket.ref, ticket.date, ticket.location&.name, ticket.user&.name, ticket.ticket_type, ticket.latitude, ticket.longitude]
            end
          end

          send_data csv_data, filename: "ticket-#{Date.today}.csv", type: "text/csv"
        end
      end
    end
    #report.pretty_print(to: Rails.logger)
    Rails.logger.info "Total memory allocated: #{report.total_allocated} bytes"
    Rails.logger.info "Total memory retained: #{report.total_retained} bytes"
  end
end

This request exports only 100k records. It takes around 25s on my local machine (which is usually much faster than my Heroku Postgres instance), and it generates nearly 16 million object allocations:

Total memory allocated: 4855459 bytes
Total memory retained: 13197 bytes
Completed 200 OK in 25097ms (Views: 1.2ms | ActiveRecord: 1237.4ms | Allocations: 15971384)

OK, let's try an alternative. What if we use a single query, join the other tables, and select only what we need?

  def index
    report = MemoryProfiler.report do
      # One query: join locations/users and alias only the columns the CSV needs.
      @tickets = Ticket
        .limit(100_000)
        .joins(:location, :user)
        .select("tickets.id as id, ref as ticket_number, date as Date, locations.name as ticket_location_name, users.name as ticket_user_name, ticket_type, tickets.latitude, tickets.longitude")

      respond_to do |format|
        format.csv do
          csv_data = CSV.generate(headers: true) do |csv|
            csv << ["Ticket#", "Date", "Location", "User", "Ticket Type", "Latitude", "Longitude"]

            @tickets.each do |ticket|
              csv << [ticket.ticket_number, ticket.date, ticket.ticket_location_name, ticket.ticket_user_name, ticket.ticket_type, ticket.latitude, ticket.longitude]
            end
          end

          send_data csv_data, filename: "ticket-#{Date.today}.csv", type: "text/csv"
        end
      end
    end
    Rails.logger.info "Total memory allocated: #{report.total_allocated} bytes"
    Rails.logger.info "Total memory retained: #{report.total_retained} bytes"
  end

The query itself is fast, but the overall result is actually worse:

Ticket Load (79.1ms)  SELECT tickets.id as id, ref as ticket_number, date as Date, locations.name as ticket_location_name, users.name as ticket_user_name, ticket_type, tickets.latitude, tickets.longitude FROM "tickets" INNER JOIN "locations" ON "locations"."deleted_at" IS NULL AND "locations"."id" = "tickets"."location_id" INNER JOIN "users" ON "users"."id" = "tickets"."user_id" WHERE "tickets"."deleted_at" IS NULL LIMIT $1  [["LIMIT", 100000]]

Total memory allocated: 5445302 bytes
Total memory retained: 1902531 bytes
Completed 200 OK in 48807ms (Views: 1.3ms | ActiveRecord: 110.8ms | Allocations: 18700027)

 

Our customers expect to export data this large and much larger, and this is already pushing the limits of our Heroku dyno. Aside from moving the export into a background job and writing a file, what can we do?

(Note: some of the files we need to generate are 0.5-1 GB, so even if you're running a Standard-2X worker dyno, you can easily hit memory limits.)

I recently had to recover some data from a rollback database and used PostgreSQL's COPY command, which can write query results directly to CSV. It felt really snappy. I wondered: could I use it to export data straight from the database to the client?
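For context, COPY can wrap an arbitrary query and write the result as CSV to the server's STDOUT. The shape of the command looks roughly like this (the query here is just a made-up illustration):

COPY (SELECT id, ref, date FROM tickets) TO STDOUT WITH CSV HEADER

(From psql you'd normally reach for the client-side \copy variant; from Ruby, the pg gem lets us read that STDOUT stream directly, which is what we'll do below.)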

Let's try...

We'll take the exact same query.

Then we'll grab a raw_connection, use ActiveRecord's to_sql to generate the SQL, and wrap it in a COPY statement. copy_data(query) initiates the export, and get_copy_data reads the CSV output back one row at a time.
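In isolation, the pg gem side of this looks something like the following (a minimal sketch; the database name and query are made up for illustration):

conn = PG.connect(dbname: "myapp_development") # hypothetical local database

# COPY writes the CSV to STDOUT; get_copy_data returns one CSV line per
# call, and nil once the COPY is finished.
conn.copy_data("COPY (SELECT id, ref FROM tickets LIMIT 5) TO STDOUT WITH CSV HEADER") do
  while line = conn.get_copy_data
    print line
  end
end

Here it is wired into the controller: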

  def index
    report = MemoryProfiler.report do
      @tickets = Ticket
        .limit(100_000)
        .joins(:location, :user)
        .select("tickets.id as id, ref as ticket_number, date as Date, locations.name as ticket_location_name, users.name as ticket_user_name, ticket_type, tickets.latitude, tickets.longitude")

      # Drop down to the raw PG connection and wrap the generated SQL in a COPY statement.
      connection = ActiveRecord::Base.connection.raw_connection
      query = "COPY (#{@tickets.to_sql}) TO STDOUT WITH CSV HEADER"
      csv_data = ""

      respond_to do |format|
        format.csv do
          connection.copy_data(query) do
            while row = connection.get_copy_data
              csv_data << row
            end
          end

          send_data csv_data, filename: "ticket-#{Date.today}.csv", type: "text/csv"
        end
      end
    end
    Rails.logger.info "Total memory allocated: #{report.total_allocated} bytes"
    Rails.logger.info "Total memory retained: #{report.total_retained} bytes"
  end

Here's the whole request:

Started GET "/export-tickets.csv" for ::1 at 2024-11-01 11:28:38 -0400
Processing by ExportTicketsController#index as CSV
  Rendering text template
  Rendered text template (Duration: 0.1ms | Allocations: 37)
Sent data ticket-2024-11-01.csv (1.2ms)
Total memory allocated: 114804 bytes
Total memory retained: 3057 bytes
Completed 200 OK in 1823ms (Views: 1.0ms | ActiveRecord: 13.8ms | Allocations: 662510)

Let's compare:

                     Using Ruby      Using PostgreSQL
Time                 25s             1.8s
Memory Allocated     4.8 MB          0.11 MB
Memory Retained      0.0131 MB       0.003 MB
Allocations          15.9 million    0.662 million
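One caveat before we celebrate: this version still accumulates the entire CSV into a single Ruby string before send_data, so the 0.5-1 GB exports mentioned earlier would still pile up in memory. Since get_copy_data hands us the file one row at a time, we could stream rows to the client as they arrive instead. Here's a minimal sketch assuming ActionController::Live (the controller name and the abbreviated select are made up, and I haven't benchmarked this variant):

class StreamedTicketsController < ApplicationController
  include ActionController::Live

  def index
    response.headers["Content-Type"] = "text/csv"
    response.headers["Content-Disposition"] = "attachment; filename=\"ticket-#{Date.today}.csv\""

    tickets = Ticket.limit(100_000).joins(:location, :user).select("tickets.ref, tickets.date") # abbreviated for the sketch

    connection = ActiveRecord::Base.connection.raw_connection
    connection.copy_data("COPY (#{tickets.to_sql}) TO STDOUT WITH CSV HEADER") do
      # Each row goes straight to the response, so only one row is held in memory at a time.
      while row = connection.get_copy_data
        response.stream.write(row)
      end
    end
  ensure
    response.stream.close
  end
end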

Have you ever used this method before? I haven't seen it used anywhere.

UPDATE

John, Andy, and I have been discussing this idea here.

The question is: does this require any more work from the database? My layperson's answer is that the query still has to be performed in both scenarios; the database either returns the results as a string (the COPY command) or as objects (ActiveRecord), and the speed improvement comes from skipping the object creation. John asked: what happens if you just execute a raw query and then go directly to Ruby's CSV?

I did a video on this a while back. It's much faster than the ActiveRecord version and uses less memory, but it's still not as fast as the COPY command. Here are the results in this scenario:

  def index
    report = MemoryProfiler.report do
      @tickets = Ticket
        .limit(100_000)
        .joins(:location, :user)
        .select("tickets.id as id, ref as ticket_number, date as Date, locations.name as ticket_location_name, users.name as ticket_user_name, ticket_type, tickets.latitude, tickets.longitude")

      connection = ActiveRecord::Base.connection

      # execute returns a PG::Result; no ActiveRecord objects are instantiated.
      data = connection.execute(@tickets.to_sql)

      respond_to do |format|
        format.csv do
          csv_data = CSV.generate(headers: true) do |csv|
            csv << ["ID", "Ticket#", "Date", "Location", "User", "Ticket Type", "Latitude", "Longitude"]

            # PG::Result#values returns every row as a plain Ruby array.
            data.values.each do |row|
              csv << row
            end
          end

          send_data csv_data, filename: "ticket-#{Date.today}.csv", type: "text/csv"
        end
      end
    end
    Rails.logger.info "Total memory allocated: #{report.total_allocated} bytes"
    Rails.logger.info "Total memory retained: #{report.total_retained} bytes"
  end

And here are the results:

Total memory allocated: 1272766 bytes
Total memory retained: 3110 bytes
Completed 200 OK in 6861ms (Views: 1.0ms | ActiveRecord: 98.5ms | Allocations: 5149013)

                     Using Ruby AR    Using Raw Ruby    Using PostgreSQL
Time                 25s              6.8s              1.8s
Memory Allocated     4.8 MB           1.27 MB           0.11 MB
Memory Retained      0.0131 MB        0.003 MB          0.003 MB
Allocations          15.9 million     5.1 million       0.662 million

One main difference is that the raw-query method builds regular Ruby arrays rather than ActiveRecord objects, so I think they can be garbage collected, which is why the retained memory is so much smaller. The PostgreSQL method still accumulates a single Ruby string, so it tracks that the two retain about the same amount.