Let PostgreSQL Generate Your CSV in Rails
A typical Rails export to CSV looks something like this:
class ExportTicketsController < ApplicationController
  require "csv"

  def index
    report = MemoryProfiler.report do
      @tickets = Ticket.limit(100_000).includes(:location, :user)

      respond_to do |format|
        format.csv do
          csv_data = CSV.generate(headers: true) do |csv|
            csv << ["Ticket#", "Date", "Location", "User", "Ticket Type", "Latitude", "Longitude"]
            @tickets.find_each do |ticket|
              csv << [ticket.ref, ticket.date, ticket.location&.name, ticket.user&.name, ticket.ticket_type, ticket.latitude, ticket.longitude]
            end
          end

          send_data csv_data, filename: "ticket-#{Date.today}.csv", type: "text/csv"
        end
      end
    end

    # report.pretty_print(to: Rails.logger)
    Rails.logger.info "Total memory allocated: #{report.total_allocated} bytes"
    Rails.logger.info "Total memory retained: #{report.total_retained} bytes"
  end
end
This request exports only 100k records. It takes around 25s on my local machine (which is usually much faster than my Heroku Postgres instance), and it generates almost 16 million object allocations:
Total memory allocated: 4855459 bytes
Total memory retained: 13197 bytes
Completed 200 OK in 25097ms (Views: 1.2ms | ActiveRecord: 1237.4ms | Allocations: 15971384)
OK, let's try an alternative. What if we use just one query, join the other tables, and select only what we need?
def index
  report = MemoryProfiler.report do
    @tickets = Ticket
      .limit(100_000)
      .joins(:location, :user)
      .select("tickets.id as id, ref as ticket_number, date as Date, locations.name as ticket_location_name, users.name as ticket_user_name, ticket_type, tickets.latitude, tickets.longitude")

    respond_to do |format|
      format.csv do
        csv_data = CSV.generate(headers: true) do |csv|
          csv << ["Ticket#", "Date", "Location", "User", "Ticket Type", "Latitude", "Longitude"]
          @tickets.each do |ticket|
            csv << [ticket.ticket_number, ticket.date, ticket.ticket_location_name, ticket.ticket_user_name, ticket.ticket_type, ticket.latitude, ticket.longitude]
          end
        end

        send_data csv_data, filename: "ticket-#{Date.today}.csv", type: "text/csv"
      end
    end
  end

  Rails.logger.info "Total memory allocated: #{report.total_allocated} bytes"
  Rails.logger.info "Total memory retained: #{report.total_retained} bytes"
end
The query itself is fast, but the overall result is no better:
Ticket Load (79.1ms) SELECT tickets.id as id, ref as ticket_number, date as Date, locations.name as ticket_location_name, users.name as ticket_user_name, ticket_type, tickets.latitude, tickets.longitude FROM "tickets" INNER JOIN "locations" ON "locations"."deleted_at" IS NULL AND "locations"."id" = "tickets"."location_id" INNER JOIN "users" ON "users"."id" = "tickets"."user_id" WHERE "tickets"."deleted_at" IS NULL LIMIT $1 [["LIMIT", 100000]]
Total memory allocated: 5445302 bytes
Total memory retained: 1902531 bytes
Completed 200 OK in 48807ms (Views: 1.3ms | ActiveRecord: 110.8ms | Allocations: 18700027)
Our customers expect to be able to export data this large and much larger, and this is already reaching the limits of our Heroku dyno. Aside from moving the export to a background job and generating a file, what can we do?
(Note: some of the files we need to generate are 0.5-1GB, so even on a Standard-2X worker dyno you can easily hit memory limits.)
I recently had to recover some data from a rollback database and used the PostgreSQL COPY command, which can copy query results to CSV. It felt really snappy. I wondered: can I use that to export data directly from the database to the client?
Let's try...
We'll take the exact same query, grab a raw_connection, and build our query with the COPY command, using ActiveRecord to generate the inner SQL. copy_data(query) initiates the data export, and get_copy_data reads back chunks of the CSV data.
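Outside of Rails, those two calls look roughly like this. This is just a minimal sketch using the pg gem directly; the database name and the tickets query here are placeholders, not part of the app above:

require "pg"

# Hypothetical connection settings, for illustration only.
conn = PG.connect(dbname: "myapp_development")

csv_data = ""

# COPY (...) TO STDOUT WITH CSV HEADER asks PostgreSQL to format the result
# set as CSV and stream it back over the same connection.
conn.copy_data("COPY (SELECT id, ref FROM tickets LIMIT 10) TO STDOUT WITH CSV HEADER") do
  # get_copy_data returns one chunk (one CSV line) at a time, and nil when the COPY is finished.
  while chunk = conn.get_copy_data
    csv_data << chunk
  end
end

conn.close
puts csv_data

Wired into the controller action, it looks like this: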
def index
  report = MemoryProfiler.report do
    @tickets = Ticket
      .limit(100_000)
      .joins(:location, :user)
      .select("tickets.id as id, ref as ticket_number, date as Date, locations.name as ticket_location_name, users.name as ticket_user_name, ticket_type, tickets.latitude, tickets.longitude")

    connection = ActiveRecord::Base.connection.raw_connection
    query = "COPY (#{@tickets.to_sql}) TO STDOUT WITH CSV HEADER"
    csv_data = ""

    respond_to do |format|
      format.csv do
        connection.copy_data(query) do
          while row = connection.get_copy_data
            csv_data << row
          end
        end

        send_data csv_data, filename: "ticket-#{Date.today}.csv", type: "text/csv"
      end
    end
  end

  Rails.logger.info "Total memory allocated: #{report.total_allocated} bytes"
  Rails.logger.info "Total memory retained: #{report.total_retained} bytes"
end
Here's the whole request:
Started GET "/export-tickets.csv" for ::1 at 2024-11-01 11:28:38 -0400
Processing by ExportTicketsController#index as CSV
Rendering text template
Rendered text template (Duration: 0.1ms | Allocations: 37)
Sent data ticket-2024-11-01.csv (1.2ms)
Total memory allocated: 114804 bytes
Total memory retained: 3057 bytes
Completed 200 OK in 1823ms (Views: 1.0ms | ActiveRecord: 13.8ms | Allocations: 662510)
Let's compare:
|                  | Using Ruby   | Using PostgreSQL |
| ---------------- | ------------ | ---------------- |
| Time             | 25s          | 1.8s             |
| Memory Allocated | 4.8 MB       | 0.11 MB          |
| Memory Retained  | 0.0131 MB    | 0.003 MB         |
| Allocations      | 15.9 million | 0.662 million    |
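One caveat: even the COPY version still buffers the entire CSV into the csv_data string before send_data ships it, so a 0.5-1GB export still means a 0.5-1GB Ruby string. A rough sketch of going one step further and streaming each chunk to the client as PostgreSQL hands it back, assuming ActionController::Live (I haven't benchmarked this variant):

class ExportTicketsController < ApplicationController
  include ActionController::Live

  def index
    tickets = Ticket
      .limit(100_000)
      .joins(:location, :user)
      .select("tickets.id as id, ref as ticket_number, date as Date, locations.name as ticket_location_name, users.name as ticket_user_name, ticket_type, tickets.latitude, tickets.longitude")

    # Headers must be set before the first write to the stream.
    response.headers["Content-Type"] = "text/csv"
    response.headers["Content-Disposition"] = "attachment; filename=\"ticket-#{Date.today}.csv\""

    connection = ActiveRecord::Base.connection.raw_connection
    connection.copy_data("COPY (#{tickets.to_sql}) TO STDOUT WITH CSV HEADER") do
      # Write each chunk to the response as soon as PostgreSQL returns it,
      # so the full CSV never lives in a single Ruby string.
      while row = connection.get_copy_data
        response.stream.write(row)
      end
    end
  ensure
    response.stream.close
  end
end

This assumes a multi-threaded server like Puma, since ActionController::Live holds the thread open for the duration of the response.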
Have you ever used this method before? I haven't seen it.
John, Andy and I have been discussing this idea here.
The question is: does this require any more work from the database? My layperson answer is that the query still has to be performed in both scenarios, and the database either has to return the results as a string (the COPY command) or as objects (ActiveRecord). The speed improvement comes from skipping the object creation. John asked: what happens if you just execute a raw query and then go straight to Ruby's CSV?
I did a video on this a while back. It's much faster, but still not as fast as the COPY command, which also uses less memory. Here are the results in this scenario:
def index
  report = MemoryProfiler.report do
    @tickets = Ticket
      .limit(100_000)
      .joins(:location, :user)
      .select("tickets.id as id, ref as ticket_number, date as Date, locations.name as ticket_location_name, users.name as ticket_user_name, ticket_type, tickets.latitude, tickets.longitude")

    connection = ActiveRecord::Base.connection
    data = connection.execute(@tickets.to_sql)

    respond_to do |format|
      format.csv do
        csv_data = CSV.generate(headers: true) do |csv|
          csv << ["ID", "Ticket#", "Date", "Location", "User", "Ticket Type", "Latitude", "Longitude"]
          data.values.each do |row|
            csv << row
          end
        end

        send_data csv_data, filename: "ticket-#{Date.today}.csv", type: "text/csv"
      end
    end
  end

  Rails.logger.info "Total memory allocated: #{report.total_allocated} bytes"
  Rails.logger.info "Total memory retained: #{report.total_retained} bytes"
end
And here are the results:
Total memory allocated: 1272766 bytes
Total memory retained: 3110 bytes
Completed 200 OK in 6861ms (Views: 1.0ms | ActiveRecord: 98.5ms | Allocations: 5149013)
|                  | Using Ruby AR | Using Raw Ruby | Using PostgreSQL |
| ---------------- | ------------- | -------------- | ---------------- |
| Time             | 25s           | 6.8s           | 1.8s             |
| Memory Allocated | 4.8 MB        | 1.27 MB        | 0.11 MB          |
| Memory Retained  | 0.0131 MB     | 0.003 MB       | 0.003 MB         |
| Allocations      | 15.9 million  | 5.1 million    | 0.662 million    |
One main difference is that the raw-query method builds plain Ruby arrays rather than ActiveRecord objects, so I think they can be garbage collected, which is why the retained memory is so much smaller. The PostgreSQL method still accumulates into a Ruby string, which tracks with the two retaining about the same amount.
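If you want to verify what's actually being retained, MemoryProfiler can break the report down by gem, file, location, and class. Something like this (a sketch; the tmp/export_profile.txt path is just an example) writes the detailed breakdown to a file, since the output is long:

report = MemoryProfiler.report do
  # ... run the export as in the examples above ...
end

# pretty_print accepts any IO via the to: option (the same call that is
# commented out in the first example).
File.open(Rails.root.join("tmp", "export_profile.txt"), "w") do |file|
  report.pretty_print(to: file)
end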