Improving open-uri
When working on the Shrine library for handling file uploads, in multiple places I needed to be able to download a file from a URL. If you know the Ruby standard library well, the solution might be obvious to you: open-uri.
require "open-uri"
result = open("http://example.com/image.jpg")
result #=> #<Tempfile:/var/folders/k7/6zx6dx6x7ys3rv3srh0nyfj00000gn/T/20160524-10403-xpdakz>
Open-uri was something I very much wanted to use. It ships with Ruby, so there are no external dependencies (it's built on Net::HTTP), and it has many benefits:
- downloads to a unique filesystem location (using Tempfile)
- supports HTTP/HTTPS/FTP links
- follows redirects
- memory efficient
- easy basic authentication
- easy proxy
However, considering that in my case the URL could come from user input, open-uri turned out to have many limitations and quirks:
- Using Kernel#open makes you vulnerable to remote code execution
- If the remote file is smaller than 10KB, open-uri actually returns a StringIO instead of a Tempfile
- URL’s file extension isn’t preserved in downloaded Tempfile
- You cannot limit maximum number of redirects
- You cannot limit maximum filesize
I’ve thought about alternatives: rest-client, curl or wget. However, rest-client was too heavy a dependency just for downloading, and I didn’t want to depend on external CLI tools. Also, none of them were able to properly limit the maximum filesize, which I found important in the context of Shrine.
So, realizing that I still wanted to use open-uri, I decided to make a wrapper around it that addresses these limitations. I want to guide you through my journey, fixing one issue at a time.
Improvements
Kernel#open
Ruby has a Kernel#open method, which given a file path acts as File.open, but given a string that starts with “|”, it interprets it as a shell command and returns an IO connected to the spawned subprocess:
open("| ls") # returns an IO connected to the `ls` shell command
Open-uri extends Kernel#open with the ability to accept URLs. However, if the URL is coming from user input, we should never pass it to Kernel#open, because different users have different ideas on what is a “URL”; someone might think that | rm -rf ~ is a nice looking URL.
A little known fact is that Kernel#open just delegates to URI::(HTTP|HTTPS|FTP)#open, and we can simply use that instead:
uri = URI.parse("http://example.com/image.jpg") #=> #<URI::HTTP>
uri.open #=> #<Tempfile:/var/folders/k7/6zx6dx6x7ys3rv3srh0nyfj00000gn/T/20160524-10403-xpdakz>
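To make that concrete, here is a minimal sketch of how untrusted input could be vetted before #open is ever called. The helper name parse_http_url and the decision to accept only HTTP(S) URLs are my assumptions for illustration; open-uri itself doesn't provide anything like this:

```ruby
require "uri"

# Hypothetical helper (not part of open-uri): parses untrusted input and
# returns a URI object only if it is an HTTP or HTTPS URL.
def parse_http_url(input)
  uri = URI.parse(input)
  # URI::HTTPS is a subclass of URI::HTTP, so this accepts both schemes and
  # rejects plain file paths (URI::Generic) and other schemes such as FTP.
  unless uri.is_a?(URI::HTTP)
    raise ArgumentError, "not an HTTP(S) URL: #{input.inspect}"
  end
  uri
rescue URI::InvalidURIError
  # Strings like "| rm -rf ~" are not parseable as a URI at all.
  raise ArgumentError, "not a valid URL: #{input.inspect}"
end
```

Because the result is always a URI::HTTP (or URI::HTTPS) instance, calling #open on it can only ever make an HTTP request; the input never reaches Kernel#open.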
StringIO
Strangely, if the remote file is smaller than 10KB, open-uri will actually return a StringIO instead of a Tempfile.
uri.open #=> #<StringIO>
In context of Shrine I wanted the returned IO to always be a file, for consistency and because it could later be given for processing. We can easily fix that:
io = uri.open
if io.is_a?(StringIO)
  downloaded = Tempfile.new
  # binwrite avoids any newline or encoding conversion of the raw bytes
  File.binwrite(downloaded.path, io.string)
else
  downloaded = io
end
downloaded # now always a Tempfile
File extension
Surprisingly, open-uri always creates a Tempfile without a file extension, even if the URL has one. In Shrine I wanted downloaded files (which will later be uploaded) to always have an extension if it’s known.
So let’s copy the downloaded IO to a new Tempfile which has a file extension, but use mv if we can so that we don’t pay any performance penalty (and so that the old file also gets deleted):
require "fileutils"

io = uri.open
# prefix is the filename without its extension, suffix is the extension
downloaded = Tempfile.new([File.basename(uri.path, ".*"), File.extname(uri.path)])
if io.is_a?(Tempfile)
  FileUtils.mv io.path, downloaded.path
else # StringIO
  File.binwrite(downloaded.path, io.string)
end
File.extname(downloaded.path) #=> ".jpg"
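If you'd rather not branch on the type of the returned IO, the same goal can be sketched with a single streaming copy. Note that this swaps the mv optimization for simplicity, so it always pays for a copy. The helper name copy_to_tempfile and the "download" prefix are my assumptions for illustration:

```ruby
require "tempfile"
require "stringio"

# Hypothetical helper: copies any readable IO (StringIO or Tempfile) into a
# fresh Tempfile whose name ends with the given extension, e.g. ".jpg".
def copy_to_tempfile(io, extension)
  # Passing an array makes Tempfile use [prefix, suffix] for the filename.
  tempfile = Tempfile.new(["download", extension], binmode: true)
  # Streams in chunks, so the whole body is never held in memory at once.
  IO.copy_stream(io, tempfile)
  tempfile.rewind # leave the file ready to be read from the start
  tempfile
end
```

Usage would be along the lines of copy_to_tempfile(uri.open, File.extname(uri.path)), after which the caller can close the original IO.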
Redirects
What’s good is that open-uri can automatically follow redirects. What’s bad is that we cannot limit the maximum number of redirects. This allows an attacker to supply a URL that redirects endlessly, and open-uri would keep making requests forever. To be fair, open-uri does detect redirect loops, but only when a URL repeats.
So we disable open-uri’s following of redirects, which now raises OpenURI::HTTPRedirect on redirects, allowing us to reimplement it:
tries = 3
begin
  uri.open(redirect: false)
rescue OpenURI::HTTPRedirect => redirect
  uri = redirect.uri # assigned from the "Location" response header
  retry if (tries -= 1) > 0
  raise
end
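Wrapped in a method, the same idea might look like the sketch below. The names open_following_redirects, TooManyRedirects and the opener: parameter are all hypothetical; I added opener: only so the loop can be exercised without touching the network, and by default it simply calls uri.open(redirect: false) as above:

```ruby
require "open-uri"

# Hypothetical error class for this sketch.
class TooManyRedirects < StandardError; end

# Follows at most max_redirects redirects, then gives up.
# opener is any callable that takes a URI and either returns an IO
# or raises OpenURI::HTTPRedirect.
def open_following_redirects(uri, max_redirects: 2,
                             opener: ->(u) { u.open(redirect: false) })
  redirects_left = max_redirects
  begin
    opener.call(uri)
  rescue OpenURI::HTTPRedirect => redirect
    raise TooManyRedirects, "too many redirects" if redirects_left.zero?
    redirects_left -= 1
    uri = redirect.uri # the URI taken from the "Location" response header
    retry
  end
end
```

Raising a dedicated error instead of re-raising OpenURI::HTTPRedirect is a design choice of this sketch: it lets callers tell "gave up following redirects" apart from other HTTP errors.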
Maximum filesize
Since the URL can sometimes come from user input, I wanted to give Shrine users the ability to limit the maximum filesize of the remote file. Specifically, I wanted the download to abort as soon as the “Content-Length” header reveals that the file will be too large. Luckily, open-uri has the :content_length_proc option, which calls the given proc as soon as open-uri reads “Content-Length”:
uri.open(
  content_length_proc: ->(size) { raise FileTooLarge if size > max_size },
)
However, an attacker could theoretically create an app which returns large files, but where the “Content-Length” response header is omitted on purpose (in which case :content_length_proc is called with nil, so we need to guard against that). Luckily, open-uri has got our back on this one too with :progress_proc, which calls the given proc whenever a chunk is downloaded, with the current size. That means we can add it as a fallback in case “Content-Length” is missing:
uri.open(
  content_length_proc: ->(size) { raise FileTooLarge if size && size > max_size },
  progress_proc: ->(size) { raise FileTooLarge if size > max_size },
)
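Since both procs perform the same check, they can share one lambda. Here is a small sketch that builds the two options from a limit; FileTooLarge is never defined in the snippets above, so I define it here as a plain StandardError subclass, and size_limit_options is a hypothetical helper name:

```ruby
# Hypothetical error class; the snippets above assume something like it exists.
class FileTooLarge < StandardError; end

# Builds the open-uri options that abort a download once it exceeds max_size
# (in bytes). The same check serves both callbacks:
# - content_length_proc gets the "Content-Length" value, or nil if absent
# - progress_proc gets the number of bytes downloaded so far
def size_limit_options(max_size)
  check = lambda do |size|
    if size && size > max_size
      raise FileTooLarge, "file is larger than #{max_size} bytes"
    end
  end
  { content_length_proc: check, progress_proc: check }
end

# Intended usage (makes a real HTTP request, so it is left commented out):
#   uri.open(**size_limit_options(20 * 1024 * 1024))
```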
User agent
It turns out that when we’re making requests to an application, but we don’t include a “User-Agent” header, some applications will start rejecting our requests after some time.
Open-uri doesn’t set an application-specific “User-Agent” for us, but allows us to easily add one, since it treats any option with a string key as a request header:
uri.open("User-Agent" => "MyApp/1.0")
Result
The result of this investigation is the Down gem, which incorporates all of these improvements, and more. You can use it like this:
require "down"
result = Down.download("http://example.com/image.jpg")
result #=> #<Tempfile:/var/folders/k7/6zx6dx6x7ys3rv3srh0nyfj00000gn/T/20160524-10403-xpdakz.jpg>
More advanced downloading could look something like this:
Down.download "http://example.com/image.jpg",
  max_size: 20*1024*1024, # 20 MB
  max_redirects: 5, # default is 2
  proxy: "http://proxy.com" # delegates to open-uri
Conclusion
I like that I was able to make a lightweight wrapper around open-uri, which already had most of the features that I wanted, but allowed me to complete the ones that I was missing. If you want to use open-uri, but without any of the mentioned quirks, consider using Down.