Decoding Raw URL Path

This recipe demonstrates how to access the “raw” request path using non-standard (WSGI) or optional (ASGI) application server extensions. This is useful when, for instance, a URI field has been percent-encoded in order to distinguish between forward slashes inside the field’s value, and slashes used to separate fields. See also: Why is my URL with percent-encoded forward slashes (%2F) routed incorrectly?

WSGI

In the WSGI flavor of the framework, req.path is based on the PATH_INFO CGI variable, which is already presented percent-decoded. Some application servers expose the raw URL under another, non-standard, CGI variable name. Let us implement a middleware component that understands two such extensions, RAW_URI (Gunicorn, Werkzeug’s dev server) and REQUEST_URI (uWSGI, Waitress, Werkzeug’s dev server), and replaces req.path with a value extracted from the raw URL:

import falcon
import falcon.uri


class RawPathComponent:
    def process_request(self, req, resp):
        raw_uri = req.env.get('RAW_URI') or req.env.get('REQUEST_URI')

        # NOTE: Reconstruct the percent-encoded path from the raw URI.
        if raw_uri:
            req.path, _, _ = raw_uri.partition('?')


class URLResource:
    def on_get(self, req, resp, url):
        # NOTE: url here is potentially percent-encoded.
        url = falcon.uri.decode(url)

        resp.media = {'url': url}

    def on_get_status(self, req, resp, url):
        # NOTE: url here is potentially percent-encoded.
        url = falcon.uri.decode(url)

        resp.media = {'cached': True}


app = falcon.App(middleware=[RawPathComponent()])
app.add_route('/cache/{url}', URLResource())
app.add_route('/cache/{url}/status', URLResource(), suffix='status')

When the above app is run with a supported server such as Gunicorn or uWSGI, a GET /cache/http%3A%2F%2Ffalconframework.org request yields the following response:

{
    "url": "http://falconframework.org"
}

We can also check the status of this URI in our imaginary web caching system by accessing /cache/http%3A%2F%2Ffalconframework.org/status:

{
    "cached": true
}
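
To make the middleware's effect on routing more concrete, here is a minimal standalone sketch that mimics its logic using only the standard library. The helper name raw_path_from_environ, the sample environ dict, and the verbose=1 query string are hypothetical, and urllib.parse.unquote merely stands in for falcon.uri.decode:

```python
from urllib.parse import unquote


def raw_path_from_environ(environ):
    """Return the percent-encoded path from a WSGI environ, or None.

    Hypothetical helper mirroring RawPathComponent.process_request.
    """
    raw_uri = environ.get('RAW_URI') or environ.get('REQUEST_URI')
    if not raw_uri:
        return None
    # Drop the query string, if any; the path itself stays percent-encoded.
    path, _, _ = raw_uri.partition('?')
    return path


# Illustrative environ: PATH_INFO arrives already decoded, RAW_URI does not.
environ = {
    'PATH_INFO': '/cache/http://falconframework.org/status',
    'RAW_URI': '/cache/http%3A%2F%2Ffalconframework.org/status?verbose=1',
}

path = raw_path_from_environ(environ)

# Split on the literal slashes first, then decode each segment separately.
segments = path.split('/')
decoded = [unquote(segment) for segment in segments]

print(path)     # /cache/http%3A%2F%2Ffalconframework.org/status
print(decoded)  # ['', 'cache', 'http://falconframework.org', 'status']
```

Because the path is split into segments while still encoded, the url field survives as a single value and the trailing status segment remains available for matching the suffixed route.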

If we removed RawPathComponent() from the app’s middleware list, the request would be routed as /cache/http://falconframework.org, and no matching resource would be found:

{
    "title": "404 Not Found"
}

What is more, even if we could implement a flexible router that was capable of matching these complex URI patterns, the app would still not be able to distinguish between /cache/http%3A%2F%2Ffalconframework.org%2Fstatus and /cache/http%3A%2F%2Ffalconframework.org/status if both were presented only in the percent-decoded form.
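
The ambiguity described above can be checked with a few lines of standard-library Python. This is only an illustrative sketch: split_then_decode is a hypothetical helper, and urllib.parse.unquote is used in place of falcon.uri.decode to keep the example free of dependencies:

```python
from urllib.parse import unquote

# The slash before "status" is part of the field value in the first path,
# and a field separator in the second one.
embedded = '/cache/http%3A%2F%2Ffalconframework.org%2Fstatus'
suffixed = '/cache/http%3A%2F%2Ffalconframework.org/status'

# Decoding the whole path up front erases the difference.
same_after_decode = unquote(embedded) == unquote(suffixed)


def split_then_decode(raw_path):
    """Split on literal slashes first, then decode each field."""
    return [unquote(field) for field in raw_path.split('/') if field]


embedded_fields = split_then_decode(embedded)
suffixed_fields = split_then_decode(suffixed)

print(same_after_decode)  # True
print(embedded_fields)    # ['cache', 'http://falconframework.org/status']
print(suffixed_fields)    # ['cache', 'http://falconframework.org', 'status']
```

Only the raw, still-encoded form carries enough information to tell the two requests apart, which is why the middleware has to act before routing takes place.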

ASGI

The ASGI version of req.path uses the path key from the ASGI scope, where percent-encoded sequences are already decoded into characters just like in WSGI’s PATH_INFO. Similar to the WSGI snippet from the previous section, let us create a middleware component that replaces req.path with the value of raw_path (provided the latter is present in the ASGI HTTP scope):

import falcon.asgi
import falcon.uri


class RawPathComponent:
    async def process_request(self, req, resp):
        raw_path = req.scope.get('raw_path')

        # NOTE: Decode the raw path from the raw_path bytestring, disallowing
        #   non-ASCII characters, assuming they are correctly percent-encoded.
        if raw_path:
            req.path = raw_path.decode('ascii')


class URLResource:
    async def on_get(self, req, resp, url):
        # NOTE: url here is potentially percent-encoded.
        url = falcon.uri.decode(url)

        resp.media = {'url': url}

    async def on_get_status(self, req, resp, url):
        # NOTE: url here is potentially percent-encoded.
        url = falcon.uri.decode(url)

        resp.media = {'cached': True}


app = falcon.asgi.App(middleware=[RawPathComponent()])
app.add_route('/cache/{url}', URLResource())
app.add_route('/cache/{url}/status', URLResource(), suffix='status')

When the above snippet is run with uvicorn (which supports raw_path), the percent-encoded url field is handled correctly for a GET /cache/http%3A%2F%2Ffalconframework.org%2Fstatus request:

{
    "url": "http://falconframework.org/status"
}

As in the WSGI version, the app can no longer route the above request as intended once RawPathComponent() is removed:

{
    "title": "404 Not Found"
}
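
Since raw_path is optional and arrives as a bytestring, it may be worth seeing how the decoding step behaves on different scopes. The following standalone sketch uses plain dicts in place of real ASGI scopes. The helper path_from_scope is hypothetical, and falling back to the decoded path when raw_path contains non-ASCII bytes is an assumed design choice, not something the recipe above does (there, the decode error would simply propagate):

```python
def path_from_scope(scope):
    """Pick the path to route on from an ASGI HTTP scope (hypothetical).

    Prefers the percent-encoded raw_path and falls back to the already
    decoded path when raw_path is missing or is not pure ASCII.
    """
    raw_path = scope.get('raw_path')
    if raw_path:
        try:
            return raw_path.decode('ascii')
        except UnicodeDecodeError:
            # Assumption: treat a non-ASCII raw_path as unusable rather
            # than failing the request.
            pass
    return scope['path']


# Illustrative scopes; real ones carry many more keys.
with_raw = {
    'type': 'http',
    'path': '/cache/http://falconframework.org/status',
    'raw_path': b'/cache/http%3A%2F%2Ffalconframework.org%2Fstatus',
}
without_raw = {'type': 'http', 'path': '/cache/status'}
non_ascii = {'type': 'http', 'path': '/caf\u00e9', 'raw_path': b'/caf\xc3\xa9'}

print(path_from_scope(with_raw))     # /cache/http%3A%2F%2Ffalconframework.org%2Fstatus
print(path_from_scope(without_raw))  # /cache/status
print(path_from_scope(non_ascii))    # /café
```

Whether to fall back silently, as sketched here, or to reject such a request outright depends on how strictly the application wants to treat clients that send unencoded non-ASCII bytes.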