<p><em>One Cloud Please: The ramblings of Ian Mckay, a DevOps dude from Australia</em></p>
<h1>HTTPS Endpoints and more tricks with AWS Step Functions</h1>
<p><em>Ian Mckay · 2024-01-13</em></p>
<p><img src="/images/posts/https-endpoints-step-functions.jpg" alt="" /></p>
<p>AWS re:Invent 2023 is now behind us and one of my favourite announcements was the introduction of <a href="https://docs.aws.amazon.com/step-functions/latest/dg/connect-third-party-apis.html">HTTPS Endpoints</a> to AWS Step Functions. In this post, I explain the feature, test its limits and also show off some other tricks for data manipulation within your state machines.</p>
<p>For the impatient, <a href="https://github.com/iann0036/chess-dot-com-state-machine-sample/blob/main/template.yml">here</a> is the final result.</p>
<h2 id="https-endpoints-feature">HTTPS Endpoints feature</h2>
<p>HTTPS endpoints use <a href="https://docs.aws.amazon.com/eventbridge/latest/userguide/eb-api-destinations.html#eb-api-destination-connection">Amazon EventBridge API destination connections</a> to determine the authentication mechanism used. EventBridge in turn uses Secrets Manager to store the credentials that will be included to authenticate requests.</p>
<p>Then within the state machine, you reference this connection and specify your own URL and HTTP method. You can also optionally include your own query parameters, headers and/or request body.</p>
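<p>As a rough sketch in Amazon States Language (the state name, URL and connection ARN below are placeholders, not values from any real deployment), an HTTPS Endpoint task looks something like this:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{
  "Call Example API": {
    "Type": "Task",
    "Resource": "arn:aws:states:::http:invoke",
    "Parameters": {
      "ApiEndpoint": "https://api.example.com/v1/items",
      "Method": "GET",
      "Authentication": {
        "ConnectionArn": "arn:aws:events:us-east-1:123456789012:connection/my-connection/00000000-0000-0000-0000-000000000000"
      }
    },
    "End": true
  }
}
</code></pre></div></div>
<p>Query parameters, headers and a request body can be added via the <code class="language-plaintext highlighter-rouge">QueryParameters</code>, <code class="language-plaintext highlighter-rouge">Headers</code> and <code class="language-plaintext highlighter-rouge">RequestBody</code> parameters respectively.</p>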
<p>There are some limitations though. Firstly, there is a hard 60-second timeout for the totality of the request. Secondly, there are mandatory headers which Step Functions sets and which you cannot override. These are:</p>
<ul>
<li>Host (value: <em>hostname of the URL</em>)</li>
<li>User-Agent (value: <code class="language-plaintext highlighter-rouge">Amazon|StepFunctions|HttpInvoke|us-east-1</code>, where <code class="language-plaintext highlighter-rouge">us-east-1</code> is replaced by your region)</li>
<li>Range (value: <code class="language-plaintext highlighter-rouge">bytes=0-262144</code>)</li>
</ul>
<p>Note that the request will still fail if the response exceeds 256 KB even though the Range header is set. The presence of the header can also cause confusion, as some servers will respond with a <code class="language-plaintext highlighter-rouge">206 Partial Content</code> status code even when all data is returned, so be aware of that.</p>
<p>The client IP address differs for each request and appears to lie within the standard EC2 public IP ranges <a href="https://docs.aws.amazon.com/vpc/latest/userguide/aws-ip-ranges.html">published by AWS</a>. There is no capability to use Elastic IPs or other networking constructs within your account.</p>
<p>Your state machine IAM role will need to include actions that allow access to the connection and its associated secret, as well as the <code class="language-plaintext highlighter-rouge">states:InvokeHTTPEndpoint</code> action which has the optional conditionals of <code class="language-plaintext highlighter-rouge">states:HTTPEndpoint</code> and <code class="language-plaintext highlighter-rouge">states:HTTPMethod</code> to help scope down what endpoints and HTTP methods the state machine can call. I have included an example of a granular policy in the CloudFormation template at the end of this post.</p>
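<p>As a quick illustration before then (the state machine ARN and endpoint pattern here are placeholders), a scoped-down statement for the HTTP call itself might look like:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{
  "Effect": "Allow",
  "Action": "states:InvokeHTTPEndpoint",
  "Resource": "arn:aws:states:us-east-1:123456789012:stateMachine:MyStateMachine",
  "Condition": {
    "StringEquals": {
      "states:HTTPMethod": "GET"
    },
    "StringLike": {
      "states:HTTPEndpoint": "https://api.chess.com/pub/*"
    }
  }
}
</code></pre></div></div>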
<h2 id="gathering-the-data">Gathering the data</h2>
<p><img src="/images/posts/sfunc4.png" alt="" /></p>
<p>In order to demonstrate the capabilities of the new feature, I’ve chosen to consume the <a href="https://www.chess.com/news/view/published-data-api">Chess.com API</a>. This is a free and anonymous API which retrieves metadata about games and players on their platform.</p>
<p>I will retrieve a list of all <a href="https://www.chess.com/terms/grandmaster-chess">grandmasters</a>, their country of origin, and aggregate these details by country.</p>
<p>Because this is a public endpoint, there is no need for an Authorization or similar header when accessing it; however, EventBridge API destinations <em>require</em> the use of Basic authentication, OAuth or an API key header. One creative way of avoiding sending an unnecessary header is to create your connection using the API Key type but set the header name to one of the immutable headers, such as <code class="language-plaintext highlighter-rouge">User-Agent</code>.</p>
<p><img src="/images/posts/sfunc2.png" alt="" /></p>
<p>I created the step to gather the list of grandmasters by hitting the URL <code class="language-plaintext highlighter-rouge">https://api.chess.com/pub/titled/GM</code>. Because I am only interested in the content of the response body, I apply an OutputPath filter of <code class="language-plaintext highlighter-rouge">$.ResponseBody</code>. This provides me with the list of grandmaster usernames, but not their origin country or actual name. For that, we need to retrieve their details using additional individual HTTPS calls.</p>
<p>To do this efficiently, we use the <a href="https://docs.aws.amazon.com/step-functions/latest/dg/use-dist-map-orchestrate-large-scale-parallel-workloads.html">Distributed Map</a> type within Step Functions. To ensure we do not overload the Chess.com API, we limit the concurrency to 40. We also use a standard exponential backoff for the inner HTTPS call to allow for retries in the event of an occasional error.</p>
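<p>As a sketch (the connection ARN is a placeholder, and the retry values shown are reasonable defaults rather than the exact ones used; the full definition is in the template at the end of this post), the relevant parts of the map state look roughly like:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{
  "Get Grandmaster Details": {
    "Type": "Map",
    "ItemsPath": "$.players",
    "MaxConcurrency": 40,
    "ItemProcessor": {
      "ProcessorConfig": {
        "Mode": "DISTRIBUTED",
        "ExecutionType": "STANDARD"
      },
      "StartAt": "Get Player",
      "States": {
        "Get Player": {
          "Type": "Task",
          "Resource": "arn:aws:states:::http:invoke",
          "Parameters": {
            "ApiEndpoint.$": "States.Format('https://api.chess.com/pub/player/{}', $)",
            "Method": "GET",
            "Authentication": {
              "ConnectionArn": "arn:aws:events:us-east-1:123456789012:connection/chess-connection/00000000-0000-0000-0000-000000000000"
            }
          },
          "Retry": [
            {
              "ErrorEquals": ["States.ALL"],
              "IntervalSeconds": 2,
              "MaxAttempts": 3,
              "BackoffRate": 2.0
            }
          ],
          "End": true
        }
      }
    },
    "End": true
  }
}
</code></pre></div></div>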
<p>This brings us to a state where we have an array of the individual grandmaster details.</p>
<h2 id="aggregating-the-data">Aggregating the data</h2>
<p><img src="/images/posts/sfunc3.png" alt="" /></p>
<p>Aggregating data (using map-reduce style methods) within a state machine is not a native capability; however, with some clever usage it is possible.</p>
<p>To do this, we first need to ensure all fields are present in the individual grandmaster details. Unfortunately, the <code class="language-plaintext highlighter-rouge">name</code> field isn’t always present on these responses so to fix that we add the following <code class="language-plaintext highlighter-rouge">ResultSelector</code> to the HTTPS endpoint step within the distributed map:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{
"output.$": "States.JsonMerge(States.StringToJson('{\"name\":\"Unknown Player\"}'), $.ResponseBody, false)"
}
</code></pre></div></div>
<p>This takes the detail from the HTTP response and performs a JSON merge with the static object we defined with a default name. If the name is not present in the response, the default value is used.</p>
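<p>For example, given a hypothetical response body that omits the <code class="language-plaintext highlighter-rouge">name</code> field (values below are illustrative, and the arguments are shown as literal objects for readability):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>States.JsonMerge(
  {"name": "Unknown Player"},
  {"username": "hypothetical_gm", "country": "https://api.chess.com/pub/country/AU"},
  false
)
</code></pre></div></div>
<p>yields:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{"name": "Unknown Player", "username": "hypothetical_gm", "country": "https://api.chess.com/pub/country/AU"}
</code></pre></div></div>
<p>When the response does include a <code class="language-plaintext highlighter-rouge">name</code>, it overrides the default, as fields from the second object win in a shallow merge (the final <code class="language-plaintext highlighter-rouge">false</code> argument disables deep merging).</p>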
<p>Next, we format the resulting name in the way we would like it, as well as extract the 2-letter country code from the URL which looks like <code class="language-plaintext highlighter-rouge">https://api.chess.com/pub/country/US</code>. To do this, we use a Pass state. The Parameters of the Pass state are as follows:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{
"displayName.$": "States.Format('{} ({})', $.output.name, $.output.username)",
"country.$": "States.ArrayGetItem(States.StringSplit($.output.country, '/'), 4)"
}
</code></pre></div></div>
<p>Note that the array index used is 4 and not 5. This is because empty segments (here, the one between <code class="language-plaintext highlighter-rouge">https:</code> and the hostname) are discarded during the <code class="language-plaintext highlighter-rouge">States.StringSplit</code> operation.</p>
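<p>To make the indexing concrete:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>States.StringSplit('https://api.chess.com/pub/country/US', '/')
</code></pre></div></div>
<p>yields:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>["https:", "api.chess.com", "pub", "country", "US"]
</code></pre></div></div>
<p>leaving the country code at index 4.</p>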
<p>Using the output of the distributed map, we apply a new Pass state with the following parameters:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{
"original.$": "$",
"countries.$": "States.ArrayUnique($[*].country)",
"countriesCount.$": "States.ArrayLength(States.ArrayUnique($[*].country))",
"iterator": 0,
"output": {}
}
</code></pre></div></div>
<p>The <code class="language-plaintext highlighter-rouge">original</code> key contains the distributed map output, the <code class="language-plaintext highlighter-rouge">countries</code> key uses JSONPath and <code class="language-plaintext highlighter-rouge">States.ArrayUnique</code> to select the unique list of countries, the <code class="language-plaintext highlighter-rouge">countriesCount</code> key is the length of the countries, the <code class="language-plaintext highlighter-rouge">iterator</code> key is initialised at 0, and the <code class="language-plaintext highlighter-rouge">output</code> key is initialised with an empty map.</p>
<p>Then we enter a loop, which continues whilst the iterator is less than the length of the <code class="language-plaintext highlighter-rouge">countries</code> list. We use a Pass state to set the <code class="language-plaintext highlighter-rouge">country</code> key to the country at the <code class="language-plaintext highlighter-rouge">iterator</code> index of the <code class="language-plaintext highlighter-rouge">countries</code> list. We then use one more Pass state to increase the iterator with:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>States.MathAdd($.iterator, 1)
</code></pre></div></div>
<p>We also set the <code class="language-plaintext highlighter-rouge">output</code> key to the following (spaced for visibility):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>States.JsonMerge(
States.StringToJson(
States.Format(
'\{"{}":{}\}',
$.country,
States.JsonToString(
$.original[?(@.country == $.country)]['displayName']
)
)
),
$.output
, false)
</code></pre></div></div>
<p>The above performs the following transformations:</p>
<ol>
<li>Retrieve the list of all <code class="language-plaintext highlighter-rouge">displayName</code> strings within the <code class="language-plaintext highlighter-rouge">original</code> key, filtering where the <code class="language-plaintext highlighter-rouge">country</code> key is equal to the country within the <code class="language-plaintext highlighter-rouge">original</code> key entries which we previously created using JSONPath</li>
<li>Convert that list to a JSON string</li>
<li>Create a new JSON-compatible string where the key is the <code class="language-plaintext highlighter-rouge">country</code> and the value is the above string-encoded array of names</li>
<li>Convert the string to a JSON object</li>
<li>Merge that object with the <code class="language-plaintext highlighter-rouge">output</code> variable</li>
</ol>
<p>We’re basically adding the country code as a key of the <code class="language-plaintext highlighter-rouge">output</code> JSON object one at a time, then increasing the iterator to reference the next country in the list.</p>
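<p>The loop's exit condition can be expressed with a Choice state comparing the iterator against the count (state names here are illustrative):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{
  "Check Loop": {
    "Type": "Choice",
    "Choices": [
      {
        "Variable": "$.iterator",
        "NumericLessThanPath": "$.countriesCount",
        "Next": "Set Country"
      }
    ],
    "Default": "Done"
  }
}
</code></pre></div></div>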
<p>Once it has completed the loop, we are left with our final output.</p>
<p><img src="/images/posts/sfunc5.png" alt="" /></p>
<h2 id="finishing-up">Finishing up</h2>
<p>I have provided a CloudFormation template that contains the full state machine and associated connection <a href="https://github.com/iann0036/chess-dot-com-state-machine-sample/blob/main/template.yml">here</a>. Feel free to deploy this into your own AWS account and try it yourself.</p>
<p>The HTTPS Endpoints feature is a very useful addition to the Step Functions service that I believe will see huge uptake. I personally want to do more with Step Functions, as I believe many architectures can go beyond serverless and become “functionless” (i.e. no Lambda functions). I would, however, like to see more useful intrinsics become available in the service; as you can see from this post, developers are often pushing the limits of what is available. Consider this my <a href="https://twitter.com/search?q=%23awswishlist">#awswishlist</a> item.</p>
<p>A big thank you to <a href="https://twitter.com/__steele">Aidan Steele</a> for helping review this post. If you liked what I’ve written, or want to hear more on this topic, reach out to me on 𝕏 at <a href="https://twitter.com/iann0036">@iann0036</a>.</p>
<h1>Swiping right on the AWS WAF CAPTCHA challenge</h1>
<p><em>Ian Mckay · 2023-07-25</em></p>
<p><img src="/images/posts/captcha.jpg" alt="" /></p>
<p>In 2021, AWS WAF <a href="https://aws.amazon.com/about-aws/whats-new/2021/11/aws-waf-captcha-support/">introduced</a> a new CAPTCHA feature to help protect sites against bot traffic. The release had some <a href="https://twitter.com/iann0036/status/1457908922925256704">mixed</a> <a href="https://twitter.com/iann0036/status/1457911175094538248">reviews</a> but the idea was that it was an effective protection against programmatic solvers or “bots”.</p>
<p>In this post, I walk through my methodology for beating one of the CAPTCHA challenges presented programmatically. If you’d like to follow along, you can try the CAPTCHA challenges yourself <a href="https://efw47fpad9.execute-api.us-east-1.amazonaws.com/latest">here</a>.</p>
<h2 id="the-aws-waf-captcha-system">The AWS WAF CAPTCHA system</h2>
<p>The CAPTCHA <a href="https://docs.aws.amazon.com/waf/latest/developerguide/waf-captcha-and-challenge.html">feature</a> in AWS WAF is an optional action as a result of a match against customer-defined rules. It is intended to be an option to help bridge the difficult decision of a hard deny or hard allow when client heuristics may appear suspicious but not outright bot-like.</p>
<p>When triggered, the action prompts viewers of a website with interactive challenges designed to test that a human viewer is real and block bots seeking to crawl or disrupt human traffic. At launch, and to this day, there are two challenges available which I will call the “car maze” and “shape match” challenges.</p>
<p>I created a Twitter (𝕏?) thread about beating the car maze challenge when it was originally released which you can read here:</p>
<div style="max-width: 60%; padding-left: 0; padding-right: 0; margin-left: auto; margin-right: auto; margin-top: 15px;"><blockquote class="twitter-tweet"><p lang="en" dir="ltr">Had a bit of fun today with the WAF CAPTCHA thing. The car maze turned into a fun programming challenge! 1/ <a href="https://t.co/D6Rf4SZGy4">pic.twitter.com/D6Rf4SZGy4</a></p>— Ian Mckay (@iann0036) <a href="https://twitter.com/iann0036/status/1459770171581550593?ref_src=twsrc%5Etfw">November 14, 2021</a></blockquote><script async="" src="https://platform.twitter.com/widgets.js" charset="utf-8"></script></div>
<p>I will note that there have been some changes since writing the thread and discussing my findings with the AWS WAF service team that make the car maze challenge slightly more complex, though the same concepts still broadly apply.</p>
<p>Let’s go through the same process with the shape match challenge!</p>
<h2 id="shape-matching">Shape matching</h2>
<p>The shape match challenge features an image of 5 random 3D shapes lined up horizontally, which has been split into top and bottom halves with the bottom half offset and reordered. The interface gives you a slider which you can move to line up usually only one shape at a time, and gives you instructions as to which shape to match up and submit. The bottom section wraps around as you drag the slider.</p>
<p><img src="/images/posts/captcha1.png" alt="" /></p>
<p>The available shapes are: <code class="language-plaintext highlighter-rouge">ball</code>, <code class="language-plaintext highlighter-rouge">cone</code>, <code class="language-plaintext highlighter-rouge">cube</code>, <code class="language-plaintext highlighter-rouge">cylinder</code>, <code class="language-plaintext highlighter-rouge">donut</code>, <code class="language-plaintext highlighter-rouge">knot</code> and <code class="language-plaintext highlighter-rouge">pyramid</code>.</p>
<p>The challenge presents both halves of the shapes as a single JPEG image, always at a 320x160 resolution. Taking a similar approach to the car maze solve, I’m using an HTML canvas to inspect the image, extract pixel data and draw my own visualisations. For my first step, I sample the top-left pixel colour and eliminate these pixels from consideration. Because the challenge is a JPEG, some colour blending and artifacts are present, so in most of the steps below I check for colour closeness by ensuring the RGB channels are within a small boundary (in this case, no more than 7 away). The top and bottom 80 pixels of the Y-axis represent the top and bottom sections, respectively.</p>
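<p>The closeness check is simple to sketch in JavaScript (the function name is my own; pixels are assumed to be <code class="language-plaintext highlighter-rouge">[r, g, b]</code> arrays):</p>

```javascript
// Treat two RGB pixels as the "same" colour when every channel differs by
// no more than a small tolerance (7 here), absorbing JPEG artifacts.
function isCloseColour(a, b, tolerance = 7) {
  return Math.abs(a[0] - b[0]) <= tolerance &&
         Math.abs(a[1] - b[1]) <= tolerance &&
         Math.abs(a[2] - b[2]) <= tolerance;
}
```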
<p><img src="/images/posts/captcha2.png" alt="" /></p>
<p>I now want to identify the location and width of the shapes at the midline for the top and bottom sections. The shapes in the challenge always have a clear separation between them, so to find them I move left-to-right at just above and just below the midline (skipping the exact pixels on the midline, as JPEG artifacting can sometimes merge the pixels at y=79 and y=80). When I hit a non-background pixel, I mark the starting point of the shape. Once I hit a background pixel again, I record the stop point, giving me start and stop points on the X-axis.</p>
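<p>The scan amounts to collecting runs of non-background pixels along a row. A simplified sketch (the real code reads pixel data from the canvas rather than a plain array):</p>

```javascript
// Return [start, end) x-ranges of contiguous non-background pixels in a row.
// `row` is an array of pixels; `isBackground` decides if a pixel is background.
function findRuns(row, isBackground) {
  const runs = [];
  let start = null;
  for (let x = 0; x < row.length; x++) {
    if (!isBackground(row[x])) {
      if (start === null) start = x; // entering a shape
    } else if (start !== null) {
      runs.push([start, x]); // leaving a shape
      start = null;
    }
  }
  if (start !== null) runs.push([start, row.length]);
  return runs;
}
```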
<p>This gives me a set of values which intersect at the midline, however there are typically more values than the 5 shapes that are present. This is because shapes like the donut and knot intersect the midline at multiple points. To overcome this, we need to find any space in between where the shapes hit the midline where there isn’t a clear path to the relative extremes of the axis (i.e. where it is presumed to be in the center of the donut / knot). We take the middle of each of the clear spaces and start drawing a line towards the extreme of the axis, allowing a deviation to the left or right if clear space is present. Any line that does not reach the axis extreme is considered to be within the shapes, so these points are aggregated with regard to the shape boundary at the midline. This finally provides us with 5 positions and widths for both the top and bottom sections.</p>
<p><img src="/images/posts/captcha3.png" alt="" /></p>
<p>Because the donut always has two midline intersections of roughly equal width, we can mark it as a high-probability match straight away. Additionally, if we see a single shape with more than 2 midline intersections, we can safely assume it is the knot, as this is the only shape that does this. At this point, I can start drawing the resulting shapes on individual canvases and, during development, mark those whose identity is already assumed.</p>
<p><img src="/images/posts/captcha4.png" alt="" /></p>
<p>We can then use the widths of the top and bottom shape midline intersections and find roughly matching widths. This gives us strong candidates for matching top and bottom section shapes, allowing us to calculate the relative X-axis offset needed to create the shapes. Under good circumstances, we now have 5 completed shapes but no way of identifying at least 3 of them.</p>
<p>In order to discover more information about the potential shapes, we calculate more landmark points to gain additional heuristics on the shape type. These points are calculated by the following:</p>
<ul>
<li><strong>Point 1:</strong> From the extreme left side at the midline, move towards the Y-axis extreme</li>
<li><strong>Point 2:</strong> From the extreme right side at the midline, move towards the Y-axis extreme</li>
<li><strong>Point 3:</strong> From the X-axis center at the midline, move towards the Y-axis extreme - if blocked, deviate left if able</li>
<li><strong>Point 4:</strong> From the X-axis center at the midline, move towards the Y-axis extreme - if blocked, deviate right if able</li>
</ul>
<p>Here are the paths that discovery takes to find the landmark points:</p>
<p><img src="/images/posts/captcha5.png" alt="" /></p>
<p>A ball shape always has a short Y-axis travel for points 1 and 2 for both sections, as well as a short X-axis travel from the center of the midline for points 3 and 4. The Y-axis travel for points 3 and 4 are generally identical and have roughly the same value as the X-axis travel for points 1 and 2.</p>
<p>A cone or pyramid shape typically also has a short Y-axis travel for points 1 and 2 in the top section, but a large Y-axis travel for all points in the bottom section.</p>
<p>A cube or cylinder generally has roughly matching X-axis and Y-axis travel for the diametrically opposing points (point 1 in the top and point 2 in the bottom, and vice-versa).</p>
<p>Although it is challenging to decide between a cone/pyramid and cube/cylinder due to their shape similarities, there is one more trick we can use. Taking a path across the X-axis just below the midline, track the colours during movement. If the colour always gradually changes slightly, we can assume there is a gradient and the shape is a cone or cylinder. If there is exactly one or two colours, these represent the visible faces of a pyramid or cube.</p>
<p>We’ve now successfully identified each shape and their offsets.</p>
<p><img src="/images/posts/captcha6.png" alt="" /></p>
<h2 id="solving-the-challenge">Solving the challenge</h2>
<p>The challenge ultimately accepts an offset value as its answer, so without any UI interaction we could simply respond with a network request programmatically. However, I wanted to see the actual solution occur, so I looked into performing the sliding action itself.</p>
<p>I had never programmatically moved a slider before, and it turns out to be a surprisingly rare automation to achieve, but it is possible. I came across <a href="https://stackoverflow.com/a/61547444/546911">this StackOverflow answer</a> which showed I could create custom <code class="language-plaintext highlighter-rouge">mousedown</code>, <code class="language-plaintext highlighter-rouge">mousemove</code> and <code class="language-plaintext highlighter-rouge">mouseup</code> MouseEvents, which worked to drag the slider. Notably, some math was required to slide to the correct position, as the image width was 320 pixels, the slider would drag a maximum of 274 pixels, and the challenge solution endpoint accepted an answer between 0 and 255.</p>
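<p>Assuming the answer scales linearly with the pixel offset (my own simplification, not a confirmed detail of the endpoint), the conversion between the ranges is a one-liner:</p>

```javascript
// Hypothetical linear mapping from a pixel offset within the 320px-wide
// image to the 0-255 answer range accepted by the solution endpoint.
function offsetToAnswer(pixelOffset, imageWidth = 320, answerMax = 255) {
  return Math.round((pixelOffset / imageWidth) * answerMax);
}
```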
<p>Occasionally, identification would fail due to an edge case or similar, however this simply meant that a new challenge would load and the automation could try again immediately. There seems to be no lockout or escalation of difficulty.</p>
<h2 id="the-road-not-travelled">The road not travelled</h2>
<p>There were a few approaches I could have taken during the development of this solution, however I took what I thought was the simplest and easiest to understand solution. I did look into using the JavaScript version of OpenCV, which I could pretty easily use to find the contours of the shapes and I could have used this to assist with some edge case resolution.</p>
<p><img src="/images/posts/captcha7.png" alt="" /></p>
<p>Additionally, the audio-based accessibility CAPTCHA alternative still remains for those in the speech recognition space looking for a fun challenge.</p>
<h2 id="final-thoughts">Final thoughts</h2>
<p>The AWS WAF CAPTCHA remains an effective deterrent for all but the most determined of bot authors. I don’t envy the position the AWS WAF service team members are in. They are charged with creating a novel, interactive CAPTCHA challenge that has little cognitive load for users but remains challenging enough that it isn’t easily toppled by bots. I believe that if there were a constantly evolving rotation of new WAF challenge types, we would have an effective protection based purely on the bot authors’ ability to adapt. Sadly this hasn’t yet happened. Features like <a href="https://aws.amazon.com/waf/features/bot-control/">Bot Control</a> seem to be a far more effective way of dealing with bot traffic without generally affecting users, so I’d recommend that instead.</p>
<p>If you liked what I’ve written, or want to hear more on this topic, reach out to me on Twitter (or whatever it’s called now) at <a href="https://twitter.com/iann0036">@iann0036</a>.</p>
<h1>Cedar: Avoiding the cracks</h1>
<p><em>Ian Mckay · 2023-07-06</em></p>
<p><img src="/images/posts/cedar1.jpg" alt="" /></p>
<p>With the <a href="https://aws.amazon.com/about-aws/whats-new/2023/05/cedar-open-source-language-access-control/">open-source release</a> of the Cedar engine and the <a href="https://aws.amazon.com/about-aws/whats-new/2023/06/amazon-verified-permissions-generally-available/">general availability release</a> of Amazon Verified Permissions, more and more engineers are considering integrating Cedar into their own systems for authorization, but what do policy authors need to consider to avoid unexpected outcomes?</p>
<p>In this post, I’ll walk through my experience of where policy authoring can go wrong and the steps you can take to overcome these issues. This post will walk through some advanced evaluation scenarios, so if you’re new to the Cedar language I highly recommend you first read my introductory post on the topic, <a href="https://onecloudplease.com/blog/cedar-a-new-policy-language">Cedar: A new policy language</a>.</p>
<h2 id="non-unique-entity-identifiers">Non-unique entity identifiers</h2>
<p>Though I mentioned it in my previous post, it’s important to always use unique identifiers for entities to ensure they do not get re-used in the future. The risk is that a reliance starts to build on an entity identifier, the entity goes away at some point in time, and then a new entity with the same name comes into existence later.</p>
<p>For example, consider the following statement:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>permit(
principal == User::"John",
action,
resource == Account::"Corporate"
);
</code></pre></div></div>
<p>If the user named John leaves the company, and then another John joins the company and happens to take the same entity identifier, it’s possible for the new John to inherit some privileges he should not be entitled to. The <a href="https://cedarland.blog/design/why-no-entity-wildcards/content.html">Cedarland blog</a> has some more detail on the reasoning behind this.</p>
<h3 id="solutions">Solutions</h3>
<p>Always use unique identifiers, such as the identifiers your IdP uses, to uniquely identify principals. Additionally, use resource identifiers which are also unique for the context provided. Comments and annotations can help you keep track of identifiers where necessary.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>permit(
principal == User::"9a6afab1-5a37-4c90-aa40-24277b93ca28", // John Smith
action,
resource == Account::"710f18bc-b8ab-4313-b362-8e6264cfcf91" // Corporate Account
);
</code></pre></div></div>
<h2 id="invalid-statements">Invalid statements</h2>
<p>Invalid statements not being evaluated is, in my opinion, one of the easiest ways to get an unexpected result from your policy evaluations. Consider the following policy:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>permit(
principal,
action == Action::"Connect",
resource
);
forbid(
principal,
action == Action::"Connect",
resource == Endpoint::"AdminEndpoint"
) unless {
context.viaAdminNetwork == true
};
</code></pre></div></div>
<p>The intention behind the policy is to allow connections to all endpoints except the admin endpoint unless the context object has the <code class="language-plaintext highlighter-rouge">viaAdminNetwork</code> key set to true. Unfortunately, the implementation of the context object in this example is that the <code class="language-plaintext highlighter-rouge">viaAdminNetwork</code> key is omitted, not <code class="language-plaintext highlighter-rouge">false</code>, if the call does not come from the admin network.</p>
<p>The result of this is that the forbid statement is not processed as there is an evaluation error due to the missing key. However, as the permit statement has been evaluated, and there are no other valid forbid statements, the result is an allow of the call. Even though the evaluated result is allow, there will be errors in the diagnostic return, as you can see from this Cedar playground screenshot:</p>
<p><img src="/images/posts/cedar2.png" alt="" /></p>
<p>There is more discussion on the reasoning for this behaviour over at the <a href="https://cedarland.blog/design/why-ignore-errors/content.html#what-about-forbid-statements">Cedarland blog</a>.</p>
<h3 id="solutions-1">Solutions</h3>
<p>Cedar has a validation engine that uses a schema to define the properties of entities within your system. This allows Cedar to warn you during the authoring phase when policies may not be valid. It is a best practice that you always construct a schema for your system.</p>
<p>The following schema would allow a developer to catch the unsafe usage of the attribute:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{
"": {
"entityTypes": {
"Endpoint": {
"shape": {
"type": "Record",
"attributes": {}
}
}
},
"actions": {
"Connect": {
"appliesTo": {
"resourceTypes": ["Endpoint"],
"context": {
"type": "Record",
"attributes": {
"viaAdminNetwork": { "type": "Boolean", "required": false }
}
}
}
}
}
}
}
</code></pre></div></div>
<p>Where possible, the inputs provided by the context object should be predictable. The developer may consider always setting the <code class="language-plaintext highlighter-rouge">viaAdminNetwork</code> key in order to simplify evaluation.</p>
<p>Alternatively, we can also modify the policy to test for the presence of the key itself, as shown:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>permit(
principal,
action,
resource
);
forbid(
principal,
action,
resource
) unless {
context has "viaAdminNetwork" && context.viaAdminNetwork == true
};
</code></pre></div></div>
<p>Developers might also consider overriding an allow result if any evaluation errors are present in the evaluation response, if that outcome is more desirable.</p>
<h2 id="dangers-of-short-circuiting">Dangers of short-circuiting</h2>
<p>Short-circuiting is a performance feature of the Cedar language which allows it to skip evaluation of specific expressions that should not affect the result of the policy evaluation. It is present under the following conditions:</p>
<ul>
<li><code class="language-plaintext highlighter-rouge">expression1 && expression2</code>: expression2 is not evaluated when expression1 is false</li>
<li><code class="language-plaintext highlighter-rouge">expression1 || expression2</code>: expression2 is not evaluated when expression1 is true</li>
<li><code class="language-plaintext highlighter-rouge">if expression1 then expression2 else expression3</code>: expression2 is not evaluated when expression1 is false and expression3 is not evaluated when expression1 is true</li>
</ul>
<p>This is typically a good thing, however it will <em>not</em> produce an error due to an invalid expression unless it actually evaluates that expression. For example, consider the below policy:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>permit (
principal,
action == Action::"login",
resource
)
when { context.isPrimarySite == true || principal.isBreakGlasEntity == true };
</code></pre></div></div>
<p>Note that this policy has the typo <code class="language-plaintext highlighter-rouge">isBreakGlasEntity</code>, which is missing an ‘s’. The intention behind the policy is that the login action is permitted only when accessing from the primary site under normal conditions, or if the principal is a special “break glass” entity under any conditions. This policy works under normal conditions, but due to the typo will error and not permit the break glass entity when they are most needed.</p>
<h3 id="solutions-2">Solutions</h3>
<p>A Cedar schema should again be used to determine the valid entity attributes during the entity modelling process and warn of inconsistencies during the policy authoring phase.</p>
<p>The following Cedar schema should be used to help find the typo during the authoring time of the policy:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{
"": {
"entityTypes": {
"User": {
"shape": {
"type": "Record",
"attributes": {
"isBreakGlassEntity": { "type": "Boolean", "required": true }
}
}
}
},
"actions": {
"login": {
"appliesTo": {
"principalTypes": [ "User" ],
"context": {
"type": "Record",
"attributes": {
"isPrimarySite": { "type": "Boolean", "required": true }
}
}
}
}
}
}
}
</code></pre></div></div>
<p>In addition to schema validation, it is also important to perform positive and negative testing against your policies (in a local or non-production environment) to ensure the policies will act in the way you expect for critical paths.</p>
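<p>A lightweight way to apply that advice is a table-driven test over the critical paths. The sketch below uses a reference implementation of the <em>corrected</em> policy in plain Python; in a real project the function would instead call your Cedar engine binding or the Amazon Verified Permissions API (the function name and case table here are illustrative assumptions):</p>

```python
def is_authorized(context, principal):
    """Reference implementation of the corrected break-glass policy.
    In practice, replace this body with a call to the real authorization engine."""
    if context.get("isPrimarySite") or principal.get("isBreakGlassEntity"):
        return "ALLOW"
    return "DENY"

# Positive AND negative cases, including the break-glass critical path.
cases = [
    ({"isPrimarySite": True},  {"isBreakGlassEntity": False}, "ALLOW"),
    ({"isPrimarySite": False}, {"isBreakGlassEntity": True},  "ALLOW"),
    ({"isPrimarySite": False}, {"isBreakGlassEntity": False}, "DENY"),
]

for context, principal, expected in cases:
    actual = is_authorized(context, principal)
    assert actual == expected, f"{context} / {principal}: got {actual}"
print("all critical-path cases passed")
```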
<h2 id="ambiguous-entity-type">Ambiguous entity type</h2>
<p>When writing condition clauses which interact with an entity store, be aware that entities don’t have an inherent type associated with them. Consider the following entity store:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[
{
"uid": "User::\"alice\"",
"attrs": {
"active": true
}
},
{
"uid": "Action::\"redeemValidTicket\""
},
{
"uid": "Ticket::\"someTicketID\"",
"attrs": {
"active": false
}
}
]
</code></pre></div></div>
<p>and the policy:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>permit (
principal,
action == Action::"redeemValidTicket",
resource
)
when { resource.active == true };
</code></pre></div></div>
<p>The intention behind this is to allow ticketholders to redeem active tickets. The implementing developer allowed the full resource entity ID (<code class="language-plaintext highlighter-rouge">"Ticket::\"someTicketID\""</code>) to be passed in as the resource input. Alice can’t redeem the <code class="language-plaintext highlighter-rouge">"Ticket::\"someTicketID\""</code> resource as it is marked as not active; however, Alice can perform a successful redemption by passing the resource entity ID <code class="language-plaintext highlighter-rouge">"User::\"alice\""</code>. Even though her user’s active attribute was never intended for that purpose, it can nonetheless lead to an unexpected allow.</p>
<h3 id="solutions-3">Solutions</h3>
<p>The developer could enforce that the “Ticket::” prefix is used (or perform the concatenation themselves).</p>
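<p>As a sketch of that first solution, the application can construct the entity ID itself from a raw identifier, so a caller can never smuggle in a differently-typed entity (the helper below is illustrative, not part of any Cedar API):</p>

```python
def make_ticket_entity_id(raw_id: str) -> str:
    # Reject characters that could break out of the quoted identifier,
    # then perform the Ticket:: concatenation server-side.
    if '"' in raw_id or "\\" in raw_id:
        raise ValueError("invalid ticket identifier")
    return f'Ticket::"{raw_id}"'

print(make_ticket_entity_id("someTicketID"))      # Ticket::"someTicketID"

# An attacker-supplied full entity ID no longer selects a different entity type.
try:
    make_ticket_entity_id('User::"alice"')
except ValueError:
    print("rejected")                             # rejected
```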
<p>The entity store could be modified to provide a unique attribute that the policy could match on using the <code class="language-plaintext highlighter-rouge">has</code> operator (<code class="language-plaintext highlighter-rouge">resource has "ticketIssueDate"</code>).</p>
<p>The entity store could also be modified to place tickets in a new entity type “TicketGroup” using the parents construct and enforce via policy that the resource is within this group (<code class="language-plaintext highlighter-rouge">resource in TicketGroup::"IssuedTickets"</code>).</p>
<p>Additionally, there is also a <a href="https://github.com/cedar-policy/rfcs/blob/feature/khieta/is-operator/text/0005-is-operator.md">pending RFC</a> that is discussing introducing an <code class="language-plaintext highlighter-rouge">is</code> operator to perform entity matching.</p>
<h2 id="unexpected-order-of-operations">Unexpected order of operations</h2>
<p>Like other languages, Cedar has a de-facto order of operations due to the way the <a href="https://docs.cedarpolicy.com/syntax-grammar.html">grammar</a> is constructed. This means that mathematical operations work as you would expect:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>permit (
principal,
action,
resource
)
when { 1 + 2 * 3 + 4 * 5 == 27 }; // always true
</code></pre></div></div>
<p>It’s important to read and understand the grammar before constructing complex and ambiguous policies to avoid unintended effects. Consider the below policy:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>permit (
principal,
action,
resource
)
when {
if resource.owner == principal then true else false &&
resource.isRestricted == false
};
</code></pre></div></div>
<p>The intention behind the policy is to allow access when the principal is the resource owner and the resource is not restricted; however, the actual effect of the policy is that a principal who is the resource owner is permitted access even when the resource is marked as restricted.</p>
<p>This is because an <code class="language-plaintext highlighter-rouge">if-then-else</code> expression has lower precedence than the <code class="language-plaintext highlighter-rouge">&&</code> operator, so the else branch greedily extends to cover the remainder of the condition. The evaluation of the above condition is therefore intrinsically like so:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>if (resource.owner == principal) then (true) else (false && resource.isRestricted == false)
</code></pre></div></div>
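<p>The difference is easy to see by evaluating both groupings directly. This Python sketch (with explicit parentheses, since Python’s own conditional-expression precedence differs from Cedar’s) models an owner accessing a restricted resource:</p>

```python
owner_matches = True   # the principal owns the resource
is_restricted = True   # ...but the resource is restricted

# How the condition actually parses: the && ends up inside the else branch.
actual = True if owner_matches else (False and is_restricted == False)

# What the author intended: the if/then/else result &&'d with the restriction check.
intended = (True if owner_matches else False) and is_restricted == False

print(actual)    # True  -- the owner gets access despite the restriction
print(intended)  # False -- access is correctly denied
```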
<h3 id="solutions-4">Solutions</h3>
<p>Read the <a href="https://docs.cedarpolicy.com/syntax-grammar.html">grammar</a> when in doubt of the order of operations.</p>
<p>If you are ever in doubt, or simply want to be more explicit, use parentheses to explicitly show the intended grouping of operations:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>permit (
principal,
action,
resource
)
when {
(if resource.owner == principal then true else false) &&
resource.isRestricted == false
};
</code></pre></div></div>
<h2 id="side-channels">Side channels</h2>
<p>Issues can often arise from the specific implementation that surrounds the use of Cedar, whether via Amazon Verified Permissions or a direct engine implementation. The engine can only evaluate against the inputs you have provided, and if those inputs are unsanitized or invalid, it can lead to a compromise.</p>
<p>Late last year, the popular json5 library released a <a href="https://github.com/json5/json5/security/advisories/GHSA-9c47-m6qq-7p4h">security advisory</a> regarding the potential for prototype pollution. If you were to allow a user to specify their own context object, but override certain keys which were used in sensitive operations, an attacker could use this vulnerability to manipulate the inputs the Cedar engine receives.</p>
<div class="language-javascript highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// userInput = '{"foo": "bar", "__proto__": {"isAdmin": true}}'</span>
<span class="kd">const</span> <span class="nx">ctx</span> <span class="o">=</span> <span class="nx">JSON5</span><span class="p">.</span><span class="nx">parse</span><span class="p">(</span><span class="nx">userInput</span><span class="p">);</span>
<span class="k">if</span> <span class="p">(</span><span class="nx">secCheckKeysSet</span><span class="p">(</span><span class="nx">ctx</span><span class="p">,</span> <span class="p">[</span><span class="dl">'</span><span class="s1">isAdmin</span><span class="dl">'</span><span class="p">,</span> <span class="dl">'</span><span class="s1">isMod</span><span class="dl">'</span><span class="p">]))</span> <span class="p">{</span>
<span class="k">throw</span> <span class="k">new</span> <span class="nb">Error</span><span class="p">(</span><span class="dl">'</span><span class="s1">Forbidden...</span><span class="dl">'</span><span class="p">);</span>
<span class="p">}</span>
<span class="k">return</span> <span class="nx">avpclient</span><span class="p">.</span><span class="nx">isAuthorized</span><span class="p">({</span>
<span class="dl">'</span><span class="s1">context</span><span class="dl">'</span><span class="p">:</span> <span class="nx">ctx</span><span class="p">,</span>
<span class="p">...</span>
<span class="p">});</span>
</code></pre></div></div>
<h3 id="solutions-5">Solutions</h3>
<p>As always, a healthy supply-chain security program is recommended for organizations who make heavy use of external libraries. Input sanitization is also an important step to ensure that the engine can make appropriate authorization decisions.</p>
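<p>One concrete mitigation is to build the context object with an allow-list rather than passing parsed user input through directly. A Python sketch of the idea (the reserved key names are assumptions carried over from the example above):</p>

```python
import json

RESERVED_KEYS = {"isAdmin", "isMod"}

def build_context(user_input: str) -> dict:
    """Parse untrusted input into a flat context object for the engine."""
    parsed = json.loads(user_input)
    if not isinstance(parsed, dict):
        raise ValueError("context must be a JSON object")
    if RESERVED_KEYS & parsed.keys():
        raise ValueError("forbidden context key")
    # Allow-list scalar values only; nested objects (the vector used in
    # prototype-pollution style attacks) are silently dropped.
    return {k: v for k, v in parsed.items() if isinstance(v, (str, int, float, bool))}

print(build_context('{"foo": "bar", "__proto__": {"isAdmin": true}}'))  # {'foo': 'bar'}
```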
<p>As more and more built-in integrations become available, take advantage of these to shift more of the burden outside of your responsibility and avoid side-channel issues.</p>
<h2 id="wrapping-up">Wrapping up</h2>
<p>As new language bindings, AWS integrations, external integrations, and even <a href="https://github.com/cedar-policy/rfcs/pulls">changes to the Cedar language itself</a> continue to be produced, the overall community and ecosystem is growing. The scenarios above highlight the importance of a solid understanding of the language, but also solutions to help you overcome these hurdles and scale your authorization logic faster than would otherwise be possible.</p>
<p>If you liked what I’ve written, or want to hear more on this topic, reach out to me on Twitter at <a href="https://twitter.com/iann0036">@iann0036</a>. You can also join the discussion over at the official <a href="https://communityinviter.com/apps/cedar-policy/cedar-policy-language">Cedar Slack workspace</a>.</p>Ian MckayExploring Amazon VPC Lattice2023-04-01T00:00:00+00:002023-04-01T00:00:00+00:00https://onecloudplease.com/blog/exploring-amazon-vpc-lattice<p><img src="/images/posts/crispix.jpg" alt="" /></p>
<p><small><em>(yes, that is a picture of my <a href="https://www.kelloggs.com.au/en_AU/products/crispix-product.html">breakfast</a>)</em></small></p>
<p>Today, AWS has <a href="https://aws.amazon.com/blogs/aws/simplify-service-to-service-connectivity-security-and-monitoring-with-amazon-vpc-lattice-now-generally-available/">released</a> Amazon VPC Lattice to General Availability. This post walks through creating a simple VPC Lattice service using CloudFormation, and takes a look at the service overall.</p>
<p>VPC Lattice was my <a href="https://twitter.com/iann0036/status/1599318778709704711">#1 favourite announcement</a> of AWS re:Invent 2022, so I’m excited to see it released today. As of the time of writing, it’s available in US East (Ohio), US East (N. Virginia), US West (Oregon), Asia Pacific (Singapore), Asia Pacific (Sydney), Asia Pacific (Tokyo), and Europe (Ireland).</p>
<h2 id="how-it-works">How it works</h2>
<p>VPC Lattice is a service that enables you to connect clients to services within a VPC. It is very similar to AWS PrivateLink (also known as private VPC Endpoints), but with a key difference.</p>
<p>Whilst PrivateLink works by placing Elastic Network Interfaces within your subnet, which your clients can hit to tunnel network traffic through to the destination service, VPC Lattice works by exposing endpoints as link-local addresses. <a href="https://en.wikipedia.org/wiki/Link-local_address">Link-local addresses</a> are (generally) only accessible by software that runs on the client instance itself.</p>
<p>AWS has carved out the range <code class="language-plaintext highlighter-rouge">169.254.171.0/24</code> for VPC Lattice’s use, typically routing directly to <code class="language-plaintext highlighter-rouge">169.254.171.0</code> (there’s also an IPv6 equivalent). This is not the first network that AWS exposes via link-local addresses. You may know of:</p>
<ul>
<li>EC2’s Instance Metadata Service, which is located at <code class="language-plaintext highlighter-rouge">169.254.169.254</code></li>
<li>Route 53’s DNS Resolver, which is located at <code class="language-plaintext highlighter-rouge">169.254.169.253</code></li>
<li>ECS’s Task Metadata Endpoint, which is located at <code class="language-plaintext highlighter-rouge">169.254.170.2</code></li>
<li>Amazon Time Sync Service (NTP), which is located at <code class="language-plaintext highlighter-rouge">169.254.169.123</code></li>
</ul>
<p>Generally, these endpoints are automatically available to clients within the VPC network without any special routing or security rules. VPC Lattice differs from this slightly, as it requires Security Groups and NACLs to allow traffic to and from the VPC Lattice data plane at <code class="language-plaintext highlighter-rouge">169.254.171.0/24</code> on whichever port the destination service exposes. I was pretty surprised by this requirement when I saw it as it’s the first link-local address to need this, but it does give network administrators some basic control. Generally, it’s advised to use a <a href="https://docs.aws.amazon.com/vpc-lattice/latest/ug/security-groups.html#managed-prefix-list">managed prefix list</a> instead of the exact range above, as it’s subject to change.</p>
<p>The targets which VPC Lattice connects to closely match those of load balancer target groups, including EC2 instances, VPC IP addresses (both IPv4 and IPv6), Lambda functions, and ALBs. An EKS-specific target type is in private beta as of the time of writing.</p>
<h2 id="a-walkthrough">A walkthrough</h2>
<p><img src="/images/posts/vpclattice.drawio.png" alt="" /></p>
<p>For this walkthrough, we’ll discuss the various components needed for a VPC Lattice setup. For simplicity, we’ll be creating a Lambda function as a client (initiates a HTTPS request), and another Lambda function as a server (responds to the HTTPS request). If you want to skip ahead, here’s the <a href="https://github.com/iann0036/vpc-lattice-demo/blob/main/template.yaml">completed template</a>.</p>
<p>Let’s begin by creating a basic VPC. The VPC will have two private subnets, but we won’t add any direct routing between them. For simplicity, we’ll also skip adding Network ACLs.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Resources:
# Basic VPC
VPC:
Type: AWS::EC2::VPC
Properties:
CidrBlock: 10.0.0.0/16
EnableDnsHostnames: true
EnableDnsSupport: true
PrivateSubnet1:
Type: AWS::EC2::Subnet
Properties:
CidrBlock: 10.0.0.0/24
MapPublicIpOnLaunch: false
VpcId: !Ref VPC
Tags:
- Key: Name
Value: Private Subnet (Source Subnet)
AvailabilityZone: !Select
- 0
- Fn::GetAZs: !Ref AWS::Region
PrivateSubnet2:
Type: AWS::EC2::Subnet
Properties:
CidrBlock: 10.0.1.0/24
MapPublicIpOnLaunch: false
VpcId: !Ref VPC
Tags:
- Key: Name
Value: Private Subnet (Destination Subnet)
AvailabilityZone: !Select
- 1
- Fn::GetAZs: !Ref AWS::Region
RouteTablePrivate1:
Type: AWS::EC2::RouteTable
Properties:
VpcId: !Ref VPC
Tags:
- Key: Name
Value: Private Route Table (Source Subnet)
RouteTablePrivate1Association1:
Type: AWS::EC2::SubnetRouteTableAssociation
Properties:
RouteTableId: !Ref RouteTablePrivate1
SubnetId: !Ref PrivateSubnet1
RouteTablePrivate2:
Type: AWS::EC2::RouteTable
Properties:
VpcId: !Ref VPC
Tags:
- Key: Name
Value: Private Route Table (Destination Subnet)
RouteTablePrivate2Association1:
Type: AWS::EC2::SubnetRouteTableAssociation
Properties:
RouteTableId: !Ref RouteTablePrivate2
SubnetId: !Ref PrivateSubnet2
</code></pre></div></div>
<p>Next, we’ll create the service itself. The service will be a Lambda function which returns a basic successful response to any request, including its own event payload in the response body. The function will sit within the second private subnet of the VPC, and its security group will only have a single inbound rule allowing traffic from the VPC Lattice service on the port on which it serves.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> # Inbound Lambda (Service)
InboundLambdaFunctionRole:
Type: AWS::IAM::Role
Properties:
AssumeRolePolicyDocument:
Statement:
- Action: sts:AssumeRole
Effect: Allow
Principal:
Service: lambda.amazonaws.com
Policies:
- PolicyName: root
PolicyDocument:
Statement:
- Effect: Allow
Action:
- logs:CreateLogGroup
- logs:CreateLogStream
- logs:PutLogEvents
- xray:PutTraceSegments
- xray:PutTelemetryRecords
- ec2:CreateNetworkInterface
- ec2:DescribeNetworkInterfaces
- ec2:DeleteNetworkInterface
Resource: '*'
InboundLambdaFunction:
Type: AWS::Lambda::Function
Properties:
Handler: index.handler
Role: !GetAtt InboundLambdaFunctionRole.Arn
TracingConfig:
Mode: Active
Runtime: python3.9
Timeout: 10
Code:
ZipFile: |
import os
import json
import http.client
def handler(event, context):
print(event)
return {
"statusCode": 200,
"body": json.dumps({
"success": "true",
"capturedEvent": event
}),
"headers": {
"Content-Type": "application/json"
}
}
VpcConfig:
SecurityGroupIds:
- !Ref InboundLambdaFunctionSecurityGroup
SubnetIds:
- !Ref PrivateSubnet2
InboundLambdaFunctionSecurityGroup:
Type: AWS::EC2::SecurityGroup
Properties:
GroupDescription: Security group for InboundLambdaFunction
VpcId: !Ref VPC
SecurityGroupEgress: []
SecurityGroupIngress:
- IpProtocol: tcp
FromPort: 443
ToPort: 443
CidrIp: 169.254.171.0/24 # should be the prefix list instead, this'll work though
GroupName: demo-inboundsg
</code></pre></div></div>
<p>Next up, we’ll create the components of the VPC Lattice service itself. This includes:</p>
<ul>
<li>The service network</li>
<li>A security group which controls which clients may access the service network</li>
<li>The service we are creating</li>
<li>A listener for the service (HTTPS on port 443)</li>
<li>A target group for the listener to point to, with an initial target of the previously created Lambda function</li>
</ul>
<p>To keep things simple, we’re not adding an auth policy for the service network or the service itself.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> # VPC Lattice
VPCLatticeServiceNetwork:
Type: AWS::VpcLattice::ServiceNetwork
Properties:
Name: demo-servicenetwork
AuthType: NONE
VPCLatticeServiceNetworkSecurityGroup:
Type: AWS::EC2::SecurityGroup
Properties:
GroupDescription: Security group for service network access
VpcId: !Ref VPC
SecurityGroupEgress: []
SecurityGroupIngress:
- IpProtocol: tcp
FromPort: 443
ToPort: 443
CidrIp: !GetAtt VPC.CidrBlock
GroupName: demo-servicenetworksg
VPCLatticeServiceNetworkVPCAssociation:
Type: AWS::VpcLattice::ServiceNetworkVpcAssociation
Properties:
SecurityGroupIds:
- !Ref VPCLatticeServiceNetworkSecurityGroup
ServiceNetworkIdentifier: !Ref VPCLatticeServiceNetwork
VpcIdentifier: !Ref VPC
VPCLatticeService:
Type: AWS::VpcLattice::Service
Properties:
Name: demo-service
AuthType: NONE
VPCLatticeServiceNetworkServiceAssociation:
Type: AWS::VpcLattice::ServiceNetworkServiceAssociation
Properties:
ServiceNetworkIdentifier: !Ref VPCLatticeServiceNetwork
ServiceIdentifier: !Ref VPCLatticeService
VPCLatticeListener:
Type: AWS::VpcLattice::Listener
Properties:
Name: demo-listener
Port: 443
Protocol: HTTPS
ServiceIdentifier: !Ref VPCLatticeService
DefaultAction:
Forward:
TargetGroups:
- TargetGroupIdentifier: !Ref VPCLatticeTargetGroup
Weight: 100
VPCLatticeTargetGroup:
Type: AWS::VpcLattice::TargetGroup
Properties:
Name: demo-targetgroup
Type: LAMBDA
Targets:
- Id: !GetAtt InboundLambdaFunction.Arn
</code></pre></div></div>
<p>It’s important to note that by associating the service network with the VPC, routes are created within the VPC’s route tables to correctly send traffic destined for <code class="language-plaintext highlighter-rouge">169.254.171.0/24</code> to the VPC Lattice service.</p>
<p><img src="/images/posts/vpclattice-1.png" alt="" /></p>
<p>The target group also automatically adds a resource-based policy statement to the Lambda function for you (some other services require you to explicitly add an <code class="language-plaintext highlighter-rouge">AWS::Lambda::Permission</code>).</p>
<p><img src="/images/posts/vpclattice-3.png" alt="" /></p>
<p>Finally, we’ll create the client which will send requests to the VPC Lattice service. Again, this will be driven via a basic Lambda function. Note that this time, the security group requires an outbound rule towards the VPC Lattice service.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> # Outbound Lambda (Client)
OutboundLambdaFunctionRole:
Type: AWS::IAM::Role
Properties:
AssumeRolePolicyDocument:
Statement:
- Action: sts:AssumeRole
Effect: Allow
Principal:
Service: lambda.amazonaws.com
Policies:
- PolicyName: root
PolicyDocument:
Statement:
- Effect: Allow
Action:
- logs:CreateLogGroup
- logs:CreateLogStream
- logs:PutLogEvents
- xray:PutTraceSegments
- xray:PutTelemetryRecords
- ec2:CreateNetworkInterface
- ec2:DescribeNetworkInterfaces
- ec2:DeleteNetworkInterface
Resource: '*'
OutboundLambdaFunction:
Type: AWS::Lambda::Function
Properties:
Handler: index.handler
Role: !GetAtt OutboundLambdaFunctionRole.Arn
TracingConfig:
Mode: Active
Runtime: python3.9
Environment:
Variables:
ENDPOINT: !GetAtt VPCLatticeServiceNetworkServiceAssociation.DnsEntry.DomainName
Timeout: 10
Code:
ZipFile: |
import os
import json
import http.client
def handler(event, context):
conn = http.client.HTTPSConnection(os.environ["ENDPOINT"])
conn.request("POST", "/", json.dumps(event), {
"Content-Type": 'application/json'
})
res = conn.getresponse()
data = res.read()
print(data.decode("utf-8"))
VpcConfig:
SecurityGroupIds:
- !Ref OutboundLambdaFunctionSecurityGroup
SubnetIds:
- !Ref PrivateSubnet1
OutboundLambdaFunctionSecurityGroup:
Type: AWS::EC2::SecurityGroup
Properties:
GroupDescription: Security group for OutboundLambdaFunction
VpcId: !Ref VPC
SecurityGroupEgress:
- IpProtocol: tcp
FromPort: 443
ToPort: 443
CidrIp: 169.254.171.0/24 # should be the prefix list instead, this'll work though
SecurityGroupIngress: []
GroupName: demo-outboundsg
</code></pre></div></div>
<p>Now that our template is done, we can deploy it via CloudFormation. If you got stuck anywhere, try the pre-made version <a href="https://github.com/iann0036/vpc-lattice-demo">here</a>.</p>
<p>Once deployed, navigate to the Lambda console and find the function named something similar to “OutboundLambdaFunction”. Create a test event using any JSON object and invoke it. You should see the results from the service come back to you by observing the logs.</p>
<p><img src="/images/posts/vpclattice-2.png" alt="" /></p>
<h2 id="a-note-on-pricing">A note on pricing</h2>
<p>It’s worth noting that the pricing model for VPC Lattice is different to that of PrivateLink and will probably end up costing you more overall. For N. Virginia, a PrivateLink service costs $0.01/hour <em>per availability zone</em>, plus $0.01/GB with volume discounts. For the same region, a VPC Lattice service costs $0.025/hour <em>regardless of AZs</em>, plus $0.025/GB with no volume discounts, plus $0.10 per million requests (with the first 300k requests per hour free).</p>
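<p>To make that difference concrete, here is a back-of-the-envelope comparison using the N. Virginia figures above. The workload numbers (3 AZs, 1000 GB, 100 million requests) are illustrative, and PrivateLink volume discounts and the Lattice free request tier are ignored for simplicity:</p>

```python
HOURS_PER_MONTH = 730  # approximate

def privatelink_monthly(azs: int, gb: float) -> float:
    # $0.01/hour per AZ + $0.01/GB processed (volume discounts ignored)
    return 0.01 * azs * HOURS_PER_MONTH + 0.01 * gb

def lattice_monthly(gb: float, million_requests: float) -> float:
    # $0.025/hour flat + $0.025/GB processed + $0.10 per million requests
    # (the free tier of 300k requests/hour is ignored here)
    return 0.025 * HOURS_PER_MONTH + 0.025 * gb + 0.10 * million_requests

print(f"PrivateLink (3 AZs, 1000 GB):          ${privatelink_monthly(3, 1000):.2f}")
print(f"VPC Lattice (1000 GB, 100M requests):  ${lattice_monthly(1000, 100):.2f}")
```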
<h2 id="wrapping-up">Wrapping up</h2>
<p>I’m interested to see how architectures will evolve with this new technology. Whilst PrivateLink remains more affordable and already widespread, I can see architects reaching for this new technology to improve their security posture and reduce the load on networking engineers.</p>
<p>If you liked what I’ve written, or want to hear more on this topic, reach out to me on Twitter at <a href="https://twitter.com/iann0036">@iann0036</a>.</p>Ian MckayCedar: A new policy language2023-01-11T00:00:00+00:002023-01-11T00:00:00+00:00https://onecloudplease.com/blog/cedar-a-new-policy-language<p><img src="/images/posts/cedar-photo.png" alt="" /></p>
<p>Cedar is a new language created by AWS to define access permissions using policies, similar to the way IAM policies work today. In this post, we’ll look at why this language was created, how to author the policies, and some additional features of the language. The language was designed by the <a href="https://www.amazon.science/blog/a-gentle-introduction-to-automated-reasoning">Amazon automated reasoning team</a> for use in new services such as <a href="https://aws.amazon.com/verified-permissions/">Amazon Verified Permissions</a>, <a href="https://aws.amazon.com/verified-access/">AWS Verified Access</a> and likely other future services and integrations.</p>
<h2 id="why-write-a-new-language">Why write a new language?</h2>
<p>IAM policies, introduced <a href="https://aws.amazon.com/blogs/aws/iam-identity-access-management/">over 11 years ago</a>, have been integrated into the AWS ecosystem as the fundamental way to control both human and system access to AWS resources. IAM policies are highly optimized for AWS and have constructs (like ARNs) which make them unsuitable for use with principals and resources outside of AWS.</p>
<p>Cedar is a <em>generalist</em> language which has no implicit AWS constructs within it, and this allows it to be used as an authorization engine for non-AWS applications. This is why it’s used at the core of the Amazon Verified Permissions service, where AWS manages the policy dataset and allows systems to directly make authorization calls against the evaluation engine. Incidentally, the name “Cedar” was coined as a follow on from the internal policy language of IAM, “Balsa”.</p>
<p>Cedar is written in Rust, which allows policy evaluation to complete within milliseconds, and was designed to make it simple to reason about the effect of policies. For example, the design allows for tooling which takes two policies and determines whether they are exactly equivalent, or whether there exist authorization requests whose results would differ when evaluated against each policy.</p>
<h2 id="how-it-works">How it works</h2>
<p>The policy evaluation engine for the Cedar language takes one or more policies, and evaluates whether a requested action is permitted or forbidden (allowed or denied). Cedar requires the principal making the request, the action being taken, the resource being accessed, and optionally additional request context at the time of the authorization call. Cedar also consumes the policies to be evaluated and may also use a list of entities (principals, actions and resources) that exist within your application, however these may be provided ahead of time or indirectly depending upon the service integration.</p>
<p>The request context object may be set by the requesting application or, in the case of AWS Verified Access, <a href="https://docs.aws.amazon.com/verified-access/latest/ug/trust-data-default-context.html">defined</a> by the service.</p>
<p>Cedar has a <a href="https://www.cedarpolicy.com/playground">playground</a> which allows you to play with the engine itself. It is also currently integrated into the Amazon Verified Permissions and AWS Verified Access services. As of the time of writing, Cedar is not available as an open-source or otherwise downloadable library.</p>
<h3 id="syntax">Syntax</h3>
<p>A typical Cedar policy statement looks like the following:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>permit(
principal == User::"John",
action == Action::"view",
resource
)
when {
resource in Folder::"John's Stuff" &&
context.authenticated == true
};
</code></pre></div></div>
<p>A policy can contain a number of statements by simply appending them onto the policy document. The syntax is not whitespace dependent and may be compressed into a single line. Typically, principals and resources should use immutable identifiers and not names. The examples in this post use simple names for readability purposes only.</p>
<p>The policy contains the following parts:</p>
<ol>
<li>The effect, which will always be either <code class="language-plaintext highlighter-rouge">permit</code> or <code class="language-plaintext highlighter-rouge">forbid</code></li>
<li>The scope, which specifies the principals, actions, and resources to which the effect applies</li>
<li>Optionally, condition clauses, which may either be a <code class="language-plaintext highlighter-rouge">when</code> or an <code class="language-plaintext highlighter-rouge">unless</code> condition</li>
</ol>
<p>Entities (principals, actions or resources) will always follow the format <code class="language-plaintext highlighter-rouge">TypeOfEntity::"UniqueIdentifier"</code>. The type of entity may be further namespaced, for example, <code class="language-plaintext highlighter-rouge">Company::Account::Department::Person::"John"</code>.</p>
<p>Entity types are ambiguous and not determined by their namespace. This means a single entity can be either a principal, action or resource, depending upon the specific context. The only exception is that actions must have their rightmost namespace use the keyword <code class="language-plaintext highlighter-rouge">Action</code> (i.e. <code class="language-plaintext highlighter-rouge">Action::"MyAction"</code>, <code class="language-plaintext highlighter-rouge">CustomNamespace::Action::"MyAction"</code>).</p>
<h3 id="evaluation-logic">Evaluation logic</h3>
<p>When evaluating a request, Cedar will consider all statements within the policy, and in the case of Amazon Verified Permissions, all policies provided in a policy store (as if it were one big policy). If <em>any</em> <code class="language-plaintext highlighter-rouge">forbid</code> statement matches the request, the request will be denied, regardless of any <code class="language-plaintext highlighter-rouge">permit</code> statements. If <em>at least one</em> <code class="language-plaintext highlighter-rouge">permit</code> statement matches the request (and no <code class="language-plaintext highlighter-rouge">forbid</code> statements match), the request will be allowed. If no statements match, the request will be implicitly denied.</p>
<p>If you’ve worked with AWS IAM, you’ll recognize Cedar’s policy evaluation logic is the same. This also means that ordering of statements in a policy is irrelevant and has no effect on the outcome of an authorization request.</p>
<p>Because <code class="language-plaintext highlighter-rouge">forbid</code> statements are applied universally without the ability to override, they are commonly used to craft guardrails across the entire policy store.</p>
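<p>The decision procedure above can be summed up in a few lines. The sketch below is a simplified model of the evaluation logic, not the real engine; each statement is reduced to its effect and whether its scope and conditions matched the request:</p>

```python
def decide(statements):
    """statements: list of (effect, matched) tuples, where effect is
    "permit" or "forbid" and matched says the statement applied to the request."""
    if any(effect == "forbid" and matched for effect, matched in statements):
        return "DENY"    # any matching forbid wins, regardless of permits
    if any(effect == "permit" and matched for effect, matched in statements):
        return "ALLOW"   # at least one matching permit, no matching forbid
    return "DENY"        # implicit deny; note statement order never mattered above

print(decide([("permit", True), ("forbid", True)]))   # DENY
print(decide([("permit", True), ("forbid", False)]))  # ALLOW
print(decide([("permit", False)]))                    # DENY
```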
<h3 id="the-scope">The scope</h3>
<p>The scope is written in a way that almost looks like a set of arguments in a function. It always consists of the keywords <code class="language-plaintext highlighter-rouge">principal</code>, <code class="language-plaintext highlighter-rouge">action</code> and <code class="language-plaintext highlighter-rouge">resource</code>. Each of these keywords may optionally be followed by either an <code class="language-plaintext highlighter-rouge">== Some::"Entity" </code> or an <code class="language-plaintext highlighter-rouge">in Some::"Group"</code> to scope down the principals, actions or resources in which the statement applies to. In addition, an inline set in the form <code class="language-plaintext highlighter-rouge">in [ Some::"Entity", SomeOther::"Entity", ... ]</code> can be used for the <code class="language-plaintext highlighter-rouge">action</code> keyword only. When no keywords have this suffix, the policy applies to all requests, so long as the conditions are met.</p>
<p>The scope is generally used for role-based access control, where you would like to apply policies scoped to a specific defined or set of resources, actions, principals, or combination thereof.</p>
<h3 id="condition-clauses">Condition clauses</h3>
<p>Condition clauses further limit whether a policy takes effect for the specific request. Typically policy statements will either have no condition clauses or one condition clause, however the syntax does allow for any number of condition clauses to form a statement.</p>
<p>Condition clauses are more flexible than the scope, featuring a basic set of <a href="https://docs.aws.amazon.com/verified-access/latest/ug/built-in-policy-operators.html">operators</a> to allow you to form a boolean result of acceptance based off of the principal, action, resource or context of the request, as well as the attributes or nested hierarchy of these entities where a list of entities has been defined. The use of logical operators such as <code class="language-plaintext highlighter-rouge">&&</code> and <code class="language-plaintext highlighter-rouge">||</code> allow you to form long, complex conditions to match your specific requirements. The <code class="language-plaintext highlighter-rouge">like</code> operator allows you to perform string matching with the use of a <code class="language-plaintext highlighter-rouge">*</code> wildcard character.</p>
<p>Condition clauses are intended to perform attribute-based access control. Though it is possible to include scope conditions within a condition clause, exactly the way you would in the scope, it’s recommended that you retain those scope conditions in the scope for both readability and performance reasons.</p>
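<p>For instance, a single condition clause can combine several of these operators. In the following illustrative sketch, the <code class="language-plaintext highlighter-rouge">owner</code>, <code class="language-plaintext highlighter-rouge">username</code>, <code class="language-plaintext highlighter-rouge">level</code> and <code class="language-plaintext highlighter-rouge">name</code> attributes are hypothetical:</p>

```
permit(
    principal in UserGroup::"Engineering",
    action == Action::"view",
    resource
) when {
    resource.owner == principal.username ||
    (principal.level >= 5 && resource.name like "public*")
};
```

<p>This permits a view of a resource either to its owner, or to sufficiently senior engineers when the resource name starts with “public”.</p>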
<h2 id="additional-language-features">Additional language features</h2>
<p>Using the above syntax is all you need to start writing basic statements that permit or forbid access to your application; however, there are some more features of the language which we’ll go through. Some of these features may not be available or useful depending on the service into which Cedar is integrated.</p>
<h3 id="comments">Comments</h3>
<p>Policies may contain the <code class="language-plaintext highlighter-rouge">//</code> syntax to add comments, which are particularly useful for explaining what an abstract identifier refers to, for example:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// the following was added by the accounts team
// it was approved by Jane Doe
permit(
principal == User::"9a6afab1-5a37-4c90-aa40-24277b93ca28", // John Smith
action,
resource == Account::"710f18bc-b8ab-4313-b362-8e6264cfcf91" // MyCorp Dev Account
);
</code></pre></div></div>
<h3 id="entities">Entities</h3>
<p>Cedar supports accepting a list of known entities (resources, actions or principals) within a system. This is helpful as you may author policies which interact with the hierarchy or attributes of the entities within condition clauses. When an authorization request is made, the principal, action and resource identifiers will correlate to the defined entity of the same identifier when present in the entity list.</p>
<p>The structure of the entity list differs from service to service. In the Cedar playground, the entity list looks like the following:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[
{
"uid": "User::\"john\"",
"parents": [
"UserGroup::\"Staff\""
],
"attrs": {
"department": "Hardware Engineering",
"age": 30
}
},
{
"uid": "UserGroup::\"Staff\""
}
]
</code></pre></div></div>
<p>In Amazon Verified Permissions (for an <code class="language-plaintext highlighter-rouge">IsAuthorized</code> call), the same entity list would look like this:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[
{
"EntityId": {
"EntityType": "User",
"EntityId": "john"
},
"Parents": [
{
"EntityType": "UserGroup",
"EntityId": "Staff"
}
],
"Attributes": {
"department": {
"String": "Hardware Engineering"
},
"age": {
"Long": 30
}
}
},
{
"EntityId": {
"EntityType": "UserGroup",
"EntityId": "Staff"
}
}
]
</code></pre></div></div>
<p>We can use the known attributes in the entity to construct policies that permit or forbid access. For example:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>permit(
principal,
action == Action::"Access",
resource == Room::"Drinks Lounge"
) when {
principal.age >= 18
};
</code></pre></div></div>
<p>This policy allows access only when the principal has the attribute “age”, and its value is equal to or greater than the number 18. If the age attribute wasn’t set, or the principal wasn’t defined at all in the entities list, this statement wouldn’t permit access.</p>
<p>Entities can also form a hierarchy, at any nesting level, and policies can act based on this. For example:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>permit(
principal,
action == Action::"Access",
resource == Room::"Common Area"
) when {
principal in UserGroup::"Staff"
};
</code></pre></div></div>
<p>This policy grants access to any entity which has the <code class="language-plaintext highlighter-rouge">UserGroup::"Staff"</code> entity as a parent. Once again, if the entity isn’t defined or isn’t a child of <code class="language-plaintext highlighter-rouge">UserGroup::"Staff"</code>, this statement wouldn’t permit access. The <code class="language-plaintext highlighter-rouge">in</code> operator applies to both direct children and all descendants of those children. Additionally, the <code class="language-plaintext highlighter-rouge">in</code> operator also applies to the referenced parent itself, i.e. if the principal was <code class="language-plaintext highlighter-rouge">UserGroup::"Staff"</code> in the above example, the policy would permit access.</p>
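<p>To illustrate this transitivity with some hypothetical entities: if <code class="language-plaintext highlighter-rouge">User::"jo"</code> has the parent <code class="language-plaintext highlighter-rouge">UserGroup::"Sydney Staff"</code>, which in turn has the parent <code class="language-plaintext highlighter-rouge">UserGroup::"Staff"</code>, each of the following expressions evaluates to true:</p>

```
User::"jo" in UserGroup::"Sydney Staff"   // direct parent
User::"jo" in UserGroup::"Staff"          // grandparent, via the hierarchy
UserGroup::"Staff" in UserGroup::"Staff"  // an entity is always "in" itself
```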
<h3 id="extensions">Extensions</h3>
<p>In addition to the base data types of strings, booleans, integers and sets/arrays, Cedar supports two additional data types: IP addresses and decimals. These data types can only be declared using a function call-like syntax, and can only be operated on using their in-built methods. They are known as extensions.</p>
<p>In the case of IP addresses, the syntax looks like the following:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>permit(
principal,
action,
resource
) when {
ip(context.client_ip).isInRange(ip("10.0.0.0/8"))
};
</code></pre></div></div>
<p>The IP address type is created using the <code class="language-plaintext highlighter-rouge">ip(...)</code> syntax, and calls the <code class="language-plaintext highlighter-rouge">isInRange(...)</code> function to return a boolean. A similar effect is seen for the use of the decimal types:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>forbid(
principal,
action,
resource
) when {
decimal(context.risk_score).greaterThan(decimal("7.2"))
};
</code></pre></div></div>
<p>Because Cedar does not allow any floating point types to be passed in, inputs must be in the form of a string (e.g. “8.24”). Decimal supports up to 4 digits after the decimal point.</p>
<p>Both extensions have a number of other methods available, all of which currently return a boolean result.</p>
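<p>As a rough summary (consult the operator documentation linked above for the authoritative list), the methods available on the two extension types include the following, shown here as standalone expressions:</p>

```
ip("127.0.0.1").isLoopback()                        // true
ip("10.0.0.5").isIpv4()                             // true
ip("::1").isIpv6()                                  // true
ip("224.0.0.1").isMulticast()                       // true
ip("10.0.1.54").isInRange(ip("10.0.0.0/8"))         // true
decimal("1.5").lessThan(decimal("2.0"))             // true
decimal("2.0").lessThanOrEqual(decimal("2.0"))      // true
decimal("3.1").greaterThan(decimal("3.0"))          // true
decimal("3.1").greaterThanOrEqual(decimal("3.1"))   // true
```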
<h3 id="policy-templates">Policy templates</h3>
<p>Policy templates are a Cedar feature useful for applying a common policy to a large group of principals or resources. A policy template allows you to add a variable substitution to the equality operators in the scope block for the <code class="language-plaintext highlighter-rouge">principal</code> and/or <code class="language-plaintext highlighter-rouge">resource</code> keywords. A policy template by itself has no effect, but allows policies to be created by simply providing the variable values instead of duplicating the full syntax. Policies generated from a policy template automatically update if the policy template changes. A policy template may look like this:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>permit(
principal == ?principal,
action == Action::"download",
resource in ?resource
) when {
context.mfa == true
};
</code></pre></div></div>
<p>The <code class="language-plaintext highlighter-rouge">?principal</code> and <code class="language-plaintext highlighter-rouge">?resource</code> keywords represent the variables that may be substituted. A policy created from this template would allow the principal to download all children of the resource when accessing using MFA.</p>
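<p>For example, a policy created from this template with <code class="language-plaintext highlighter-rouge">?principal</code> set to <code class="language-plaintext highlighter-rouge">User::"Alice"</code> and <code class="language-plaintext highlighter-rouge">?resource</code> set to <code class="language-plaintext highlighter-rouge">Folder::"Reports"</code> (both hypothetical values) would behave as if the following policy existed:</p>

```
permit(
    principal == User::"Alice",
    action == Action::"download",
    resource in Folder::"Reports"
) when {
    context.mfa == true
};
```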
<h2 id="examples">Examples</h2>
<p>The following is a set of examples to help you get started and understand the language.</p>
<h3 id="allow-all">Allow all</h3>
<p><em>Policy:</em></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>permit(
principal,
action,
resource
);
</code></pre></div></div>
<p>This statement permits all requests. It may be restricted by <code class="language-plaintext highlighter-rouge">forbid</code> statements elsewhere in the policy set.</p>
<h3 id="deny-all">Deny all</h3>
<p><em>Policy:</em></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>forbid(
principal,
action,
resource
);
</code></pre></div></div>
<p>This statement forbids all requests. It cannot be overridden and renders all other statements in the policy set useless.</p>
<h3 id="specific-rbac-policy">Specific RBAC policy</h3>
<p><em>Policy:</em></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>permit(
principal == Customer::"John",
action == Action::"checkout",
resource == CheckoutCounter::"12"
);
</code></pre></div></div>
<p>This statement allows customer “John” to checkout at checkout counter 12.</p>
<h3 id="when-condition-clause">When condition clause</h3>
<p><em>Policy:</em></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>permit(
principal,
action == Action::"connectDatabase",
resource == Database::"db1"
) when {
context.port == 5432
};
</code></pre></div></div>
<p><em>Context:</em></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{
"port": 5432
}
</code></pre></div></div>
<p>This statement allows any principal to connect to database “db1”, so long as the “port” attribute in their request context is 5432.</p>
<h3 id="unless-condition-clause">Unless condition clause</h3>
<p><em>Policy:</em></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>permit(
principal,
action in [HTTPMethod::Action::"GET", HTTPMethod::Action::"POST", HTTPMethod::Action::"DELETE"],
resource
) unless {
[Viewer::"anonymous", Viewer::"unknown"].contains(principal) ||
context.waf_risk_rating >= 7
};
</code></pre></div></div>
<p><em>Context:</em></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{
"waf_risk_rating": 8.5
}
</code></pre></div></div>
<p>This statement allows any principal to perform a HTTP GET, POST or DELETE against any resource unless they are identified as an anonymous or unknown viewer or their WAF risk rating is greater than or equal to 7.</p>
<h3 id="ip-and-decimal-usage">IP and decimal usage</h3>
<p><em>Policy:</em></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>permit(
principal,
action == HTTPMethod::Action::"GET",
resource
) when {
(
// local subnet or same machine
ip(context.http_request.client_ip).isInRange(ip("10.0.0.0/8")) ||
ip(context.http_request.client_ip).isLoopback()
) &&
decimal(context.risk_score).lessThan(decimal("6.5"))
};
</code></pre></div></div>
<p><em>Context:</em></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{
"http_request": {
"client_ip": "10.0.1.54"
},
"risk_score": "4.7"
}
</code></pre></div></div>
<p>This statement allows any principal to perform a HTTP GET against any resource when their IP address is within the 10.0.0.0/8 CIDR range or is a loopback address, and the value of the string-encoded risk score is less than 6.5.</p>
<h3 id="entity-attributes">Entity attributes</h3>
<p><em>Policy:</em></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>permit(
principal,
action == SecuritySystem::Action::"swipeCardAccess",
resource == Room::"Sydney Boardroom"
) when {
principal.location like "Sydney*" ||
principal.training.contains("All Access")
};
</code></pre></div></div>
<p><em>Entities:</em></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[
{
"uid": "Employee::\"1453\"",
"attrs": {
"location": "Sydney East",
"training": [
"General"
]
}
},
{
"uid": "Employee::\"325\"",
"attrs": {
"location": "Los Angeles",
"training": [
"General",
"All Access"
]
}
}
]
</code></pre></div></div>
<p>This statement allows any principal to swipe card access to the Sydney Boardroom if their location attribute starts with “Sydney” or their training attribute contains the “All Access” item. Both employees 1453 and 325 would be permitted under this statement.</p>
<h3 id="entity-attributes-relationship">Entity attributes relationship</h3>
<p><em>Policy:</em></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>permit(
principal,
action == HTTP::Action::"GET",
resource
) when {
resource.owner == principal.username
};
</code></pre></div></div>
<p><em>Entities:</em></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[
{
"uid": "User::\"Josh\"",
"attrs": {
"username": "josh1"
}
},
{
"uid": "File::\"blogpost.txt\"",
"attrs": {
"owner": "josh1"
}
}
]
</code></pre></div></div>
<p>This statement allows any principal to HTTP GET a file which they have ownership of. The entity <code class="language-plaintext highlighter-rouge">User::"Josh"</code> would be permitted to perform a <code class="language-plaintext highlighter-rouge">HTTP::Action::"GET"</code> on the <code class="language-plaintext highlighter-rouge">File::"blogpost.txt"</code> entity.</p>
<h3 id="entity-inheritance">Entity inheritance</h3>
<p><em>Policy:</em></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>forbid(
principal,
action,
resource == Application::"oracle"
) unless {
principal in Group::"Admins"
};
</code></pre></div></div>
<p><em>Entities:</em></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[
{
"uid": "User::\"Ian\"",
"parents": [
"Group::\"Admins\"",
"Group::\"Users\""
]
}
]
</code></pre></div></div>
<p>This statement forbids any principal from performing any action against the oracle application unless they are a part of the Admins group. The entity <code class="language-plaintext highlighter-rouge">User::"Ian"</code> would be exempt from this forbid statement.</p>
<h3 id="policy-template">Policy template</h3>
<p><em>Policy Template:</em></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>permit(
principal == ?principal,
action == Action::"Connect",
resource == ?resource
);
</code></pre></div></div>
<p><em>Policy Variables:</em></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>principal: User::"Harry"
resource: VPN::"vpn1"
</code></pre></div></div>
<p>The policy created from the policy template allows the user Harry to connect to the VPN “vpn1”.</p>
<h2 id="wrapping-up">Wrapping up</h2>
<p>The Cedar language is both excitingly new and comfortingly familiar. It opens a new world of possible use cases and, of course, a new set of challenges and considerations. I look forward to seeing how the language gets used in real world scenarios and the ways people will architect their applications around the services Cedar supports.</p>
<p>A big thank you to members from the identity and automated reasoning teams for helping answer some questions I had during the creation of this post. If you liked what I’ve written, or want to hear more on this topic, reach out to me on Twitter at <a href="https://twitter.com/iann0036">@iann0036</a>.</p>
<p><em>Ian Mckay</em></p>
<h1 id="patching-the-aws-javascript-sdk-for-service-workers">Patching the AWS JavaScript SDK for Service Workers</h1>
<p><em>2022-01-11</em> · <a href="https://onecloudplease.com/blog/patching-the-aws-js-sdk">https://onecloudplease.com/blog/patching-the-aws-js-sdk</a></p>
<p><img src="/images/posts/nodejssw.png" alt="" /></p>
<p>The AWS JavaScript SDK supports Node.js, React Native and web browsers, but what if you’re running in a <a href="https://developers.google.com/web/fundamentals/primers/service-workers">service worker</a>? In this post, I’ll explain how I modified version 2 of the AWS JavaScript SDK to run within a service worker context.</p>
<h2 id="background">Background</h2>
<p>For the <a href="https://onecloudplease.com/project/former2">Former2</a> project, I produce browser extensions for most major browsers in order to bypass the lack of CORS for the <a href="https://github.com/aws/aws-sdk-js/blob/master/SERVICES.md">majority</a> of AWS services. This means that I embed a copy of the AWS JavaScript SDK in order to make the calls needed via the browser extension, which has authority to ignore the lack of CORS.</p>
<p>The browser extensions use a “manifest”, which details the functionality of the extension and what actions are permitted. Google is <a href="https://developer.chrome.com/docs/extensions/mv3/mv2-sunset/">sunsetting</a> version 2 of the manifest for Google Chrome and requires all extensions to move to manifest version 3 by the end of 2022. Along with some <a href="https://developer.chrome.com/docs/extensions/mv3/intro/mv3-migration/">structural</a> differences, one of the major changes required is to move from background pages (logic that runs in the background of an extension) to service workers.</p>
<p>Service workers (which are a subset of JavaScript workers) have greater limitations than background pages, including the lack of access to the DOM and its features, as well as the replacement of <a href="https://developer.mozilla.org/en-US/docs/Web/API/Worker">XMLHttpRequest for fetch</a>. Service workers will also move to an inactive state if unused in a short period of time, meaning initialized variable data isn’t persisted, though I’ve skipped talking about my specific remediations to this in this article (hint: use IndexedDB).</p>
<h2 id="the-challenge">The Challenge</h2>
<p>Version 3 of the AWS JavaScript SDK is written in a way that supports running in a service worker context, but version 2 is not, for a variety of reasons. If you’re already using version 3 of the SDK, or are starting development on a service worker from scratch using version 3, you won’t have a problem.</p>
<p>As the Former2 project heavily relies on the syntax of version 2 of the SDK, and calls a majority of the services available in the SDK, I wanted to avoid a migration effort to version 3. Others with existing projects making heavy use of SDK version 2 that are seeking to move to service workers (or <a href="https://workers.cloudflare.com/">CloudFlare Workers</a>) might also benefit from this.</p>
<p>Note that this is not an official change, and these changes could break current or future functionality in unintended ways, so I don’t recommend you use this in a production context.</p>
<h2 id="attempting-to-import">Attempting to import</h2>
<p>After performing the changes to the browser extension manifest, my first issue was that the SDK script could no longer be directly loaded into the shared DOM model.</p>
<p><strong>Before:</strong></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>"background": {
"scripts": [
"aws-sdk-2.1046.0.js",
"bg.js"
]
},
</code></pre></div></div>
<p><strong>After:</strong></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>"background": {
"service_worker": "bg.js"
},
</code></pre></div></div>
<p>Service workers come with a way to load scripts using the <a href="https://developer.mozilla.org/en-US/docs/Web/API/WorkerGlobalScope/importScripts">importScripts()</a> function. So I added the following to the top of my <code class="language-plaintext highlighter-rouge">bg.js</code> script:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>importScripts("aws-sdk-2.1046.0.js");
</code></pre></div></div>
<p>With this addition in place, the AWS calls I asked the extension to make now failed silently, without much debugging information.</p>
<p>It’s at this point that I’d like to call out <a href="https://github.com/sk16">Saurav Kushwaha</a> for his <a href="https://github.com/aws/aws-sdk-js/issues/1902">prior work</a> in this area, which overrides the XHRClient class used in the AWS namespace with <a href="https://github.com/iann0036/aws-sdk-serviceworker/blob/master/lib/http/xhr.js#L48">fetch</a>. I did need to perform a couple of slight modifications to properly return correct error codes however.</p>
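<p>To give a sense of what that override involves, here is a minimal, illustrative sketch (not the actual patched XHRClient code) of translating the request object shape the SDK hands its HTTP client into the options object that <code class="language-plaintext highlighter-rouge">fetch()</code> expects. The field names follow the general shape of the SDK’s HttpRequest objects:</p>

```javascript
// Illustrative only: this is not the real XHRClient replacement,
// just the request-translation idea behind it.
function toFetchOptions(httpRequest) {
  var options = {
    method: httpRequest.method,
    headers: {}
  };
  Object.keys(httpRequest.headers || {}).forEach(function (name) {
    // fetch manages some headers itself (e.g. Host, Content-Length),
    // so forward everything else and let fetch set those
    if (["host", "content-length"].indexOf(name.toLowerCase()) === -1) {
      options.headers[name] = httpRequest.headers[name];
    }
  });
  // GET/HEAD requests must not carry a body
  if (httpRequest.body && httpRequest.method !== "GET" && httpRequest.method !== "HEAD") {
    options.body = httpRequest.body;
  }
  return options;
}

var opts = toFetchOptions({
  method: "POST",
  headers: { "Host": "sts.amazonaws.com", "X-Amz-Target": "GetCallerIdentity" },
  body: "{}"
});
console.log(opts.method);                  // "POST"
console.log("Host" in opts.headers);       // false
console.log(opts.headers["X-Amz-Target"]); // "GetCallerIdentity"
```

<p>The real replacement also needs to map the fetch response (status, headers and streamed body) back onto the callback interface the SDK expects, which is where the error code fixes mentioned above come in.</p>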
<p>After replacing the XHRClient class, I was happy to see that some calls were successfully returning, but for some reason there were still some failures.</p>
<h2 id="xml-is-hard">XML is hard</h2>
<p>The failures I was seeing were coming from STS and S3, and I quickly realised that these were APIs that returned XML-based responses.</p>
<p>One immediate problem that actually showed error logs was that <code class="language-plaintext highlighter-rouge">window</code> was not defined, where parts of the SDK expected it to be available.</p>
<p><img src="/images/posts/sw1.png" alt="" /></p>
<p>I quickly added a one-liner to make that available during initialisation:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>if(!window){var window = {}};
</code></pre></div></div>
<p>After that change, I was now receiving an error that it could not load the XML parser.</p>
<p><img src="/images/posts/sw2.png" alt="" /></p>
<p>Digging into the SDK, the logic looked like the following:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>if (window.DOMParser) {
// use the native DOM parser library
} else if (window.ActiveXObject) {
// use the ActiveXObject to parse, a fallback for IE8 and lower
} else {
throw new Error("Cannot load XML parser");
}
</code></pre></div></div>
<p>The SDK relies on the native DOM parser to interpret XML responses from those services, so in order to alleviate this I decided to find a polyfill to replace it. I came across the <a href="https://www.npmjs.com/package/@xmldom/xmldom">xmldom</a> module on npm and found it suitable for my needs. I did need to bundle this into a browser-compatible library, so I used <a href="https://github.com/browserify/browserify">browserify</a> to achieve this.</p>
<p>After importing the new DOM parser library for use by the SDK, I re-tested the calls which produced a valid response end-to-end. All done, or so I thought.</p>
<h2 id="something-strange">Something strange</h2>
<p>Though my application now seemed to be working well, producing no errors and always returning valid responses, I noticed that many of my list calls (for example, <code class="language-plaintext highlighter-rouge">S3.ListBuckets</code>) weren’t returning the resources in my account that I expected.</p>
<p>I suspected some issues with the XML parser and dumped both the response of the HTTP call, and the object immediately after xmldom had parsed it. Both of these correctly showed the bucket names I was expecting, yet the response produced an empty array.</p>
<p><img src="/images/posts/sw3.png" alt="" /></p>
<p><img src="/images/posts/sw4.png" alt="" /></p>
<p>This one hurt my head. After debugging for probably a few hours, I found the issue. During the process of constructing the response in a clean format, the SDK requests the properties <a href="https://developer.mozilla.org/en-US/docs/Web/API/Element/firstElementChild"><code class="language-plaintext highlighter-rouge">Element.firstElementChild</code></a> and <a href="https://developer.mozilla.org/en-US/docs/Web/API/Element/nextElementSibling"><code class="language-plaintext highlighter-rouge">Element.nextElementSibling</code></a> from the parsed object, however xmldom <a href="https://github.com/xmldom/xmldom/issues/328">had not yet implemented</a> these properties and so the iterators were silently failing.</p>
<p>After having a look at the xmldom library to investigate whether it could be easily patched, I instead simply implemented these properties as methods directly and replaced the SDK code which accesses these properties with my implementation, as shown below:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// Polyfill for Element.firstElementChild: returns the first child
// node that is an element (text nodes have no tagName property)
function getFirstElementChild(xml) {
    for (var i = 0; i < xml.childNodes.length; i++) {
        if (xml.childNodes[i].hasOwnProperty('tagName')) {
            return xml.childNodes[i];
        }
    }
    return null;
}

// Polyfill for Element.nextElementSibling: scans the parent's children
// and returns the first element node appearing after this one
function getNextElementSibling(xml) {
    var foundSelf = false;
    for (var i = 0; i < xml.parentNode.childNodes.length; i++) {
        if (xml.parentNode.childNodes[i] === xml) {
            foundSelf = true;
            continue;
        }
        if (foundSelf && xml.parentNode.childNodes[i].hasOwnProperty('tagName')) {
            return xml.parentNode.childNodes[i];
        }
    }
    return null;
}
</code></pre></div></div>
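<p>As a quick sanity check, the helpers behave as expected against a minimal mock of the node shape xmldom produces. The helper definitions are repeated so the snippet runs standalone, and the node objects are simplified stand-ins, not real xmldom nodes:</p>

```javascript
function getFirstElementChild(xml) {
  for (var i = 0; i < xml.childNodes.length; i++) {
    if (xml.childNodes[i].hasOwnProperty('tagName')) {
      return xml.childNodes[i];
    }
  }
  return null;
}
function getNextElementSibling(xml) {
  var foundSelf = false;
  for (var i = 0; i < xml.parentNode.childNodes.length; i++) {
    if (xml.parentNode.childNodes[i] === xml) {
      foundSelf = true;
      continue;
    }
    if (foundSelf && xml.parentNode.childNodes[i].hasOwnProperty('tagName')) {
      return xml.parentNode.childNodes[i];
    }
  }
  return null;
}

// simplified stand-ins for xmldom nodes: a parent element containing
// a whitespace text node (no tagName) followed by two element children
var parent = { childNodes: [] };
var text = { data: "\n  ", parentNode: parent };
var first = { tagName: "Name", parentNode: parent, childNodes: [] };
var second = { tagName: "CreationDate", parentNode: parent, childNodes: [] };
parent.childNodes.push(text, first, second);

console.log(getFirstElementChild(parent).tagName); // "Name"
console.log(getNextElementSibling(first).tagName); // "CreationDate"
console.log(getNextElementSibling(second));        // null
```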
<h2 id="wrapping-up">Wrapping up</h2>
<p>After all the above changes were made, I was able to produce a version of the version 2 SDK which, from all the tests I’ve made, seems to work as intended within a service worker context.</p>
<p>I’ve made a version of the service worker-compatible SDK available on <a href="https://github.com/iann0036/aws-sdk-serviceworker">GitHub</a>, should you want to compile your own. Refer to the <a href="https://docs.aws.amazon.com/sdk-for-javascript/v2/developer-guide/building-sdk-for-browsers.html">official docs</a> for specific compilation options, as they should work the same.</p>
<p>I got pretty close to abandoning this experiment, but I’m glad I persisted. I learned a lot about the internals of the SDK and got a working alternative in the end. If you liked what I’ve written, or want to tell me how terrible of an idea this was, reach out to me on Twitter at <a href="https://twitter.com/iann0036">@iann0036</a>.</p>
<p><em>Ian Mckay</em></p>
<h1 id="migrating-to-opensearch-with-cloudformation">Migrating to OpenSearch with CloudFormation</h1>
<p><em>2021-10-05</em> · <a href="https://onecloudplease.com/blog/migrating-to-opensearch-with-cloudformation">https://onecloudplease.com/blog/migrating-to-opensearch-with-cloudformation</a></p>
<p><img src="/images/posts/os-main.png" alt="" /></p>
<p>Last month, AWS <a href="https://aws.amazon.com/blogs/aws/amazon-elasticsearch-service-is-now-amazon-opensearch-service-and-supports-opensearch-10/">announced</a> that the Amazon Elasticsearch Service has become Amazon OpenSearch Service. This change has effectively stopped any further updates to the Elasticsearch product line within the service due to <a href="https://www.elastic.co/blog/why-license-change-AWS">changes</a> to Elastic’s licensing model, and the <a href="https://aws.amazon.com/blogs/opensource/stepping-up-for-a-truly-open-source-elasticsearch/">forked</a> OpenSearch will become the only product line to receive updates in the future.</p>
<p>In this post, we’ll walk through how to migrate from an Elasticsearch domain to an OpenSearch domain if you already have your domain defined within CloudFormation. If you don’t have your domain defined in CloudFormation but would like to, we’ll also cover that.</p>
<h2 id="the-changes">The changes</h2>
<p>Let’s go through the changes in the service. The service name itself has changed from Amazon Elasticsearch Service to Amazon OpenSearch Service or, annoyingly, its current full canonical name “Amazon OpenSearch Service (successor to Amazon Elasticsearch Service)”. The Kibana equivalent is now known as OpenSearch Dashboards.</p>
<p><img src="/images/posts/os-2.png" alt="" /></p>
<p>The 18 Elasticsearch versions currently supported in the service are the last Elasticsearch versions there will ever be, with only OpenSearch versions being added in the future. This opens the possibility of features exclusive to OpenSearch, though Elastic could easily incorporate OpenSearch features into Elasticsearch if it wanted, despite the reverse no longer being an option. OpenSearch has added three such exclusive features in its first version: <a href="https://opensearch.org/docs/im-plugin/index-transforms/index/">Transforms</a>, <a href="https://opensearch.org/docs/opensearch/data-streams/">Data Streams</a>, and <a href="https://opensearch.org/docs/dashboards/notebooks/">Notebooks</a>.</p>
<p><img src="/images/posts/os-1.png" alt="" /></p>
<p>Though AWS has provided an easy upgrade path from Elasticsearch to OpenSearch within the console, the same cannot be said about CloudFormation which has created a new resource type for OpenSearch. You <b>must not</b> simply update the CloudFormation type in your template, as this will lead to the deletion of your domain and all data within it.</p>
<p>We will be migrating from one stack to another in this case (which is helpful as many of you may have used “elasticsearch” or “es” in your stack name), although you could follow a similar approach entirely within an existing stack. Despite the <a href="https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/resource-import-supported-resources.html">CloudFormation docs</a> indicating OpenSearch resources are not supported for this operation (as of the time this was published), almost all resource types created from about the end of 2019 onwards have CloudFormation import support, as is the case here.</p>
<p>Before beginning, you should first confirm that your configuration won’t be affected by the minor <a href="https://docs.aws.amazon.com/opensearch-service/latest/developerguide/version-migration.html">breaking changes</a> and you should <a href="https://docs.aws.amazon.com/opensearch-service/latest/developerguide/managedomains-snapshots.html#managedomains-snapshot-create">take a manual snapshot</a> for safety, as the process is irreversible. This is a dangerous process if you don’t take care, so please follow all steps and precautions carefully if your data is critical.</p>
<h2 id="preparing-your-existing-stack">Preparing your existing stack</h2>
<p>To begin, we will prepare the existing stack to be deprecated. If you do not currently have your domain within a CloudFormation stack (but would like to), you can instead generate a template for your new domain using <a href="https://aws.amazon.com/blogs/opensource/accelerate-infrastructure-as-code-development-with-open-source-former2/">Former2</a>.</p>
<p>My existing template looks like this:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Resources:
...
MyElasticsearchDomain:
Type: AWS::Elasticsearch::Domain
Properties:
DomainName: mydomain
ElasticsearchClusterConfig:
InstanceCount: 3
InstanceType: t3.medium.elasticsearch
DedicatedMasterEnabled: true
DedicatedMasterType: t3.medium.elasticsearch
DedicatedMasterCount: 3
ZoneAwarenessEnabled: true
ZoneAwarenessConfig:
AvailabilityZoneCount: 3
EBSOptions:
EBSEnabled: true
VolumeSize: 20
VolumeType: gp2
ElasticsearchVersion: "7.10"
...
</code></pre></div></div>
<p>Your template may have a number of different properties and other resources not shown here. Take a copy of the contents of your template now, and save this for later.</p>
<p>Now, we’re going to add the <code class="language-plaintext highlighter-rouge">DeletionPolicy</code> and <code class="language-plaintext highlighter-rouge">UpdateReplacePolicy</code> attributes to the resource with the value <code class="language-plaintext highlighter-rouge">Retain</code>, like so:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Resources:
...
MyElasticsearchDomain:
DeletionPolicy: Retain
UpdateReplacePolicy: Retain
Type: AWS::Elasticsearch::Domain
Properties:
DomainName: mydomain
ElasticsearchClusterConfig:
InstanceCount: 3
InstanceType: t3.medium.elasticsearch
DedicatedMasterEnabled: true
DedicatedMasterType: t3.medium.elasticsearch
DedicatedMasterCount: 3
ZoneAwarenessEnabled: true
ZoneAwarenessConfig:
AvailabilityZoneCount: 3
EBSOptions:
EBSEnabled: true
VolumeSize: 20
VolumeType: gp2
ElasticsearchVersion: "7.10"
...
</code></pre></div></div>
<p>By doing this, we are telling CloudFormation to not touch the connected resource (the domain) when this resource is deleted from the stack. In my case, I also had a number of CloudWatch alarms and even some custom resources for index and search template creation defined in my stack. These aren’t important to me during this short migration exercise, so I simply deleted them outright from my template as they will be recreated based on the template copy we took previously. My template now consists of only the <code class="language-plaintext highlighter-rouge">AWS::Elasticsearch::Domain</code> resource, as well as any parameters and outputs that existed previously.</p>
<p>Update your stack now with the new content. As expected, supporting resources such as CloudWatch alarms (if any) will be deleted during this process; however, the Elasticsearch domain itself remains untouched.</p>
<h2 id="upgrading-your-domain-in-place">Upgrading your domain in-place</h2>
<p>Next up, we’ll upgrade the domain in-place. Note that this can cause a brief period of downtime, so plan accordingly. My upgrade, on a domain with a small number of documents, took approximately 30 minutes.</p>
<p>Open the <a href="https://console.aws.amazon.com/esv3/">AWS Management Console</a>, choose the domain that you want to upgrade, choose <strong>Actions</strong>, and then select <strong>Upgrade</strong>.</p>
<p><img src="/images/posts/os-8.png" alt="" /></p>
<p>Choose <strong>OpenSearch 1.0</strong> as the version to upgrade to, and I highly recommend selecting the <strong>Enable compatibility mode</strong> option to reduce the risk of any incompatibility issues. Check upgradeability and once verified, you can select the <strong>Upgrade</strong> operation.</p>
<p><img src="/images/posts/os-9.png" alt="" /></p>
<p>You can prepare the next steps whilst waiting for the upgrade to complete.</p>
<h2 id="preparing-your-new-template">Preparing your new template</h2>
<p>We’ll now prepare our new template for our new OpenSearch-specific stack. If you previously used <a href="https://aws.amazon.com/blogs/opensource/accelerate-infrastructure-as-code-development-with-open-source-former2/">Former2</a> to define your stack, you’ll only need to make the <code class="language-plaintext highlighter-rouge">DeletionPolicy</code> change from the steps below.</p>
<p>Using the copy of your original template, carefully make the following adjustments:</p>
<ul>
<li>Change the domain resource type from <code class="language-plaintext highlighter-rouge">AWS::Elasticsearch::Domain</code> to <code class="language-plaintext highlighter-rouge">AWS::OpenSearchService::Domain</code></li>
<li>Add the <code class="language-plaintext highlighter-rouge">DeletionPolicy</code> and <code class="language-plaintext highlighter-rouge">UpdateReplacePolicy</code> attributes to the resource, as previously performed</li>
<li>In the domain resource, change the <code class="language-plaintext highlighter-rouge">ElasticsearchVersion</code> property to <code class="language-plaintext highlighter-rouge">EngineVersion</code> and set its value to <code class="language-plaintext highlighter-rouge">OpenSearch_1.0</code></li>
<li>In the domain resource, change the <code class="language-plaintext highlighter-rouge">ElasticsearchClusterConfig</code> property to <code class="language-plaintext highlighter-rouge">ClusterConfig</code>, if set</li>
<li>In the domain resource, for everywhere an instance type is defined (<code class="language-plaintext highlighter-rouge">InstanceType</code> and <code class="language-plaintext highlighter-rouge">DedicatedMasterType</code>), change the <code class="language-plaintext highlighter-rouge">.elasticsearch</code> suffix to <code class="language-plaintext highlighter-rouge">.search</code></li>
<li>In the domain resource, under <code class="language-plaintext highlighter-rouge">ClusterConfig</code>, if you have specified <code class="language-plaintext highlighter-rouge">ColdStorageOptions</code> you must remove it as it is not currently supported</li>
<li>If there are any Fn::GetAtt / !GetAtt references to your domain’s <code class="language-plaintext highlighter-rouge">DomainArn</code> (e.g. <code class="language-plaintext highlighter-rouge">!GetAtt MyDomain.DomainArn</code>), change these to instead use <code class="language-plaintext highlighter-rouge">!GetAtt MyDomain.Arn</code></li>
<li>Replace the domain’s logical ID and references to it, if needed due to naming conventions</li>
<li>Comment out any resources that are not currently in the existing stack (probably everything but the <code class="language-plaintext highlighter-rouge">AWS::OpenSearchService::Domain</code>)</li>
<li>Rename any output export names to be unique within the region, if needed (we will update references to these later)</li>
<li>Update the template <code class="language-plaintext highlighter-rouge">Description</code> and any comments to remove Elasticsearch references, if needed</li>
</ul>
<p>My new template looks like this:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Resources:
  ...
  MyOpenSearchDomain:
    DeletionPolicy: Retain
    UpdateReplacePolicy: Retain
    Type: AWS::OpenSearchService::Domain
    Properties:
      DomainName: mydomain
      ClusterConfig:
        InstanceCount: 3
        InstanceType: t3.medium.search
        DedicatedMasterEnabled: true
        DedicatedMasterType: t3.medium.search
        DedicatedMasterCount: 3
        ZoneAwarenessEnabled: true
        ZoneAwarenessConfig:
          AvailabilityZoneCount: 3
      EBSOptions:
        EBSEnabled: true
        VolumeSize: 20
        VolumeType: gp2
      EngineVersion: "OpenSearch_1.0"
  ...
</code></pre></div></div>
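<p>As an illustration of the <code class="language-plaintext highlighter-rouge">DomainArn</code> and export name adjustments from the list above, here is how an outputs section might change. The output and export names below are hypothetical examples, not taken from my actual template:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Before (Elasticsearch stack) - hypothetical names
Outputs:
  DomainArn:
    Value: !GetAtt MyElasticsearchDomain.DomainArn
    Export:
      Name: my-app-domain-arn

# After (OpenSearch stack) - the Arn attribute, and a unique export name
Outputs:
  DomainArn:
    Value: !GetAtt MyOpenSearchDomain.Arn
    Export:
      Name: my-app-opensearch-domain-arn
</code></pre></div></div>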
<h2 id="importing-the-new-stack">Importing the new stack</h2>
<p>Once you have confirmed the in-place upgrade has completed and have prepared your new template for import, we can import the new stack. You can check the Logs tab of your domain to confirm the upgrade has finished.</p>
<p><img src="/images/posts/os-7.png" alt="" /></p>
<p>Open the <a href="https://console.aws.amazon.com/cloudformation/">CloudFormation console</a> and use the <strong>Create stack</strong> > <strong>With existing resources (import resources)</strong> option. Upload your newly prepared template with the <code class="language-plaintext highlighter-rouge">AWS::OpenSearchService::Domain</code> resource within it.</p>
<p><img src="/images/posts/os-4.png" alt="" /></p>
<p>You’ll then be prompted for the domain name of the existing domain. Carefully copy this name from the OpenSearch console domain listing, or from the template if it is hardcoded. Proceed to give your new stack a name, which must be different from that of the existing stack, then finalize the stack creation. As your domain is only being imported, this process should only take a moment.</p>
<p><img src="/images/posts/os-3.png" alt="" /></p>
<p>After confirming the stack creation has been successful, uncomment any related resources in the template, such as CloudWatch alarms, and update your stack in-place to ensure these are put back in. I recommend also adding a tag to the stack at this point to make doubly sure that CloudFormation is aware of the resource and is targeting it correctly. You can optionally remove the <code class="language-plaintext highlighter-rouge">DeletionPolicy</code> and <code class="language-plaintext highlighter-rouge">UpdateReplacePolicy</code> attributes at this time; however, they are a good safety net against accidental deletions, so I recommend you leave them in place.</p>
<p><img src="/images/posts/os-5.png" alt="" /></p>
<h2 id="cleaning-up">Cleaning up</h2>
<p>We now have both the old and the new stack in place, so let’s get rid of the old one. You can ignore this part entirely if you did not have an existing stack prior.</p>
<p>If you have any stacks that are dependent on a CloudFormation output export from the original stack, you can now update them to point to the new export name you defined.</p>
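<p>For example, a dependent stack’s reference would change from the old export name to the new one. The export names below are hypothetical; substitute whatever names you chose earlier:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Before - importing from the old Elasticsearch stack (hypothetical export name)
DomainArn: !ImportValue my-app-domain-arn

# After - importing from the new OpenSearch stack
DomainArn: !ImportValue my-app-opensearch-domain-arn
</code></pre></div></div>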
<p>Once you have updated all of the export references, you may delete the older Elasticsearch stack. Before doing so, double-check that the template for the stack you are deleting contains only a single <code class="language-plaintext highlighter-rouge">AWS::Elasticsearch::Domain</code> resource, and that it has at least <code class="language-plaintext highlighter-rouge">DeletionPolicy: Retain</code> at the same level as <code class="language-plaintext highlighter-rouge">Type</code> and <code class="language-plaintext highlighter-rouge">Properties</code>. If you have missed any export references, the stack events will inform you of this during the deletion, and you can remediate before reattempting it.</p>
<p>Once the stack is deleted, you’re done.</p>
<p>I hope this has been a helpful resource for you to upgrade your domain. Reach out to me on Twitter at <a href="https://twitter.com/iann0036">@iann0036</a> if you liked this post. Happy searching!</p>Ian MckayRecommendations for Working with IAM - Permissions Boundaries and Conditions2021-05-06T00:00:00+00:002021-05-06T00:00:00+00:00https://onecloudplease.com/blog/iam<p><img src="/images/posts/iam-anniversary.png" alt="" /></p>
<p>To celebrate AWS Identity and Access Management (IAM)’s 10th anniversary, I talk about two powerful ways that you can limit access to Amazon Web Services (AWS); Permissions Boundaries and Conditions.</p>
<p>Using permissions boundaries and conditions is an effective way to limit access. By letting you set the maximum permissions for a user or role, permissions boundaries can be used for situations like granting someone limited permissions management abilities.</p>
<p>Conditions enable you to specify when a policy statement is enforced, providing fine-grained access through variables such as tag value, time, and IP address. Using these IAM features will help you in your pursuit of least privilege on AWS.</p>
<p><em>Read more on the AWS Partner Network Blog by <a href="https://aws.amazon.com/blogs/apn/top-recommendations-for-working-with-iam-from-our-aws-heroes-part-3-permissions-boundaries-and-conditions/">clicking here</a>.</em></p>Ian MckayCase of the doppelgänger AWS account2021-02-12T00:00:00+00:002021-02-12T00:00:00+00:00https://onecloudplease.com/blog/case-of-the-doppleganger-aws-account<p><img src="/images/posts/aws-double-headline.png" alt="" /></p>
<p>Recently, I discovered via a Twitter thread that a single address can be the root email on two AWS accounts at the same time. How is this possible? Could my account be compromised as a result?</p>
<p>Let’s take a look at the thread in question:</p>
<p><iframe id="twitter-widget-0" scrolling="no" frameborder="0" allowtransparency="true" allowfullscreen="true" class="" style="visibility: visible; width: 550px; height: 490px; display: block; flex-grow: 1;" title="Twitter Tweet" src="https://platform.twitter.com/embed/Tweet.html?dnt=false&embedId=twitter-widget-0&frame=false&hideCard=false&hideThread=false&id=1357347085150949382&lang=en&origin=https%3A%2F%2Fonecloudplease.com%2Fblog%2Fcase-of-the-doppleganger-aws-account&theme=light&widgetsVersion=889aa01%3A1612811843556&width=550px" data-tweet-id="1357347085150949382"></iframe></p>
<h2 id="how-is-this-possible">How is this possible?</h2>
<p>It turns out the thread is technically correct; however, a strict set of circumstances must be met for this to happen. It can only occur when you have:</p>
<ol>
<li>An AWS account that is linked to your Amazon.com retail account (which is something that could only occur before some time in 2017); and</li>
<li>Another AWS account that is <em>not</em> linked to an Amazon.com account, or another Amazon.com linked account</li>
</ol>
<p>There’s actually an easy way to check which account type you have. If you log into your account as the root user, navigate to the “My Account” dashboard and click the “Edit” button on the “Account Settings” section.</p>
<p><img src="/images/posts/aws-double-screen1.png" alt="" /></p>
<p>If the next page looks like this, you have an account that is linked to an Amazon.com account:</p>
<p><img src="/images/posts/aws-double-screen2.png" alt="" /></p>
<p>If the page instead looks like this, you have an account that is unlinked:</p>
<p><img src="/images/posts/aws-double-screen3.png" alt="" /></p>
<p>The only way to have the two AWS root accounts share the same email address is to change the Amazon.com linked account’s email address to that of the unlinked account. This works because the Amazon.com authentication system is unaware of unlinked AWS accounts and, for legacy reasons, the AWS authentication system will allow sign-ins via Amazon.com accounts. This kind of problem <a href="https://beanstalkcomputing.com/two-amazon-accounts-with-the-same-email-address-and-different-passwords/">doesn’t seem restricted</a> to just AWS either.</p>
<h2 id="changing-the-amazoncom-account-email-address">Changing the Amazon.com account email address</h2>
<p>To do this, we first log onto the Amazon.com account and navigate to the account settings.</p>
<p><img src="/images/posts/aws-double-screen4.png" alt="" /></p>
<p>From here, go to “Login & security” then hit “Edit” on the email address field. You’ll then be prompted to set a new email address, so we enter the email address of the root user of the unlinked AWS account. This then needs to be verified via an emailed OTP code (notably not your TOTP tokens). Finally, re-enter your password to confirm the change. You have now set both AWS root accounts to the same email address.</p>
<p><img src="/images/posts/aws-double-screen5.png" alt="" /></p>
<p>This becomes slightly more interesting if you happen to have two Amazon.com-linked accounts. If you attempt to change the email address of one to the other, you’ll actually see that there is a process to override / disable an existing Amazon.com account. As the Amazon.com login is used for root account access, in this case you would effectively lose access to one of your root accounts if you did this (so don’t!).</p>
<p><img src="/images/posts/aws-double-screen9.png" alt="" /></p>
<h2 id="but-what-now">But what now?</h2>
<p>Effectively, you have now made the root account for the Amazon.com linked AWS account inaccessible. The account will still function as normal, and IAM / SSO users can still log in and make changes; however, if you attempt to log in to the root account using its password, it will fail. This is because when both accounts are sharing the same email address, the unlinked account is the one that will be authenticated against, regardless of whether the passwords are different or the same.</p>
<p>You can log in to the root account of the unlinked AWS account using its password, and perform all functions as normal. You can also use the reset password feature on the email address, which will reset the password on the unlinked AWS account, leaving the Amazon.com linked account untouched.</p>
<p>If you have had your fun and want to go back to a normal world, you can simply change the Amazon.com account email address back to its original form to restore access to the root account for its associated AWS account.</p>
<h2 id="is-this-a-security-concern">Is this a security concern?</h2>
<p>No, even though it certainly feels like it. You’ll note that all the email ownership security controls still apply during this process, so to perform this you will need control of both email addresses involved, as well as any SMS / TOTP verification that is associated with the accounts.</p>
<p>If you feel you’d rather no longer be affected by <a href="https://twitter.com/QuinnyPig/status/1275137983109308419">the underpants problem</a>, you may be able to raise a support ticket with AWS Support to migrate your account from a linked to an unlinked state. There are some implications of this, but presumably the AWS Support team will be able to explain those to you. Do take care when doing this as there are stories of people having problems with their Amazon.com accounts, such as being unable to reset or remove MFA, after this process occurs.</p>
<h2 id="tldr">tl;dr</h2>
<p>Under special circumstances, two AWS accounts could share the same root email address but it is not a security problem and will probably never happen to you.</p>Ian MckayAccelerate infrastructure as code development with open source Former22020-11-12T00:00:00+00:002020-11-12T00:00:00+00:00https://onecloudplease.com/blog/building-and-maintaining-iac-tooling-for-the-aws-community<p><img src="/images/posts/former2.png" alt="" /></p>
<p>When I first started building on AWS, like most developers, I used the <a href="https://aws.amazon.com/console/">AWS Management Console</a>. After spinning up and tearing down <a href="https://aws.amazon.com/ec2/">Amazon Elastic Compute Cloud</a> (Amazon EC2) instances manually many times, I realized that I needed a better way, so I looked to implement my solutions in <a href="https://aws.amazon.com/cloudformation/">AWS CloudFormation</a>.</p>
<p>AWS CloudFormation is an infrastructure as code (IaC) service, which means I can use text files (JSON or YAML) to define a set of resources to be deployed without worrying about the semantics of how they are deployed. Using AWS CloudFormation helps with consistency and repeatability; however, it can take time to learn the syntax of the language and build the stack templates. Luckily, there is now an open source tool called Former2 to quickly and easily make the transition.</p>
<p>Former2 is an open source project that allows you to generate IaC templates (for example, AWS CloudFormation or <a href="https://www.terraform.io/">HashiCorp Terraform</a>) from the existing resources within your AWS account. In this article, I’ll explain how I created Former2, show how to use the tool, describe the challenges faced, and share my vision for the future.</p>
<p><em>Read more on the AWS Open Source Blog by <a href="https://aws.amazon.com/blogs/opensource/accelerate-infrastructure-as-code-development-with-open-source-former2/">clicking here</a>.</em></p>Ian Mckay