Fine-tuning Semgrep for Ruby Security: Pundit and SQL injection
In this blog post, we’ll go over the construction and tuning of a few Semgrep rules I created while looking at a Ruby on Rails application. Semgrep is a powerful code analysis tool, and while there are a fair number of community rules, the default rules don’t cover everything, and you get a lot more mileage when rules are customized to the codebase.
The first rule focuses on detecting a common API misuse pattern that can make Ruby applications susceptible to SQL injection. This is a generic rule and can be used without modification on most Rails applications. The second rule ensures the correct use of Pundit authorization in a Rails application on specified models. This is more of a “rule template” than a complete rule, and needs a bit of customization based on the codebase being scanned.
Before delving into the Pundit authorization check, let’s begin with the simpler rule tailored to identify potential SQL injection. This rule primarily targets the misuse of the sql_sanitize
family of operations, a bug pattern I’ve encountered multiple times in Ruby applications.
If you’re just looking for the rules, they are on https://github.com/s0rcy/semgrep-rules.
SQL injection in Ruby
The sanitize_sql
family of functions is a popular method of sanitizing untrusted input in Ruby applications, particularly in cases where simple model querying does not suffice and more complex SQL is needed. These functions sanitize an input string by escaping special characters, mitigating SQL injection.
The core function sanitize_sql
is an alias for sanitize_sql_for_conditions
, and I’ll use it as a representative for the family of vulnerable functions in the post. A typical expected use case, as per the rails API Docs, is as follows:
Accepts an array of SQL conditions and sanitizes them into a valid SQL fragment for a WHERE clause.
sanitize_sql_for_conditions(["name=? and group_id=", "foo'bar", 4]) # => "name='foo''bar' and group_id=4"
However, I feel some developers miss the neat little footgun in some of these functions which can lead to serious vulnerabilities:
This method will NOT sanitize a SQL string since it won’t contain any conditions in it and will return the string as is.
<sanitize_sql_for_conditions("name='foo''bar' and > group_id='4'") # => "name='foo''bar' and group_id='4'"
Essentially, using many functions from the sql_sanitize
family with just one parameter nullifies the sanitization process, and the string is returned as is with special characters unescaped. Here’s an example of how this misuse can occur in application code:
query = sanitize_sql("SELECT * FROM orders WHERE order_guid = #{order_guid}") # Vulnerable!
query = Orders.select("order_guid = '#{sanitize_sql(order_guid)}'") # Vulnerable!
sql = "SELECT * FROM users WHERE email = '#{params[:email]}'"
sanitized_sql = ActiveRecord::Base.__send__(:sanitize_sql, sql) # Vulnerable!
The safe way to safely sanitize parameters would be as follows (named parameters would also do):
query = sanitize_sql("SELECT * FROM orders WHERE order_guid = ?", order_guid) # Fixed
query = Orders.select(sanitize_sql("order_guid = ?", order_guid)) # Fixed
sql = "SELECT * FROM users WHERE email = ?"
sanitized_sql = ActiveRecord::Base.__send__(:sanitize_sql, [sql, params[:email]]) # Fixed
An observant reader might say that this is just another case of SQL injection due to string interpolation. This is true to an extent, and many bugs in practice might involve string interpolation eventually. Having encountered this issue in production code multiple times leads me to believe that the addition of a sanitization function could be misleading to both static analysis tools and human reviewers, causing this misuse to go unreported. Existing Semgrep community rules won’t catch most of these issues, for instance.
If sanitize_sql
is used in application code, it’s highly likely to be in a situation where untrusted data is being used to construct a SQL query. If the function doesn’t perform any sanitization, the underlying query would directly use the unsanitized data. This leads to the conclusion that if sanitize_sql
(or similar) is called with just one parameter, it is likely a SQL injection bug. This is a pattern we can detect with Semgrep!
The rule we’re aiming for should catch calls to sanitize_sql
and its kin, so we’ll group them into a catch-all regex $SQLSANITIZE
. We’ll then look for invocations of this function that have just one parameter. Here’s a preliminary attempt at the rule:
# Pattern fragment
- pattern: $SQLSANITIZE($X) # Function call with one parameter
- metavariable-regex:
metavariable: $SQLSANITIZE # Capture functions in $SQLSANITIZE regex
regex: ^(sanitize_sql|sanitize_sql_array|sanitize_sql_for_assignment|sanitize_sql_for_conditions|sanitize_sql_for_order)$
Note that sanitize_sql_like
is missing from the sanitize_sql
family regex above, as it accepts a single string without conditions and sanitizes it.
While this is a solid start, we can further refine it to catch more instances and reduce false positives. Our rule also captures arrays passed as a single argument, which form a significant portion of false positives. We’d still like to detect single-element arrays but filter out multi-element arrays. We can achieve this with a pattern-not-inside
expression.
# Rule fragment
- pattern-not-inside: $SQLSANITIZE([..., ...])
Given that Semgrep isn’t AST-aware, we need to manually add patterns to detect additional ways the code might call sanitize_sql
in Ruby, such as invoking via send
. Ultimately this is something best done as customized for the codebase being reviewed, taking into account programming idiosyncrasies. I’ve included a few generic patters in the completed rule below.
# Completed rule
rules:
- id: sanitize_sql
languages:
- ruby
message: sanitize_sql used with one arg, possible no-op
severity: WARNING
patterns:
- pattern-either:
- pattern: $SQLSANITIZE($X)
- pattern: $F(:$SQLSANITIZE, $X)
- pattern: $M::$SQLSANITIZE $X
- pattern-not-inside: $SQLSANITIZE([..., ...])
- pattern-not-inside: $F(:$SQLSANITIZE, [..., ...])
- pattern-not-inside: $M::$SQLSANITIZE [..., ...]
- metavariable-regex:
metavariable: $SQLSANITIZE
regex: ^(sanitize_sql|sanitize_sql_array|sanitize_sql_for_assignment|sanitize_sql_for_conditions|sanitize_sql_for_order)$
The latest rule is available at https://github.com/s0rcy/semgrep-rules.
Rule Tuning: The rule can be tuned to the codebase to reduce false positives and ensure accuracy. Do a standard grep search for $SQLSANITIZE functions in the codebase, and observe the count. Modify the rule to replace $X
with ...
to match anything, and remove the pattern-not-inside
clauses before running the modified rule. If the counts don’t match, the rule is missing some invocations and needs to be adjusted to the codebase.
Checking Pundit authorization
A great use for Semgrep is checking if expected patterns are followed around security sensitive operations. For instance, if a sensitive function call is followed by an authorization check before the current scope is complete. Pundit, a popular authorization framework for Rails applications (34k+ public dependents), happens to have an authorization check pattern very similar to this.
Let’s assume a Ruby on Rails web application designed to sell concert tickets. The application stores customer information, and allows retrieval of invoices via some API. Further, let’s assume that the application uses Pundit to enforce authorization controls.
The normal authorization pattern we’d expect to see in the code when a lookup is done against the database might be something like this:
# Invoice controller
def show
invoice = Invoices.find params[:invoice_id]
authorize invoice # Pundit authorization
return JSON.stringify(invoice)
A user-supplied parameter is used to look up an invoice, and the current session user’s permissions are checked against the object with the authorize
keyword. If the user does not have authorization to access invoice
, an authorization exception would be thrown.
A vulnerable pattern would be the absence of the authorize
check, which would allow callers of any authorization:
def show
invoice = Invoices.find params[:invoice_id]
return JSON.stringify(invoice)
There are of course a few other ways of calling Pundit, but this is a popular one. It is also a simple enough pattern to look for, and can be generalized as:
def fun
...
foo()
bar() # Ok
...
end
def moarfun
...
foo()
...
end # Not Ok: not followed by bar()
We can use this to find all invocations of a sensitive database query that are not followed by an authorization check.
First let’s define the pattern we are looking for.
$DATA = Invoices.find(...)
The ...
operator captures any number of arguments. $DATA
captures the pattern that stores the return value of the privileged call Invoices.find
. This becomes the main vulnerable pattern we are looking for.
We’re only interested in locations where a subsequent authorize
call isn’t made - so let’s define what a pattern with the authorize
call looks like, and we can remove these from matches of the first patter.
$DATA = Invoices.find(...)
...
authorize $DATA, ...
With these building blocks in hand, let’s put together our basic pattern block:
- patterns:
- pattern: |
$DATA = Invoices.find(...)
- pattern-not-inside: |
$DATA = Invoices.find(...)
...
authorize $DATA, ...
The pattern
key contains the pattern we are looking for, and from the results of matching that pattern, every result matching pattern-not-inside
is removed. This should leave us with only the cases where authorization is missing.
We can generalize the search for Invoices.find
with a pattern that matches classes with names $CLASSES
and methods $METHODS
that we might be interested in. Let’s assume there are two ActiveRecord models we’d like to check called SensitiveModel1
and SensitiveModel2
, and we want to capture methods that might return sensitive data.
This can be done using metavariable
regex pattern again, as below:
- metavariable-pattern:
metavariable: $PRIVCALL
pattern-either:
- patterns:
- pattern: $CLASSES.$METHODS
- metavariable-regex:
metavariable: $METHODS
regex: (find|find_*|where|first|last|select|all|pluck|joins)
- metavariable-regex:
metavariable: $CLASSES
regex: (SensitiveModel1|SensitiveModel2)
Putting these two together, we now have a rule that
- Identifies all data access to specified models
- Confirms that the request was authorized using Pundit
This can be a powerful check to run regularly as part of CI/CD, as it would provide strong assurances that every access to a sensitive model in the codebase has as associated authorization check. To put it another way, there should be no authorization or access control issues against these models if rule trigger no violations. I was able to find a significant number of IDORs and access control issues in a codebase I audited with this rule.
The full rule is on https://github.com/s0rcy/semgrep-rules, and is also shown below.
rules:
- id: find_missing_authorize
languages:
- ruby
message: Pundit authorization missing after database access
severity: WARNING
mode: taint
pattern-sources:
- pattern: params[...]
pattern-sinks:
- patterns:
- pattern: |
$DATA = $PRIVCALL(...)
- pattern-not: |
$DATA = $PRIVCALL(...)
...
authorize $DATA, ...
- metavariable-pattern:
metavariable: $PRIVCALL
pattern-either:
- patterns:
- pattern: $CLASSES.$METHODS
- metavariable-regex:
metavariable: $METHODS
regex: (find|find_*|where|first|last|select|all|pluck|joins)
- metavariable-regex:
metavariable: $CLASSES
regex: (SensitiveModel1|SensitiveModel2)
Rule Tuning: This rule isn’t useful in its unedited state, as $CLASSES
in particular needs to modified to reflect the models of the codebase. Hint: these may be identified by looking for Pundit policy files. $METHODS
currently consists of interesting model data access methods, but should be tuned as well. There are other considerations to think about; the codebase may call authorize differently, or an authorization check may have been done earlier making subsequent calls unnecessary in a context. Also of note, the codebase may be using a different Pundit authorization helper. This rule may require a bit more tuning but the payoff is substantial.
a tangent into find_by_sql
and binds
While scanning a few open source Ruby projects with the SQL injection rule, I found some interesting violations in the rails codebase. Notably, the first argument to find_by_sql
is passed “as is” as a single argument to sanitize_sql
in querying.rb
. In this case the finding is a false positive, as the argument could be an array. The rule doesn’t have context into what the single argument might be (though perhaps a future improvement?).
# https://github.com/rails/rails/blob/f404a949ce678dd41212a90521b3db6f3344a6ff/activerecord/lib/active_record/querying.rb#L62
def find_by_sql(sql, binds = [], preparable: nil, &block)
_load_from_sql(_query_by_sql(sql, binds, preparable: preparable), &block)
end
def _query_by_sql(sql, binds = [], preparable: nil, async: false) # :nodoc:
connection.select_all(sanitize_sql(sql), "#{name} Load", binds, preparable: preparable, async: async)
end
It did reveal something interesting to me, as I hadn’t realized that find_by_sql
accepts binds in two ways:
find_by_sql(sql, binds=binds)
: From the function signature, pass the SQL statement as the first parameter, and an array of binds as the second parameter. Binds are not sanitized, and the database is responsible for constructing the final SQL statement with the binds.find_by_sql([sql, binds], binds=[])
: The SQL statement and binds are combined in a array and passed as the first parameter, with an empty binds argument. In this mode, rails usessanitize_sql
to sanitize binds and construct the final SQL statement, and passes this string with no binds to the database connection. This pattern is reflected in the Rails example code and most example code found online.
# Option 1: find_by_sql(sql, binds=binds)
railsbox(dev)> @users = User.find_by_sql('SELECT * FROM users where name = ?', ['Jim'])
User Load (0.1ms) SELECT * FROM users where name = ? [[nil, "Jim"]]
railsbox(dev)> @users = User.find_by_sql('SELECT * FROM users where name = ?', ["Ji'm"])
User Load (0.1ms) SELECT * FROM users where name = ? [[nil, "Ji'm"]]
# Option 2: find_by_sql([sql, binds], binds=[])
railsbox(dev)> @users = User.find_by_sql(['SELECT * FROM users where name = ?', 'Jim'])
User Load (0.1ms) SELECT * FROM users where name = 'Jim'
railsbox(dev)> @users = User.find_by_sql(['SELECT * FROM users where name = ?', "Ji'm"])
User Load (0.1ms) SELECT * FROM users where name = 'Ji''m'
Even though binds are not sanitized by rails in the first option, it doesn’t result in immediate SQL injection. The values are passed separately and don’t modify the SQL statement. Assuming the receiving database handles binds appropriately, it shouldn’t be a problem - at least there didn’t see an issue with SQL databases I tested.
There may be some exotic databases or conditions where this could be exploitable or relevant, particularly if binds are incorrectly handled by a database. The first path which includes sanitization is likely preferred for the majority of cases. This gives us a good secure coding best practice to check for - binds should preferably be passed to find_by_sql
in the first parameter as an array to benefit from sanitization, instead of passing them separately as the second parameter (unless there is reason). I wrote up a lazy semgrep rule to detect uses of find_by_sql
with more than one parameter to mark the issue for manual review, as this should be an uncommon usage.
rules:
- id: find_by_sql_multiple_args
languages:
- ruby
message: find_by_sql used with more than one arg, review how binds are passed/sanitized
severity: INFO
patterns:
- pattern: $F.find_by_sql($X, ...)
- pattern-not-inside: $F.find_by_sql([...])
afterthoughts
Running Semgrep with the default and community rules is easy and great for low hanging fruit, but writing custom rules or customizing the community rules to the codebase leads to significant improvement in results. The rule language is easy enough to grok, and reading the community rules or some of the links below is a great way to get familiar with rule syntax. Developing a small but customized set of checks for the critical security patterns in a codebase can go a long way in providing strong assurances.
There are also limitations to Semgrep that I occasionally hit. Some of these had to do with my skill level with semgrep rule writing, which I tried to augment with GPT tools, but others are inherent with using a non-AST aware tool. While it’s easy to quickly develop rules, taking care of edge cases can be painful, and it’s tempting to manually review or exclude outliers when your rule feels “good enough”.
I did have a thought here - say an application has 120 instances of one way of using Pundit authorization, and 15 instances of another pattern (perhaps a different team/developer). Do the tooling/rules need to be extended to capture both cases, or should all code using a particular model be required to follow a specific authorization pattern? Requiring it for new code is an easy option, though existing callers still need to be accounted for. These requirements may cause developer friction and overhead, but the upside is being able to definitively say that “there are no access control issues against these models” every time the rule runs in the CI/CD pipeline. An interesting product security investment to consider, as it has the potential to remove a bug class from the application.