In Code Review Part II, we considered questions to ask performing a code review. Our discussion focused on three categories:
Functional correctness
Design and best practices
Readability
In this installment, we'll wrap things up by considering a few more categories:
Performance
Security
Testing
Miscellaneous
As we go through, remember that this isn't a checklist. Some questions will be more relevant than others, or just plain more interesting. Pick whatever works for you, the project, and the specific PR.
Performance
We should ensure that the code runs efficiently, without using more resources than necessary. Sometimes we need to make tradeoffs, the most common being the space-time tradeoff, but usually we can achieve a decent balance. Here are some things to look for.
Is the same data being regenerated inside a loop?
This is hands-down the most common performance issue I've seen. Consider this code addition.
for (let job in jobs) {
let servers = getServerDefinitions();
assign(job, servers);
}
The call to getServerDefinitions()
is independent of the specific job. We should run it once outside the loop and reuse the result.
let servers = getServerDefinitions();
for (let job in jobs) {
assign(job, servers);
}
Now getServerDefinitions()
might be quick (eg. O(1) or O(servers) where the expected number of servers is small). In this case, the code could ship either way without making a performance difference.
But consider these possible futures:
The function is currently O(1), but six months from now we decide server definitions need to be more dynamic and it becomes O(servers).
The function is currently O(servers) with a small number of servers, but as time passes the number of servers increases rapidly.
This second possible future is common with successful startups. It can be difficult to predict what is going to grow and how. An assumption made before you have product-market fit (eg. The number of servers is small, the number of seats for any given customer is small, etc) might turn into a drastically different reality by the time you get your Series B.
Are there calls to a database?
Working well with databases is a big topic in itself. For now, let's just say that it can be easy to accidentally put too much load on a database. Sometimes it’s because the code isn't caching responses, other times because abstraction layers make it difficult to tell what data is being fetched. I've seen production databases get taken out from expensive, low-use queries suddenly becoming high-use queries.
Are there network calls?
Any time your application calls out to another service you need to know what will happen when that service returns slow responses or errors or goes down entirely. How will your application handle it? Is there retry logic (preferably with exponential backoff)? What timeouts does the code use, are they realistically set, and what will result if the timeouts occur?
Is there the possibility of a memory leak?
Many of us are working in managed memory languages now, where this is less of a concern. For those programming in languages without this support, or who depend on libraries that do, consider whether all allocated memory is being disposed of. Connections and streams are common leak sources.
Does it lock shared resources?
Pay attention to locks. Is the lock scoped to the narrowest possible scope? Is there the potential for deadlock?
Security
We should consider some of the most common security risks for this application. The OWASP Top Ten is a great reference for these. Here are some recurring issues I've seen in reviews.
Are correct authorization checks present?
Authorization is “Does this user have sufficient permissions?”
If there is an API endpoint
DELETE /documents/{documentId}
Make sure that the logged-in user owns that document before deleting it.
Is the author rolling their own hashing, encryption, or sanitization?
It’s 2023 and I still see folks writing their own versions (or copying off StackOverflow or from an LLM). There are libraries for these things for a reason and it is very important that you use them.
Seeing functions like these should be a huge red flag.
def hash(input_string):
hash_value = 0
for i, char in enumerate(input_string):
hash_value += (i + 1) * ord(char)
return hash_value % 10007 # 10007 is a prime number
def xor_encrypt(plain_text, key):
encrypted_text = ''.join(chr(ord(c) ^ key) for c in plain_text)
return encrypted_text
def xor_decrypt(encrypted_text, key):
plain_text = ''.join(chr(ord(c) ^ key) for c in encrypted_text)
return plain_text
def html_sanitization(input_html):
input_html = input_html.replace('<', '<').replace('>', '>')
allowed_tags = ['<b>', '<i>', '<u>', '</b>', '</i>', '</u>']
for tag in allowed_tags:
input_html = input_html.replace(tag, tag.replace('<', '<').replace('>', '>'))
return input_html
Is user-supplied data being rendered as HTML?
Cross-site scripting (XSS) is when a user inputs a value that is later directly embedded as HTML for another user. There are many bad things you can do with this. I love React’s naming “dangerouslySetInnerHTML
” to make it abundantly clear that you need to be careful with this.
function Comment({ text }) {
return <div dangerouslySetInnerHTML={{ __html: text }} />;
}
If you absolutely have to do this as part of your feature set, make sure you are using a good sanitization library.
Is user-supplied data being concatenated to a database query?
SQL injection (and the equivalent for non-SQL systems) is when a user inputs a value that is directly embedded in a query without parameterization. This can allow the user to perform arbitrary actions on your database.
query = `SELECT * from documents WHERE id='${id}'`;
See also: the classic XKCD comic on the subject.
Long story short, always remember that data supplied by the end user cannot be trusted!
Testing
We should ensure that the new code is adequately tested. This could mean adding new automated tests, running existing automated tests, and/or manually testing. Different teams have vastly different expectations around testing. Be sure to adjust your own expectations accordingly.
Are the tests performed at the right level?
If the testing was all done manually, maybe some of it should be automated. If an integration test was added, check if a faster, simpler unit test would be sufficient.
Are the tests easily understandable?
It should be clear what is being tested and why. Automated tests can be brittle and these tests should be easy to update later as the product evolves.
Can you think of additional useful cases?
Put on your testing hat and consider what else should have been tested.
Miscellaneous
A few more came to mind but didn't cleanly fit into my simple categorization.
Should information be logged to telemetry?
If this project uses a telemetry or monitoring system, consider if additional information should be logged. Think about what information we might want to have when investigating a production issue. Think about what information product management will want when deciding what features to invest in.
Should this change be controlled by a feature flag?
If our system supports feature flags, consider if we need a way to disable or adjust this change in production without a rollback.
Have user-facing messages been checked with the product owner?
Depending on your team, you might have a Product Manager or Designer or UX person who is responsible for user-facing text. Make sure messages have been approved and call out any messaging that sounds off.
Does the change require edits to public documentation?
Your company’s Customer Success, Sales, and Marketing teams almost certainly have documentation covering how specific features work and what the UI looks like. Be proactive that they aren’t providing customers with outdated information.
Were third-party libraries added?
There are several things that can go wrong with an added dependency
The dependency isn’t actually used by the code. Sometimes dependencies added during exploratory work are never removed.
The dependency has an inappropriate license for the project. It may only allow noncommercial use or be a copyleft license.
The dependency is no longer maintained and is a maintainability risk for the project.
Once I heard about a potentially useful library and asked a teammate to investigate it. They came back with a list of pros and cons and a recommendation that we use the library. Great! They went and did the integration work and sent the PR my way.
The first thing I see is that the dependency looks unusual. Instead of
"library-name": "^1.5.0"
I see
"library-name": "https://cdn.libraryName.com/libraryName-1.5.0/libraryName-1.5.0.tgz"
Looking into it, the library maintainer had some kind of disagreement with NPM and was hosting newer versions on their own CDN.
There was no way we were shipping with a dependency on some random dude’s CDN. My teammate had done all this work that went to waste. I was sad.
Merge and Move On
What have we forgotten in our discussion about code reviews? Is there something else you like to look for? Or stories of something important being caught (or missed!) in review?
After this week we'll move on to another topic. Thanks for sticking with us. If there is a particular topic you'd like us to get into in the coming weeks, let us know in a comment or email reply!