3 December 2024
13 Min. Read
Understanding Feature Flags: How Developers Use and Test Them
Let’s get started with a quick story:
Imagine you’re a developer, and you’ve shipped a new feature after testing it well. You sigh with relief. But too soon: you start seeing your PagerDuty console or Prometheus Alertmanager buzzing with unexpected spikes in error rates, endpoint failures, and container crashes. What is going wrong?
Now you wonder whether you tested this new feature enough, whether you missed an edge case or an obvious scenario in the hurry to get the feature live.
But you tested it thoroughly once locally before committing, and then again when you raised the PR. How did you miss these obvious failures?
Oh, that’s the issue. The real test of a new feature happens in front of real users, who use the app in different, unthinkable ways that are hard to replicate in a controlled environment like dev or staging. Besides, the risk of deploying a new (maybe incomplete) feature can be minimised if it is released to a smaller group of users rather than everyone at once, which delivers real feedback without the looming risk.
So, what’s the solution here?

Feature flags originated as a solution to several challenges in software development, especially in the context of large, complex codebases. In traditional development, new features could only be developed in separate branches and merged when complete, leading to long release cycles. This created bottlenecks in the development process and sometimes even introduced risks when deploying large changes.
What are Feature Flags?
Feature flags are conditional statements in code that control the execution of specific features or parts of a system. They allow developers to turn features on or off dynamically without changing the underlying code.
Flags can be applied to:
New Features: Enabling or disabling new functionality during development or A/B testing.
Release Control: Gradually rolling out features to users (e.g., for canary releases).
Performance Tuning: Toggling between performance configurations or optimizations.
Security: Disabling certain features during security incidents or emergency fixes.

What does a Feature Flag look like?
A Feature Flag is typically implemented as a conditional check in the code, which determines whether a specific feature or behavior should be enabled or disabled.
Simple example of a feature flag:
boolean isNewFeatureEnabled = featureFlagService.isFeatureEnabled("new-feature");
if (isNewFeatureEnabled) {
    // Execute code for the new feature
    System.out.println("New feature is enabled!");
} else {
    // Execute legacy behavior
    System.out.println("Using the old feature.");
}
What does a complex feature flag look like?
Feature flags can also be more complex, such as targeting a specific group of users or gradually rolling out a feature to a percentage of users.
let user = getUserFromContext();
if (featureFlagService.isFeatureEnabledForUser("new-feature", user)) {
    // Activate feature for specific user
    console.log("Welcome, premium user! Here's the new feature.");
} else {
    // Show default behavior
    console.log("Feature is not available to you.");
}
The flag is essentially a key-value pair, where the key represents the name of the feature and the value dictates whether it's active or not.
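As an illustrative sketch of that key-value idea (the `FlagStore` class and its methods are hypothetical, not a real SDK API), flags can be modeled with a simple in-memory map:

```typescript
// Minimal in-memory flag store: a key-value map from flag name to state.
// Names here (FlagStore, isEnabled) are illustrative, not a vendor SDK.
type FlagValue = boolean;

class FlagStore {
  private flags = new Map<string, FlagValue>();

  set(name: string, value: FlagValue): void {
    this.flags.set(name, value);
  }

  // Unknown flags default to "off" so missing config fails closed.
  isEnabled(name: string): boolean {
    return this.flags.get(name) ?? false;
  }
}

const store = new FlagStore();
store.set("new-feature", true);
console.log(store.isEnabled("new-feature")); // true
console.log(store.isEnabled("unknown-flag")); // false
```

Real flag services add persistence, targeting rules, and remote updates on top of this basic lookup, but the key-to-value check at the call site stays the same.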
Who uses feature flags?
Feature flags are integrated directly into the code, so their setup requires a development or engineering team to configure them within the application.
Consequently, software developers are often the primary users of feature flags for controlling feature releases.
✅ They also facilitate A/B testing and experimentation, making it possible to test different versions of a feature and make data-driven decisions.
✅ Gradual rollouts allow features to be released to internal users, then beta testers, and finally everyone, with the option to quickly toggle the feature off if issues arise.
✅ Feature flags enable developers to work directly in the main branch without worrying about conflicts, reducing merge headaches.
✅ They also optimize CI/CD workflows by enabling frequent, small deployments while hiding unfinished features, minimizing the risks associated with large, infrequent releases.
What results can devs in FinTech achieve by using feature flags?
We’re specifically talking about banking apps here, since they hinge on fast, reliable, and safe software delivery. Yet many banking institutions are slow to change, not for lack of motive, but because archaic infrastructure and legacy code stand in the way.
Companies like Citibank and Komerční Banka have successfully updated their systems by using feature flags to ensure security and smooth transitions.
Komerční Banka releases updates to non-production environments twice a day and has moved 600 developers to its New Bank Initiative.
Alt Bank shifted from a monolithic system to microservices and continuous deployment, connecting feature flags to both their backend and mobile app.
Rain made it easier for their teams by removing the need to manually update configuration files. Now, they can control user segments and manage feature rollouts more easily.
Vontobel increased development speed while safely releasing features every day.
How Do Feature Flags Function?
Toggle at Runtime: Feature flags act as switches in your code. You can check if a flag is enabled or disabled and then decide whether or not to execute certain parts of the code.
It's like adding a conditional if check around a feature you don’t want to expose yet.
Dynamic Control: Flags can be managed externally (e.g., via a dashboard or config file) so they can be flipped without deploying new code.
Granular Rollouts: Feature flags can be set per-user, per-region, or even per-application version. You can roll out a feature to a small subset of users or to all users in a specific region.
Remote Flags: Some flags can be controlled remotely, using a feature flag service or API. This lets teams update flags without needing to touch the code.
Flags as Variables: Under the hood, flags are just boolean variables (or maybe more complex types, like integers or strings). They're checked at runtime to control behavior—just like how environment variables work for config, but with the added flexibility of toggling things at runtime.
Gradual Rollout: Instead of flipping a feature on for everyone all at once, you can roll it out incrementally—first for internal devs, then beta testers, then a few power users, and eventually, the entire user base. This reduces risk by catching issues early, before the feature goes full-scale.
This means less downtime, fewer bugs in production, and faster iterations. Feature flags are like cheat codes for managing releases: flexible, fast, and low-risk.
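The gradual-rollout idea above is commonly implemented by hashing each user ID into a stable bucket from 0 to 99 and enabling the flag for buckets below the rollout percentage. A minimal sketch (function names and the hash are illustrative, not any particular vendor's SDK):

```typescript
// Percentage rollout sketch: hash the user ID into a bucket 0-99 and
// enable the flag for users whose bucket falls below the percentage.
function rolloutBucket(userId: string): number {
  let hash = 0;
  for (const ch of userId) {
    hash = (hash * 31 + ch.charCodeAt(0)) >>> 0; // simple deterministic hash
  }
  return hash % 100;
}

function isEnabledFor(userId: string, rolloutPercent: number): boolean {
  return rolloutBucket(userId) < rolloutPercent;
}

// The same user always lands in the same bucket, so raising the rollout
// from 5% to 50% keeps early users enabled while adding new ones.
console.log(isEnabledFor("user-42", 100)); // true: 100% enables everyone
console.log(isEnabledFor("user-42", 0));   // false: 0% enables no one
```

The determinism matters: a user should not flicker between old and new behavior across requests, which is why rollouts bucket on a stable identifier rather than on a random number per request.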
Top 5 Tools for Feature Flag Services
Feature flags are crucial tools for managing feature deployment and testing in modern development environments. Let’s discuss the top 5 feature flag services to help you get started.
| Feature | LaunchDarkly | Split.io | Flagsmith | Unleash | Optimizely |
| --- | --- | --- | --- | --- | --- |
| Ease of Setup | Easy, with quick integration | Easy for small projects, moderate for enterprise | Moderate, documentation varies | Can be complex due to open-source nature | Straightforward for experienced teams |
| User Interface | Highly intuitive and user-friendly | Clean, but can be confusing for new users | Functional but lacks polish | Basic, less intuitive | Polished and user-focused |
| Custom Rule Capabilities | Highly flexible with custom rules | Good, but less flexible than LaunchDarkly | Limited to simple rules | Mostly basic, some advanced features in paid versions | Very sophisticated, great for complex setups |
| Client-Side Performance | Very efficient, minimal latency | Efficient, with good SDK performance | Moderate, depending on setup | Can vary, self-hosting impacts performance | High-performance, especially in mobile environments |
| Adaptability to Complex Environments | Best for highly dynamic environments | Good, requires some custom setup | Not ideal for very complex setups | Varies with installation | Excellent for multi-platform environments |
| Scalability | Handles scaling seamlessly | Scales well, some planning needed | Can struggle in large-scale implementations | Scaling can be challenging in self-hosted | Designed for large-scale enterprises |
| Update Frequency | Constant updates with new features | Regular updates, sometimes slower | Infrequent updates, depends on community | Infrequent, open-source pace | Regular, innovation-focused updates |
LaunchDarkly
LaunchDarkly offers powerful real-time updates, granular targeting, robust A/B testing, and extensive integrations. It’s ideal for large teams with complex deployment needs and supports a full-feature lifecycle.
Pricing: Subscription-based with custom pricing depending on usage and team size.
Split.io
Split.io excels in feature experimentation with A/B testing, detailed analytics, and easy-to-use dashboards. It integrates well with popular tools like Datadog and Slack and supports gradual rollouts.
Pricing: Subscription-based, with custom pricing based on the number of flags and users.
Flagsmith
Flagsmith is open-source, providing the flexibility to self-host or use its cloud-hosted version. It supports basic feature flagging, user targeting, and simple analytics, making it ideal for smaller teams or those wanting more control.
Pricing: Freemium model with a free tier and subscription-based plans for larger teams.
Unleash
Unleash is an open-source tool that offers full flexibility and control over feature flagging. It has a strong developer community, supports gradual rollouts, and can be self-hosted to fit into any tech stack.
Pricing: Open-source (self-hosted, free), with premium support and cloud-hosted options available for a fee.
Optimizely
Optimizely is robust for feature experimentation and A/B testing, with excellent support for multivariate testing. It provides advanced user targeting and detailed analytics, making it a good choice for optimizing user experiences.
Pricing: Subscription-based, with custom pricing depending on the scale of experimentation and features required.
Why Is Testing Feature Flags Crucial?
Testing feature flags is absolutely crucial because, without it, there’s no way to ensure that toggles are working as expected in every scenario.
Devs live in a world of multiple environments, users, and complex systems, and feature flags introduce a layer of abstraction that can break things silently if not handled properly.
Imagine pushing a new feature live, but the flag’s logic is broken for certain user segments, leading to bugs only some users see, or worse, features that should be hidden are exposed.
You can’t afford to let these flags slip through the cracks during testing. Automated tests are great, but they don’t always account for all the runtime flag states, especially with complex rules and multi-environment setups. Feature flags need to be thoroughly tested in isolation and within the larger workflow—checking flag toggling, multi-user behavior, performance impact, and edge cases.
If a flag is misbehaving, it can mean the difference between smooth rollouts or catastrophic rollbacks. Plus, testing feature flags helps catch issues early—before they make it to production and cause unplanned downtime or customer frustration. In short, feature flags might seem simple but testing them is just as important as testing the features they control.
Problems with Testing Feature Flags
Testing feature flags can be a real pain in the neck.
➡️ For one, there’s the issue of environment consistency—flags might work perfectly in staging but fail in production due to differences in user data, network conditions, or backend services.
➡️ Then, there’s the complexity of flag states—it’s not just about whether a flag is on or off, it’s about testing all possible combinations, especially when dealing with multiple flags interacting with each other. If flags are linked to user-specific data or settings (like targeting only a subset of users), testing each permutation manually can quickly spiral out of control.
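To see why the permutations spiral, here is a small sketch (the helper is hypothetical, not a specific testing framework) that enumerates every on/off combination of a flag set. With n flags there are 2^n combinations, each of which could feed a parameterized test:

```typescript
// Enumerate all on/off combinations of a set of flags so each can be
// fed to a parameterized test. With n flags this is 2^n cases, which
// is why exhaustively testing every combination stops scaling quickly.
function flagCombinations(names: string[]): Array<Record<string, boolean>> {
  const combos: Array<Record<string, boolean>> = [];
  const total = 2 ** names.length;
  for (let mask = 0; mask < total; mask++) {
    const combo: Record<string, boolean> = {};
    names.forEach((name, i) => {
      // Bit i of the mask decides whether flag i is on in this combo.
      combo[name] = Boolean(mask & (1 << i));
    });
    combos.push(combo);
  }
  return combos;
}

const combos = flagCombinations(["new-checkout", "dark-mode", "beta-search"]);
console.log(combos.length); // 8 combinations for 3 flags
```

Three flags already mean 8 test runs; ten flags mean 1,024. That exponential growth is the core argument for the selective strategy discussed later.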
The Current State of Testing Feature Flags
Currently, feature flags are being tested through a mix of unit tests (to check flag states in isolated components), integration tests (to ensure flags interact correctly across services), and E2E testing (to simulate real-world flag scenarios). But it’s often a manual setup at first, before implementing tools like LaunchDarkly, Split.io, or custom testing frameworks. Some teams write mocking tools to simulate different flag states, but these can get out of sync with the actual feature flag service.
➡️ Since states are involved here, manual testing is the most common way to test the toggling nature of these feature flags. But it is prone to errors and can’t scale. Devs often end up toggling flags on and off, but unless there's solid automation to verify those states under various conditions, things can easily break when flags behave differently across environments or after an update. Also, you can't always trust that a flag toggle will always trigger the expected behavior in edge cases (like race conditions or service outages).
➡️ Some devs rely on feature flag testing frameworks that automate toggling flags across test scenarios, but these are often too generic or too complex to fit the specific needs of every app.
➡️ End-to-end (E2E) testing is useful but can be slow, especially with dynamic environments that require flag values to be tested for different users or groups. Another challenge is testing the fallback behavior—when flags fail, do they default gracefully, or do they bring down critical features?
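Fallback behavior of the kind mentioned above can be exercised with a wrapper that returns a safe default when the flag service is unreachable. A sketch, where `fetchFlag` is a stand-in for a real network call to a flag service and is hard-coded here to simulate an outage:

```typescript
// Sketch of graceful fallback: if the flag lookup fails (outage,
// timeout), fall back to a safe default instead of crashing the feature.
async function fetchFlag(name: string): Promise<boolean> {
  // Stand-in for a real flag-service request; simulates an outage.
  throw new Error("flag service unreachable");
}

async function isEnabledWithFallback(
  name: string,
  fallback: boolean
): Promise<boolean> {
  try {
    return await fetchFlag(name);
  } catch {
    // Fail toward the known-safe legacy path rather than the new feature.
    return fallback;
  }
}

isEnabledWithFallback("new-feature", false).then((enabled) =>
  console.log(enabled ? "new path" : "legacy path") // prints "legacy path"
);
```

Testing this path deliberately (by injecting a failing flag lookup) answers the question posed above: when flags fail, do they default gracefully, or do they bring down critical features?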
Ultimately, testing feature flags properly requires continuous validation: automated checks for each flag change across different segments, environments, and use cases.
The Right Test Strategy for Teams working with Feature Flags
Many people mistakenly believe they must test every possible combination of feature flags in both on and off states. This approach quickly becomes impractical due to the sheer number of combinations. In reality, testing every flag combination isn't necessary—or even possible. Instead, focus on testing a carefully selected set of scenarios that cover the most important flag states.
Consider testing these key flag combinations:
Flags and settings currently active in production
Flags and settings planned for the next production release, including combinations for each new feature
States that are critical or have caused issues in the past
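That shortlist can be captured as a small, explicit test matrix instead of a full 2^n expansion. A sketch with hypothetical flag names:

```typescript
// Hypothetical test matrix: only the production state, the next-release
// state, and historically risky combinations are exercised, not every
// possible on/off permutation.
type FlagState = Record<string, boolean>;

const testMatrix: Array<{ label: string; flags: FlagState }> = [
  {
    label: "current production",
    flags: { "new-checkout": false, "beta-search": false },
  },
  {
    label: "next release",
    flags: { "new-checkout": true, "beta-search": false },
  },
  {
    label: "previously broken combo",
    flags: { "new-checkout": true, "beta-search": true },
  },
];

for (const scenario of testMatrix) {
  // In a real suite, each scenario would configure the flag service and
  // then run the same assertions against the app under test.
  console.log(`testing: ${scenario.label}`, scenario.flags);
}
```

Keeping the matrix as data makes it easy to review in code review and to extend when a new risky combination is discovered in production.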
✅ Testing in production
We all know unit tests and integration/E2E tests come in pretty handy for testing feature flags, but they all come with their own set of limitations. So, here we are going to discuss one workable approach that eliminates the need for you to:
➡️ prepare test data for every possible combination of feature flag “on” and “off” states
➡️ manage multiple environments, when you can reap the maximum benefit by testing in production
➡️ test in isolation, when you can test with the real traffic your application gets and gain more confidence in your feature states
Let's discuss the approach in detail:
The best way to test feature flags is to test them naturally alongside your regular code testing. This involves a record and replay approach where you set up your services with the solution SDK in your production environment (which receives real traffic, leading to higher confidence). The SDK records all incoming requests to your app and establishes them as a baseline response. This recorded version automatically captures all interactions between your services, database calls, and third-party API communications.
Here's how the testing works:
Let's say you've created two new feature flags that need testing. The SDK records a new version of your app with all the changes and compares it with the baseline version. It not only identifies discrepancies between versions but also helps you understand how your feature flags affect the user journey. This approach is both fast and scalable across multiple services:
Services don't need to remain active during testing
Workflows can be recorded and tested from any environment
All code dependencies are automatically mocked and updated by the system
This approach is ideal for gaining confidence and getting instant feedback that your code will work correctly when integrating all components together. Major e-commerce companies like Nykaa and Purplle, which rely heavily on feature flags, are successfully using this approach to maintain stable applications.
✌️ Simulate Real-World Conditions
✌️ Test Flag Combinations and Interactions using Integration tests
✌️ Automate Flag Testing with Continuous Integration
Do these goals align with what you want to achieve? If so, share your details with us, and we'll help you implement seamless feature flag testing.
Conclusion
When you’re working with feature flags, you’re almost certainly maintaining staging environments. The problem occurs when the tested build is promoted to the production environment and bugs or errors surface there. And that’s understandable: there are any number of conditions under each feature flag that can’t be tested properly in staging, since seeding and preparing test data to cover all the scenarios and edge cases is a challenge in itself.
Hence, a smart testing approach that tests the source code of feature flags naturally with the real traffic can be one solution to come out of this problem.