Designing Step-Projects to Validate Product Ideas Before Full Commitment

This skill teaches you to decompose product ideas into small, time-boxed experiments (step-projects) of no more than 10 weeks that test your riskiest assumptions with measurable outcomes, so you build evidence iteratively instead of betting everything on a big launch.

Define the riskiest assumption behind your product idea, then design a step-project of no more than 10 weeks that tests that single assumption with a concrete, measurable success metric. Start with the cheapest evidence method available, such as a fake door test or concierge MVP, and define your go/no-go criteria before the experiment begins. Each step-project should produce a clear result that either builds confidence in the idea or redirects your effort.

Outcome: You produce a concrete step-project brief that specifies the assumption being tested, the experiment format, the time box, the success metric, and the go/no-go threshold, enabling your team to validate or invalidate a product idea in weeks instead of months.

Synthesized from public framework references and reviewed for accuracy.

Product | Intermediate | 1-2 hours per step-project design

Prerequisites

  • Familiarity with the GIST Planning Framework's four layers (Goals, Ideas, Step-projects, Tasks)
  • A scored idea backlog with at least basic ICE scores (see prioritizing-ideas-with-ice-scoring)
  • Understanding of hypothesis-driven development or Lean Startup basics
  • Access to a product analytics tool or user research channel for measuring experiment results

Overview

Step-projects are the engine layer of the GIST Planning Framework. Where Goals set direction and Ideas propose how to get there, step-projects do the actual learning. Each one is a small, time-boxed experiment, never longer than 10 weeks, designed to test a single critical assumption behind an idea. The artifact you produce is a step-project brief: a one-page document that names the assumption, describes the experiment, sets the time box, and defines the metric and threshold that will determine whether the idea deserves further investment.

The distinction between a product manager and a project manager becomes especially clear at this layer. A project manager would take the idea and build a delivery plan with milestones, dependencies, and resource allocations. A product manager designs a step-project that answers a question first. Should we build this at all? Will users actually change their behavior? Is the technical risk manageable? Step-projects exist because most ideas are wrong, and full-scale delivery of a wrong idea is the most expensive mistake a product team can make. By framing work as a sequence of experiments, you convert vague conviction into structured evidence.

The concept borrows heavily from the Lean Startup's build-measure-learn loop, but it adds structure that the Lean Startup model leaves ambiguous. Specifically, step-projects enforce a maximum duration, require a pre-committed success metric, and are sequenced so that each one addresses the next-riskiest assumption. This prevents the common failure mode where teams run experiments endlessly without converging on a decision. A well-designed step-project ends with a binary outcome: either the assumption holds (move to the next step-project for this idea) or it fails (pivot, rethink, or kill the idea). The concrete output of this skill is a completed step-project brief that your team can immediately decompose into daily tasks and begin executing in the next sprint cycle.

Step-projects also create a natural cadence for your planning process. While goals are reviewed quarterly and ideas are scored and banked continuously, step-projects operate on a 2-10 week rhythm. This cadence integrates naturally with how cross-functional teams already work, making it possible to run multiple step-projects in parallel across different ideas without overwhelming any single team. For more on how these cadences interlock, see managing multi-cadence planning cycles.

How It Works

The core mental model behind step-projects is assumption risk ordering. Every product idea is really a bundle of assumptions: users have a certain problem, they will adopt a certain solution, the solution is technically feasible, it will move the business metric we care about, and so on. Step-projects work by identifying which assumption carries the most risk (meaning the highest combination of uncertainty and consequence if wrong), then designing the cheapest possible experiment to test that assumption, and finally defining what evidence would make you confident enough to move forward.

This is fundamentally different from how traditional roadmaps work, and it is also where the product manager vs project manager distinction becomes most operationally visible. A project manager sequences work by delivery dependencies: what must be built first so the next thing can be built on top of it. A product manager designing step-projects sequences work by learning dependencies: what must be validated first so the next investment is not wasted. The ordering principle is risk, not architecture.

The experiment format you choose depends on which assumption you are testing and how much evidence you need. The GIST Planning Framework encourages a spectrum of experiment types, ordered from cheapest to most expensive. At the low end, you have assessment experiments: competitive analysis, customer interviews, or data mining from existing analytics. These cost almost nothing and can be done in a few days. In the middle, you have simulation experiments: fake door tests, painted door tests, Wizard of Oz setups, or concierge MVPs where you manually deliver the experience to a handful of users. At the high end, you have MVP experiments: functional but minimal versions of the product that real users interact with at small scale. The key insight is that you should always start with the cheapest experiment type that could disprove your assumption. If five customer interviews reveal that nobody has the problem you are solving, you have saved yourself months of engineering work.

The success metric for each step-project must be defined before the experiment starts. This is not optional and it is not negotiable. Pre-committing to a threshold, such as 15% of landing page visitors clicking the signup button or 4 out of 5 interviewed users describing the problem unprompted, prevents the rationalization that happens when teams see ambiguous results and talk themselves into continuing. The threshold should be calibrated against the confidence level you need. Early-stage step-projects testing desirability might accept a lower bar (evidence of genuine interest). Later-stage step-projects testing scalability or retention need a higher bar because the next investment is larger.

Time-boxing to 10 weeks maximum serves two purposes. First, it forces scope discipline. If your experiment cannot produce a result in 10 weeks, you are probably testing too many assumptions at once, or you have designed an experiment that is really a product launch in disguise. Second, it creates a natural decision point. At the end of every step-project, you must decide: advance to the next step-project, pivot the idea, or kill it. This forced cadence prevents ideas from lingering in an ambiguous state where nobody wants to make a call.

Step-by-Step

  1. Step 1: Select the Idea and Review Its ICE Score

    Pull the highest-priority idea from your idea bank that has been scored using the ICE framework (Impact, Confidence, Ease). Review the score components individually, not just the aggregate. Pay special attention to the Confidence score because this is where step-projects do their work. A low Confidence score means there are unvalidated assumptions that need testing.

    If Confidence is already high (8 or above out of 10), ask whether a step-project is even necessary, as the idea may be ready for direct implementation.

    Tip: If the idea is too vague to state in one sentence, it is not ready for a step-project. Send it back to the idea bank for further definition. Step-projects test specific bets, not general directions.

  2. Step 2: List All Underlying Assumptions

    Break the idea apart into every assumption it depends on. Write each assumption as a falsifiable statement. Common categories include desirability assumptions ('Users want this'), feasibility assumptions ('We can build this within X constraints'), usability assumptions ('Users can figure out how to use this'), and viability assumptions ('This will move our business metric'). Aim for 5-15 assumptions per idea.

    Do not filter or prioritize yet. The goal of this step is completeness. A good technique is to ask each team member to write their assumptions independently on sticky notes or in a shared document, then merge and deduplicate. This surfaces assumptions that any single person might have overlooked.

    Tip: The most dangerous assumptions are the ones nobody writes down because everyone 'just knows' they are true. Actively probe for these by asking: 'What would have to be true about user behavior for this idea to work?' and 'What are we assuming about the market that we have never actually tested?'

  3. Step 3: Rank Assumptions by Risk

    For each assumption, score two dimensions on a simple 1-3 scale: uncertainty (how little evidence you have) and consequence (how much damage it causes if the assumption is wrong). Multiply the two scores. The highest-scoring assumption is your riskiest and the one your first step-project should target. If two assumptions tie, choose the one that is cheaper to test.

    Write the top 3 ranked assumptions in order, because the sequence of your step-projects will follow this ranking. If the riskiest assumption fails, you will not need to test the others for this idea.

    Tip: Do not let the team debate risk scores for more than 15 minutes. Use a quick dot-vote or silent scoring round to surface where genuine disagreement exists, then discuss only the disagreements. Spending two hours debating whether an assumption is a 2 or a 3 defeats the purpose of rapid iteration.
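
The ranking rule in this step can be sketched in a few lines of Python. This is a minimal illustration, not part of the GIST framework itself: the `Assumption` class, the `test_cost` field (a rough relative cost used only for tie-breaking), and every assumption except the first are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Assumption:
    statement: str
    uncertainty: int  # 1 (well evidenced) to 3 (no evidence)
    consequence: int  # 1 (minor damage if wrong) to 3 (idea fails if wrong)
    test_cost: int    # hypothetical relative cost to test; lower is cheaper

    @property
    def risk(self) -> int:
        # Risk score = uncertainty x consequence, as described in Step 3.
        return self.uncertainty * self.consequence


def rank_by_risk(assumptions):
    """Highest risk first; ties go to the assumption that is cheaper to test."""
    return sorted(assumptions, key=lambda a: (-a.risk, a.test_cost))


# Illustrative assumptions; only the first comes from the examples in this guide.
assumptions = [
    Assumption("Project managers trust AI to prioritize their tasks", 3, 3, test_cost=1),
    Assumption("The model can rank tasks accurately enough", 2, 3, test_cost=3),
    Assumption("Users will find the button in the task list", 3, 2, test_cost=1),
    Assumption("AI prioritization reduces churn", 2, 2, test_cost=4),
]

ranked = rank_by_risk(assumptions)
top_three = ranked[:3]  # the sequence your first step-projects would follow
for a in top_three:
    print(a.risk, a.statement)
```

The two assumptions that tie at a risk score of 6 are ordered by `test_cost`, so the cheaper one is tested first.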

  4. Step 4: Choose the Experiment Type

    Match the riskiest assumption to the cheapest experiment type that could disprove it. Use this hierarchy: for desirability assumptions, start with customer interviews (5-8 users) or a fake door test (a landing page or in-app button that measures interest before anything is built). For usability assumptions, use a prototype test with 5 users using Figma or paper prototypes. For feasibility assumptions, build a technical spike or proof of concept with a 1-2 week time box.

    For viability assumptions, run a small-scale concierge or Wizard of Oz test where you manually deliver the value proposition to 10-20 users and measure retention or willingness to pay. Document why you chose this experiment type and what alternatives you considered.

    Tip: Teams consistently over-invest in experiment fidelity. If you are testing whether users want a feature, you do not need a working prototype. A screenshot, a landing page, or even a well-crafted interview question can test desirability at a fraction of the cost.
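
The hierarchy in this step amounts to a lookup from assumption category to the cheapest experiment types. The sketch below encodes only the pairings named above; the dictionary and function names are hypothetical, and a real team would likely extend the table.

```python
# Cheapest-first experiment suggestions per assumption category (from Step 4).
CHEAPEST_EXPERIMENTS = {
    "desirability": ["customer interviews (5-8 users)", "fake door test"],
    "usability": ["prototype test with 5 users (Figma or paper)"],
    "feasibility": ["technical spike or proof of concept (1-2 week time box)"],
    "viability": ["concierge or Wizard of Oz test (10-20 users)"],
}


def suggest_experiments(category: str) -> list[str]:
    """Return the suggested experiment types for an assumption category."""
    key = category.strip().lower()
    if key not in CHEAPEST_EXPERIMENTS:
        known = ", ".join(sorted(CHEAPEST_EXPERIMENTS))
        raise ValueError(f"Unknown category {category!r}; expected one of: {known}")
    return CHEAPEST_EXPERIMENTS[key]


print(suggest_experiments("desirability"))
```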

  5. Step 5: Define the Success Metric and Threshold

    The metric must be directly observable and tied to the assumption being tested. If you are testing desirability, the metric might be signup rate, click-through rate, or unprompted problem mention rate in interviews. If you are testing feasibility, the metric might be response time, error rate, or engineering hours to build. The threshold should be a number you commit to before running the experiment.

    Write this threshold down and share it with stakeholders.

    Tip: Set your threshold slightly above the break-even point for the next investment, not at the level you hope for. If you need 10% conversion to justify building the full feature, set the step-project threshold at 12-15% to account for the optimism bias that inflates small-sample results.
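
The arithmetic in the tip can be written out as follows. This sketch assumes the safety margin is applied as a relative uplift of 20% to 50% over break-even, which is one reading of the "10% becomes 12-15%" example; the function name and default margins are hypothetical.

```python
def threshold_range(break_even: float, low_margin: float = 0.2, high_margin: float = 0.5):
    """Return a (low, high) threshold range set above the break-even rate.

    The margins are relative uplifts over break-even (an assumption made
    for this sketch), meant to absorb optimism bias in small samples.
    """
    if not 0 < break_even < 1:
        raise ValueError("break_even must be a rate between 0 and 1")
    low = round(break_even * (1 + low_margin), 4)
    high = round(break_even * (1 + high_margin), 4)
    return low, high


low, high = threshold_range(0.10)
print(f"Break-even 10% -> commit to a threshold between {low:.0%} and {high:.0%}")
```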

  6. Step 6: Set the Time Box and Scope Guard Rails

    Choose a duration between 1 and 10 weeks. Most step-projects should fall in the 2-4 week range. Anything shorter than 2 weeks may not produce enough data, and anything longer than 6 weeks is a warning sign that you are trying to test too much. Write explicit scope boundaries: what is included in this step-project and, equally important, what is excluded.

    For example, 'This step-project includes building a clickable prototype and running 6 user tests.' These guard rails prevent scope creep, which is the single most common failure mode for step-projects.

    Tip: If the team argues that 10 weeks is not enough, challenge them to split the step-project into two sequential experiments. Almost always, the first half of what they wanted to build is sufficient to answer the riskiest assumption, and the second half addresses a different assumption entirely.
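
The time-box rules in this step are simple enough to check automatically when a brief is drafted. The following is a minimal sketch; the function name and message wording are hypothetical, while the limits (1 to 10 weeks, warnings below 2 and above 6) come from the text above.

```python
def check_time_box(weeks: int) -> list[str]:
    """Validate a step-project duration and return any warnings.

    Raises ValueError outside the hard 1-10 week limit; returns warnings
    for durations that are allowed but suspicious.
    """
    if not 1 <= weeks <= 10:
        raise ValueError("A step-project must last between 1 and 10 weeks")
    warnings = []
    if weeks < 2:
        warnings.append("Under 2 weeks: may not produce enough data")
    if weeks > 6:
        warnings.append("Over 6 weeks: likely testing too much; consider splitting")
    return warnings


for duration in (1, 3, 8):
    print(duration, check_time_box(duration))
```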

  7. Step 7: Write the Step-Project Brief

    Consolidate everything into a one-page brief using this structure: Idea (one sentence), Assumption Being Tested (the riskiest assumption from step 3), Experiment Type (from step 4), Success Metric and Threshold (from step 5), Time Box (from step 6), Scope (what is in and what is out), Team and Resources (who is working on this and what budget is allocated), and Decision Rules (what happens if the metric is met, what happens if it is not, and what happens if results are ambiguous). The brief should be readable by anyone on the team or any stakeholder in under 3 minutes. Share the brief with the team and get explicit agreement on the decision rules before starting.

    Tip: Store all step-project briefs in a single location, such as a wiki page or shared folder, indexed by the parent idea. This creates an evidence trail that makes future prioritization decisions faster and prevents the team from re-testing assumptions that have already been validated or invalidated.
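
One way to keep briefs consistent is to treat the structure above as a record with required fields. The sketch below is an assumption about how a team might do that, not a prescribed template: the class, field names, and the team entry are hypothetical, and the threshold is deliberately left blank to show how an incomplete brief is flagged.

```python
from dataclasses import dataclass, fields


@dataclass
class StepProjectBrief:
    idea: str
    assumption: str
    experiment_type: str
    success_metric: str
    threshold: str
    time_box_weeks: int
    scope_in: str
    scope_out: str
    team_and_resources: str
    if_met: str
    if_not_met: str
    if_ambiguous: str

    def missing_fields(self) -> list[str]:
        """Names of fields left empty; a brief should not start with any."""
        missing = []
        for f in fields(self):
            value = getattr(self, f.name)
            if value is None or value == "":
                missing.append(f.name)
        return missing

    def render(self) -> str:
        """Render the brief as plain 'Label: value' lines for a wiki page."""
        return "\n".join(
            f"{f.name.replace('_', ' ').title()}: {getattr(self, f.name)}"
            for f in fields(self)
        )


brief = StepProjectBrief(
    idea="Add AI-powered task prioritization",
    assumption="Project managers trust AI to prioritize their tasks",
    experiment_type="Fake door test",
    success_metric="Share of users who click the button and join the waitlist",
    threshold="",  # not yet agreed, so the brief is incomplete
    time_box_weeks=2,
    scope_in="'AI Prioritize' button, explanatory modal, waitlist signup",
    scope_out="Any actual AI development",
    team_and_resources="1 product manager, 1 engineer (hypothetical)",
    if_met="Advance to the next-riskiest assumption",
    if_not_met="Pivot or kill the idea",
    if_ambiguous="Treat as not met",
)

print(brief.missing_fields())
print(brief.render())
```

Checking `missing_fields()` before kickoff is a cheap way to enforce the rule that the threshold and decision rules are agreed before the experiment starts.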

  8. Step 8: Run the Experiment and Collect Data

    Break the step-project into daily or weekly tasks using the process described in breaking step-projects into daily tasks. During execution, track the success metric continuously, not just at the end. Set up a simple dashboard or tracking spreadsheet that the entire team can see. Hold a brief weekly check-in (15 minutes maximum) to review the data so far and flag any risks to the experiment's validity, such as sample bias, technical issues, or scope creep.

    Do not change the success metric or threshold mid-experiment. If you discover that you are measuring the wrong thing, note it as a learning and commit to running a follow-up step-project with corrected metrics.

    Tip: Tracking the metric continuously is not the same as deciding continuously. Resist the temptation to make premature decisions from daily peeks at the results. Most experiments need a minimum sample size to produce reliable results. As a rough rule of thumb for quantitative metrics, wait until you have at least 100 observations before drawing conclusions. For qualitative metrics like interviews, 5-8 conversations usually reveal the dominant pattern.

  9. Step 9: Evaluate Results and Make the Go/No-Go Decision

    At the end of the time box, compare the actual metric to the pre-committed threshold. There are three outcomes. If the metric meets or exceeds the threshold, advance to the next step-project for this idea, which should target the next-riskiest assumption from your ranked list. If the metric clearly falls short, document the learning and either pivot the idea (change the approach while keeping the goal) or kill it and redirect resources to the next idea in your backlog.

    If the results are ambiguous, meaning the metric is close to the threshold but not clearly above or below, refer to your pre-committed decision rules. In most cases, ambiguous results should be treated as a failure, because if the signal is not clear in a controlled experiment, it will be even weaker in the noise of a full-scale launch. Update the idea's ICE Confidence score based on what you learned.

    Tip: Hold a 30-minute retrospective after each step-project. Focus on two questions: 'What did we learn about the idea?' and 'What did we learn about how we run experiments?' The second question compounds over time and makes every future step-project faster and sharper.
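
The three outcomes described in this step can be expressed as a small decision function. This is a sketch under stated assumptions: the names are hypothetical, and the optional `gray_zone_floor` stands in for a pre-committed gray-zone rule. When no gray zone was defined, anything below the threshold is treated as a failure.

```python
from enum import Enum
from typing import Optional


class Decision(Enum):
    ADVANCE = "advance to the next-riskiest assumption"
    GRAY_ZONE = "follow the pre-committed gray-zone action"
    PIVOT_OR_KILL = "pivot or kill the idea"


def decide(actual: float, threshold: float, gray_zone_floor: Optional[float] = None) -> Decision:
    """Compare the measured metric to the pre-committed threshold."""
    if actual >= threshold:
        return Decision.ADVANCE
    # A gray zone only exists if it was written into the brief beforehand.
    if gray_zone_floor is not None and actual >= gray_zone_floor:
        return Decision.GRAY_ZONE
    return Decision.PIVOT_OR_KILL


# Threshold of 15% signup rate, measured result of 11%.
print(decide(0.11, 0.15))                        # no gray zone defined
print(decide(0.11, 0.15, gray_zone_floor=0.10))  # gray zone of 10-15% defined
print(decide(0.17, 0.15))
```

Because the gray zone is an explicit argument, a "close enough" result cannot be promoted to a pass after the fact.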

Examples

Example: B2B SaaS team testing demand for an AI feature

A 12-person B2B SaaS company that sells project management software has an idea to add AI-powered task prioritization. The idea scored ICE 7/3/6 (high impact, low confidence, moderate ease). The team has 500 active customers and a 2-week sprint cycle. The product manager vs project manager distinction matters here because a project manager would start planning the engineering work, while the product manager needs to validate whether customers actually want AI-assisted prioritization.

The team listed 8 assumptions. The riskiest was 'Project managers trust AI to prioritize their tasks' (uncertainty: 3, consequence: 3, risk score: 9). They designed a 2-week step-project using a fake door test: they added an 'AI Prioritize' button to the task list view that, when clicked, showed a modal explaining the upcoming feature and asking users to join a waitlist. They set guard rails excluding any actual AI development.

After 2 weeks, 12% of users clicked and 45% joined the waitlist. The assumption was validated, so the team advanced to a second step-project: a 4-week concierge test where a team member manually re-prioritized tasks for 15 waitlist volunteers and measured whether they followed the AI suggestions. The Confidence score for the idea moved from 3 to 6 after the first step-project.

Example: Small startup testing a new market segment

A 3-person startup building an expense tracking app for freelancers has an idea to expand into small agency teams (5-15 people). The idea scored ICE 8/2/4. The team has very limited engineering resources and a $500 experiment budget. They need to validate whether agencies have the same pain points as freelancers before investing any development time.

The team designed a 3-week step-project using customer interviews. They recruited 8 agency owners through LinkedIn outreach and a small Reddit ad ($200 spend). The team conducted 45-minute interviews using a structured script that avoided leading questions. Results: 6 out of 8 described expense tracking as a major pain, but the pain was different, focusing on approval workflows and policy compliance rather than receipt capture.

Only 2 expressed willingness to pay $50/month for the team use case. The team pivoted the idea from 'freelancer expense tracking for agencies' to 'expense approval workflow tool for small agencies' and designed a follow-up step-project to test the revised value proposition with a landing page and mockups.

Example: Large product team testing a technical feasibility assumption

A 40-person product organization at an e-commerce company has an idea to offer real-time personalized pricing. The idea scored ICE 9/4/2. The low Ease score reflects uncertainty about whether the pricing engine can respond within 50ms at scale. The team includes a dedicated data science group and has access to a staging environment with production-like traffic.

The team identified 11 assumptions. Normally desirability would be tested first, but the feasibility risk was so severe (the entire idea depends on sub-50ms response times) that a failed technical spike would save months of user research. They designed a 3-week step-project: build a proof-of-concept pricing engine using a simplified model, deploy it to the staging environment, and run load tests simulating 10,000 concurrent requests. The scope explicitly excluded production deployment, A/B testing infrastructure, and any customer-facing changes.

The result: P95 latency was 72ms, well above the 50ms target. The team analyzed the bottleneck (database lookups for user history) and determined that caching could reduce latency by 40%, bringing estimated P95 to 43ms. They advanced to a second step-project: a 4-week engineering spike to implement the caching layer and re-run the load test. The Confidence score moved from 4 to 5, with the understanding that a second validation was needed before user-facing experiments.
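
The latency estimate in this example is straightforward arithmetic, written out below. The variable names are illustrative; the figures (72ms measured, 50ms target, 40% expected reduction) come from the example. The sketch also derives the minimum reduction the caching layer would need to deliver, which is a useful number to carry into the next step-project.

```python
measured_p95_ms = 72.0     # load test result from the first step-project
target_p95_ms = 50.0       # requirement the idea depends on
expected_reduction = 0.40  # team's estimate of what caching could save

# Estimated P95 if caching delivers the expected reduction.
estimated_p95_ms = measured_p95_ms * (1 - expected_reduction)

# Smallest reduction that would still meet the target.
required_reduction = 1 - target_p95_ms / measured_p95_ms

headroom_ms = target_p95_ms - estimated_p95_ms

print(f"Estimated P95 after caching: {estimated_p95_ms:.1f} ms")
print(f"Minimum reduction needed:    {required_reduction:.1%}")
print(f"Headroom below target:       {headroom_ms:.1f} ms")
```

The estimate leaves only a few milliseconds of headroom, which is consistent with the team's modest Confidence increase and its decision to re-run the load test rather than treat feasibility as proven.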

Example: B2C mobile app testing a retention mechanic

A fitness app with 50,000 monthly active users has an idea to add streak-based challenges (e.g., 'Work out 5 days this week to earn a badge'). The idea scored ICE 7/5/7. The product manager is concerned that streaks might increase short-term engagement but cause burnout and churn. The product manager vs project manager difference is stark here: a project manager would estimate the development timeline, while the product manager needs evidence that streaks help retention rather than hurt it.

The team designed a 6-week step-project using a simplified MVP: they implemented a basic streak counter (no badges, no social features, just a 'You have worked out X days in a row' message) and deployed it to 10% of users via a feature flag. The scope excluded gamification elements, social sharing, and streak recovery mechanics. After 6 weeks with 5,000 users in each group, the streak group showed 8 percentage points higher 30-day retention and no measurable increase in churn. The assumption was validated.

The team advanced to a second step-project focused on whether adding badges and social sharing would amplify the effect or dilute it. The idea's Confidence score moved from 5 to 8.

Best Practices

  • Test one assumption per step-project, not multiple. When you bundle two assumptions into one experiment, a positive result tells you nothing about which assumption was validated, and a negative result leaves you guessing about which assumption failed. If you catch yourself writing a brief that says 'This tests whether users want it AND whether we can build it,' split it into two step-projects.

  • Always define your success threshold before the experiment begins, and share it with at least one stakeholder outside the team. Pre-committing to a number in writing eliminates the cognitive bias where teams rationalize mediocre results as 'good enough' after seeing the data. If nobody is willing to put a number on success, the idea is not well enough defined for a step-project.

  • Start with the cheapest experiment type that could disprove the assumption. Teams default to building MVPs because building feels productive, but a 3-day interview sprint or a fake door test often provides the same evidence at a small fraction of the cost. Save the expensive experiments for assumptions that cheaper methods cannot address, such as performance at scale or long-term retention.

  • Maintain a step-project evidence log that records the idea, the assumption, the experiment, the threshold, and the actual result for every step-project you run. This log becomes your team's institutional memory. It prevents re-testing assumptions that have already been validated, speeds up future ICE scoring, and gives stakeholders a transparent record of how decisions were made.

  • Cap the team size for any single step-project at 2-4 people. Larger teams introduce communication overhead that slows the experiment and adds cost without improving the quality of evidence. If the experiment genuinely requires more people, that is a signal the scope is too large and should be split.

  • Schedule a fixed 'decision day' at the end of each step-project where the team and at least one stakeholder review the results and make the go/no-go call in the same meeting. Do not let step-projects end with 'we will discuss the results next week,' because that gap is where momentum dies and ambiguous results get reinterpreted as positive.

  • Sequence step-projects so that desirability assumptions (does anyone want this?) are tested before feasibility assumptions (can we build it?). There is no point proving technical feasibility for a feature nobody wants. The only exception is when the feasibility risk is so high that even a day of engineering investigation could save weeks of wasted user research.

Common Mistakes

Designing a step-project that is actually a full product launch in disguise

Correction

This happens when the team conflates 'testing an idea' with 'shipping the idea.' The warning sign is a step-project brief where the scope section reads like a feature spec, with production-quality design, full backend integration, monitoring, and documentation. When you see this, ask: 'What is the single assumption we are testing, and what is the cheapest way to test it?' Strip out everything that does not directly produce evidence for that assumption. A step-project that takes 10 weeks and involves the full engineering team is almost certainly testing multiple assumptions simultaneously, which means it cannot give you a clean answer on any of them.

Changing the success metric or threshold after seeing early results

Correction

In research this is a form of p-hacking (sometimes called outcome switching), and it is just as damaging in product work. It typically happens when early data looks negative and the team says, 'Well, the metric we chose does not really capture what we are looking for.' The fix is to lock the metric and threshold in the step-project brief and share it with a stakeholder before starting. If you genuinely realize mid-experiment that you are measuring the wrong thing, document that insight and run a new step-project with the corrected metric. Do not retrofit the current experiment.

Running step-projects sequentially when they could run in parallel

Correction

Teams sometimes assume they must finish one step-project before starting another, even when the step-projects test different ideas or independent assumptions. This bottleneck slows learning dramatically. The rule is: step-projects for the same idea should run sequentially (because each one addresses the next-riskiest assumption, and earlier failures make later tests unnecessary). But step-projects for different ideas can and should run in parallel if you have the team capacity.

Check your step-project queue each planning cycle and look for parallelization opportunities.

Treating ambiguous results as positive and advancing to the next step-project

Correction

This is the most common mistake and the hardest to catch because it feels like optimism rather than error. It looks like this: the threshold was 15% signup rate, the result was 11%, and the team says, 'Close enough, and we think it would be higher with better design.' The problem is that experiments are designed to work in controlled conditions. If the signal is weak in a controlled test, it will be weaker in reality. Treat ambiguous results as negative unless your pre-committed decision rules explicitly define a 'gray zone' action, such as running one more test with a revised approach.

Skipping the assumption identification step and jumping straight to experiment design

Correction

This happens when the team is excited about an idea and wants to 'just try it.' Without identifying and ranking assumptions first, the experiment ends up testing whatever is easiest to measure rather than whatever is riskiest. The result is a step-project that produces a positive signal on a low-risk assumption while the high-risk assumption remains untested. Always spend 30-60 minutes listing and ranking assumptions before designing the experiment. If the team resists, remind them that the goal is to learn as fast as possible, and testing low-risk assumptions first is the slowest possible path.

Not updating the parent idea's ICE score after the step-project concludes

Correction

Step-projects exist to change the Confidence component of the ICE score. If you run a step-project and do not update the score, the evidence you gathered has no effect on future prioritization decisions. After every step-project, revisit the idea's ICE score, adjust Confidence based on the result, and re-rank the idea backlog. This is the feedback loop that makes the entire GIST system work.

Without it, step-projects become busywork.
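
The feedback loop described here can be sketched as "update Confidence, then re-sort the backlog." The sketch assumes the aggregate ICE score is the product Impact x Confidence x Ease, which is a common convention but not stated in this guide. The `Idea` class and function name are hypothetical, and the backlog reuses scores from the examples above purely for illustration, even though those ideas belong to different teams.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Idea:
    name: str
    impact: int
    confidence: int
    ease: int

    @property
    def ice(self) -> int:
        # Assumed aggregate: Impact x Confidence x Ease.
        return self.impact * self.confidence * self.ease


def update_confidence(backlog, name, new_confidence):
    """Return a re-ranked backlog with one idea's Confidence updated."""
    updated = [
        replace(idea, confidence=new_confidence) if idea.name == name else idea
        for idea in backlog
    ]
    return sorted(updated, key=lambda idea: idea.ice, reverse=True)


backlog = [
    Idea("Streak-based challenges", impact=7, confidence=5, ease=7),   # 245
    Idea("AI task prioritization", impact=7, confidence=3, ease=6),    # 126
    Idea("Real-time personalized pricing", impact=9, confidence=4, ease=2),  # 72
]

# The fake door test moved Confidence for the AI idea from 3 to 6.
reranked = update_confidence(backlog, "AI task prioritization", 6)
for idea in reranked:
    print(idea.ice, idea.name)
```

In this illustration the validated idea overtakes the previous leader, which is exactly the effect that is lost when the score is never updated.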

Frequently Asked Questions

How long should a typical step-project take?

Most step-projects fall in the 2-4 week range. Assessment experiments (interviews, competitive analysis, data mining) can be completed in 1-2 weeks. Simulation experiments (fake door tests, concierge MVPs, prototype tests) typically need 2-4 weeks to collect enough data. Full MVP experiments may need 4-8 weeks, but should only be used when cheaper methods cannot test the assumption. If your step-project is approaching 10 weeks, it is almost certainly trying to test multiple assumptions and should be split.

How do I design step-projects when the product manager and project manager roles overlap?

In organizations where one person fills both the product manager and project manager roles, the risk is defaulting to delivery planning instead of experiment design. Force yourself to complete the assumption-listing step before any scheduling or resourcing. Write the step-project brief using the experiment template (assumption, metric, threshold) rather than a project plan template (milestones, deliverables, deadlines). The brief format itself acts as a guardrail. If your document looks like a project plan, you have slipped into project management mode.

Should I design step-projects before or after ICE scoring?

After. ICE scoring happens at the idea level and determines which ideas are worth investigating. Step-projects are how you investigate the winners. Specifically, step-projects target the Confidence component of the ICE score. A low-confidence, high-impact idea is the ideal candidate for a step-project because it represents high potential value with high uncertainty. If you design step-projects before scoring, you risk investing experiment effort into ideas that are low-impact or already high-confidence. See [prioritizing ideas with ICE scoring](/skills/prioritizing-ideas-with-ice-scoring) for the scoring process.

What do I do when a step-project produces ambiguous results?

First, check your pre-committed decision rules. If they define a gray-zone action (for example, 'if signup rate is between 10% and 15%, run a follow-up test with revised messaging'), follow it. If you did not define a gray zone, treat ambiguous results as negative. The reasoning is that a controlled experiment with a small, targeted audience should produce a strong signal if the underlying assumption is true. A weak signal in controlled conditions will be even weaker at scale. Document the ambiguity, note what you would change about the experiment design, and either run a refined follow-up step-project or move on to the next idea.

How many step-projects should run in parallel?

This depends on team size and the independence of the experiments. A team of 4-6 can typically run 2-3 step-projects in parallel if they test different ideas and do not require the same people. Step-projects for the same idea should always run sequentially because each one addresses the next-riskiest assumption, and a failure at any stage may make subsequent tests unnecessary. The constraint is usually people, not process. If running parallel step-projects means each one is under-resourced and takes twice as long, you are better off running them sequentially at full speed.

Why does my step-project scope keep expanding mid-experiment?

Scope creep in step-projects usually has one of three causes. First, the assumption being tested is stated too loosely to bound the experiment. Fix this by writing a more precise assumption statement. Second, the team conflates testing with building, and starts adding production-quality elements that are unnecessary for the experiment. Third, stakeholders see the experiment and request additions. Fix this by sharing the brief and decision rules with stakeholders at the start and referring back to them when requests come in.

Can I use step-projects for non-product experiments, like marketing or operations ideas?

Yes. The step-project structure works for any hypothesis that can be tested with a time-boxed experiment and a measurable outcome. Marketing teams use step-projects to test new channels ('We believe that sponsoring three niche podcasts will generate 50 qualified leads in 4 weeks at under $30 per lead'). Operations teams use them to test process changes ('We believe that switching to async standups will reduce meeting hours by 40% without increasing blocked tasks'). The structure is the same: identify the assumption, choose the cheapest experiment, define the metric and threshold, and set the time box.