DAG by ejholmes · Pull Request #406 · cloudtools/stacker

ejholmes · 2017-06-21T01:22:38Z

This PR is an internal refactor to make stacker build a Directed Acyclic Graph when planning execution.

Fixes #300
Fixes #186
Fixes #279
Fixes #384

Why use a DAG?

Performance

The DAG simplifies parallel scheduling, so that we can execute mutually exclusive stacks at the same time.

In this PR, each stack is executed in it's own thread. The thread blocks until all of the stacks dependencies have been executed, and then immediately starts. This means that stacks at the same depth, won't block each other unnecessarily.

The worst case stacker build time, is now the total time it takes for the longest critical chain (the longest chain of dependencies). Previously, the worst case stacker build time was a combination of the longest critical chain, in addition to the widest point in the graph.

For a concrete example, using the benchmark.yaml config, this PR dropped NOOP execution time from ~1min+ to ~15 seconds.

In our internal infrastructure repo, it drops NOOP execution time from ~7 minutes down to ~2.5 minutes (with my shitty RTT in Thailand).

Automatic dependency resolution for `--stacks`

stacker build supports a --stacks flag, however, in practice, it's difficult to use, because it won't automatically resolve dependencies. The DAG makes this easy, so when you provide a --stacks flag now, any dependencies are automatically visited.

Simplified destroy

stacker destroy is just a stacker build with a transposed graph.

Graceful SIGINT/SIGTERM

SIGINT and SIGTERM are handled more gracefully now. When received, stacker will wait until all current stack operations complete before shutting down.

Cyclic Dependency Checks

If there are any circular dependencies, it's detected and an error is shown. Previously, stacker would just hang forever.

Duplicate Node Detection

Previously, if you accidentally added a node with the same name, you wouldn't know about it, until it overwrote your existing stack with the new template (I've done this before with an RDS stack...). With the DAG, duplicate nodes are automatically detected.

Logging Improvements

The log output is updated whenever a step changes status, which fixes a number of issues where it would previously look like stacker wasn't doing anything (e.g. a noop run would never update the statuses).

Currently, logging is pretty jiterry because of this. I'll try to fix these issues before this gets merged.

https://asciinema.org/a/ilOwudPADDXQBnmfqa57JaC4Y

What can go wrong

Thread Safety

stacker is now multi-threaded, so we need to be careful to ensure that everything is thread safe. Not sure if there are any static analysis tools in the Python world for this.

Rate limiting

Because we're executing more API calls in parallel, it's possible for stacker to hit rate limits more easily, although, I've yet to hit any rate limits in all of my tests.

Tested

Interactive mode
Tailing
Assuming role with MFA
Diff

TODO

Add a --j/jobs flag for specifying concurrency
Figure out how to show stack traces without getting overwritten by checkpointing
Fix jitter in checkpointing
Fix SIGINT catch breaking CTL-C from MFA prompt.
Better handling of retrieving stack status updates to avoid throttling on DescribeStacks.

codecov · 2017-06-21T01:22:41Z

Codecov Report

Merging #406 into master will decrease coverage by 1.57%.
The diff coverage is 86.17%.

@@            Coverage Diff             @@
##           master     #406      +/-   ##
==========================================
- Coverage   87.85%   86.27%   -1.58%     
==========================================
  Files          86       89       +3     
  Lines        4428     4546     +118     
==========================================
+ Hits         3890     3922      +32     
- Misses        538      624      +86

Impacted Files	Coverage Δ
stacker/tests/test_context.py	`100% <ø> (ø)`	⬆️
stacker/exceptions.py	`90.9% <100%> (-1.49%)`	⬇️
stacker/tests/test_dag.py	`100% <100%> (ø)`
stacker/context.py	`98.48% <100%> (-0.07%)`	⬇️
stacker/tests/actions/test_build.py	`96.17% <100%> (-0.3%)`	⬇️
stacker/tests/actions/test_destroy.py	`100% <100%> (+1.13%)`	⬆️
stacker/status.py	`97.61% <100%> (+0.64%)`	⬆️
stacker/commands/stacker/destroy.py	`75% <25%> (ø)`	⬆️
stacker/commands/stacker/base.py	`76.82% <30.76%> (-8.68%)`	⬇️
stacker/actions/diff.py	`45.09% <40%> (+3.35%)`	⬆️
... and 22 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 106dbc6...e16fa11. Read the comment docs.

ejholmes · 2017-06-21T01:25:35Z

+def check_point_fn():
+    """Adds a check_point function to each of the given steps."""
+
+    lock = threading.Lock()


I think the jitter comes from the lock here. Checkpointing should be throttled, not locked, so that multiple status changes in short succession don't spam the terminal.

ejholmes · 2017-06-21T01:35:48Z

+                transposed.add_edge(edge, node)
+        return transposed
+
+    def walk(self, walk_func, graph=None):


This probably deserves some special attention when reviewed. It's the core algorithm for implementing a parallel walk using threads that block on each other.

ejholmes · 2017-06-21T01:37:49Z

+    pass
+
+
+class DAG(object):


FYI, this implementation is pulled from https://github.com/thieman/py-dag with a few changes (adds walk and transpose methods). Since it's small, I just pulled it in directly so we don't need to add an extra dependency.

Even though we plan to pull this in, do you think a PR to the origin would be worth while with the additional walk and transpose methods?

I think transpose is generic enough to contribute back. walk is somewhat specific, being threaded and all (in reality, I should probably pull it off of DAG entirely and just make it a function that takes a DAG as an argument).

ejholmes · 2017-06-21T01:45:03Z


+    def semaphore(self, options):
+        if options.interactive:
+            return threading.Semaphore(1)


Currently, dropping down to interactive mode removes parallelism, which isn't actually necessary. It would be better to add locks around the change set prompts.

ejholmes · 2017-06-21T22:44:15Z

One annoying thing about going to threading, is that if a thread throws an exception, the stack trace can get overwritten by checkpointing:

https://asciinema.org/a/EUd6jUK8ogTqjsm0CgjMgtXDB

ejholmes · 2017-06-22T02:14:21Z


-    def _generate_plan(self, tail=False):
-        plan_kwargs = {}
-        if tail:


I'm not really sure what to do about the --tail flag. From playing with it on master, it doesn't seem to really work very well (it constantly gets overwritten by checkpoint'ing, see #386).

Since the DAG will likely go into 1.1, maybe we should pull --tail out?

If we want to keep it in, we can make it work, but we should probably fix it in 1.0.x first. Maybe if --tail is specified, disable the loop logger.

danielkza · 2017-07-09T20:01:01Z

Apologies if it is too late given the seemly advanced stage of this PR, but were any other approaches considered other than explicitly building a DAG and walking it in multiple threads?

I've recently had a similar use case, and chose to implement it using Python 3's asyncio with "implicit" dependencies defined by waiting on futures corresponding to parents ' builds instead of explicitly walking the graph (which has the secondary advantage of making data propagation to children easier). Concurrency is limited by waiting on a semaphore before the first stage of an actual build (but only after waiting for dependencies).

While asyncio is not an option with Python 2, and probably a bit too large of a change for stacker as a whole, a very similar approach can be implemented with threads and events.

The interesting part of that implementation can be seen here (apologies for the less-than-prod-quality code).

ejholmes · 2017-07-10T23:43:20Z

@danielkza thanks for the feedback!

I think there's probably two things here that are orthogonal.

Building a graph before execution.
Whether or not to use threads for parallelization. The current implementation here uses threads to walk the graph, using an algorithm very similar to what you described above, but could be implemented using asyncio I think (if we ditched python2 support).

There's a couple of advantages that I see of building an explicit graph vs implicitly waiting on outputs:

When we build a graph, it makes introspection of dependencies really easy. We can output a DOT formatted graph so you can see how all the stacks relate to each other.
It makes the destroy action really simple, since we just need to take the graph and then transpose it.
We can detect cyclic dependencies.

So, with that in place, then the question comes down to whether or not to use threading to walk the graph.

I'm definitely not married to the idea of using threads for parallelization here, since it's pretty resource inefficient (we're just waiting on I/O to complete). I'm not sure what other options there are for python2 that would give us the same level of performance though.

danielkza · 2017-07-11T01:52:20Z

@ejholmes Yeah, I figure the explicit DAC has some clear advantages. But I still think using futures for "flow control" would be preferable to checkpointing and explicit state monitoring. They make waiting relationships much more clear, likely easier to debug an test, and can more easily work with a fixed pool of resources instead of creating an arbitrary number of threads.

I will play with the idea a bit during this week and see if it actually makes sense. At worse I'll learn why it doesn't work and maybe get some insights.

phobologic · 2017-07-12T22:08:28Z

Looks like someone ported asyncio to 2.7 - though YMMV: https://pypi.python.org/pypi/trollius

ejholmes · 2018-01-26T05:18:17Z

Closing this in favor of #523.

ejholmes added 2 commits June 20, 2017 17:56

Benchmarks

d9bb71f

Use a DAG for plan execution.

cb20275

ejholmes requested a review from phobologic June 21, 2017 01:22

ejholmes commented Jun 21, 2017

View reviewed changes

Set step status as cancelled (errored) if the step raises an exception

29b883d

ejholmes force-pushed the dag-1.0 branch from c6fded7 to 29b883d Compare June 21, 2017 22:49

ejholmes added 2 commits June 21, 2017 17:24

Pretty errors

e7e7e0c

Make --stacks flag work as expected (build dependencies)

3580c5b

ejholmes commented Jun 22, 2017

View reviewed changes

Log exception, and show it when not in the loop logger

e16fa11

ejholmes mentioned this pull request Jan 26, 2018

DAG #523

Merged

4 tasks

ejholmes closed this Jan 26, 2018

Conversation

ejholmes commented Jun 21, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Why use a DAG?

Performance

Automatic dependency resolution for --stacks

Simplified destroy

Graceful SIGINT/SIGTERM

Cyclic Dependency Checks

Duplicate Node Detection

Logging Improvements

What can go wrong

Thread Safety

Rate limiting

Tested

TODO

Uh oh!

codecov Bot commented Jun 21, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

ejholmes Jun 21, 2017

Choose a reason for hiding this comment

Uh oh!

ejholmes Jun 21, 2017

Choose a reason for hiding this comment

Uh oh!

ejholmes Jun 21, 2017

Choose a reason for hiding this comment

Uh oh!

russellballestrini Jun 21, 2017

Choose a reason for hiding this comment

Uh oh!

ejholmes Jun 21, 2017

Choose a reason for hiding this comment

Uh oh!

ejholmes Jun 21, 2017

Choose a reason for hiding this comment

Uh oh!

ejholmes commented Jun 21, 2017

Uh oh!

ejholmes Jun 22, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

danielkza commented Jul 9, 2017

Uh oh!

ejholmes commented Jul 10, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

danielkza commented Jul 11, 2017

Uh oh!

phobologic commented Jul 12, 2017

Uh oh!

ejholmes commented Jan 26, 2018

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

ejholmes commented Jun 21, 2017 •

edited

Loading

Automatic dependency resolution for `--stacks`

codecov Bot commented Jun 21, 2017 •

edited

Loading

ejholmes Jun 22, 2017 •

edited

Loading

ejholmes commented Jul 10, 2017 •

edited

Loading