Skip to content

DAG#406

Closed
ejholmes wants to merge 6 commits into
masterfrom
dag-1.0
Closed

DAG#406
ejholmes wants to merge 6 commits into
masterfrom
dag-1.0

Conversation

@ejholmes

@ejholmes ejholmes commented Jun 21, 2017

Copy link
Copy Markdown
Contributor

This PR is an internal refactor to make stacker build a Directed Acyclic Graph when planning execution.

Fixes #300
Fixes #186
Fixes #279
Fixes #384

Why use a DAG?

Performance

The DAG simplifies parallel scheduling, so that we can execute mutually exclusive stacks at the same time.

In this PR, each stack is executed in it's own thread. The thread blocks until all of the stacks dependencies have been executed, and then immediately starts. This means that stacks at the same depth, won't block each other unnecessarily.

The worst case stacker build time, is now the total time it takes for the longest critical chain (the longest chain of dependencies). Previously, the worst case stacker build time was a combination of the longest critical chain, in addition to the widest point in the graph.

For a concrete example, using the benchmark.yaml config, this PR dropped NOOP execution time from ~1min+ to ~15 seconds.

In our internal infrastructure repo, it drops NOOP execution time from ~7 minutes down to ~2.5 minutes (with my shitty RTT in Thailand).

Automatic dependency resolution for --stacks

stacker build supports a --stacks flag, however, in practice, it's difficult to use, because it won't automatically resolve dependencies. The DAG makes this easy, so when you provide a --stacks flag now, any dependencies are automatically visited.

Simplified destroy

stacker destroy is just a stacker build with a transposed graph.

Graceful SIGINT/SIGTERM

SIGINT and SIGTERM are handled more gracefully now. When received, stacker will wait until all current stack operations complete before shutting down.

Cyclic Dependency Checks

If there are any circular dependencies, it's detected and an error is shown. Previously, stacker would just hang forever.

Duplicate Node Detection

Previously, if you accidentally added a node with the same name, you wouldn't know about it, until it overwrote your existing stack with the new template (I've done this before with an RDS stack...). With the DAG, duplicate nodes are automatically detected.

Logging Improvements

The log output is updated whenever a step changes status, which fixes a number of issues where it would previously look like stacker wasn't doing anything (e.g. a noop run would never update the statuses).

Currently, logging is pretty jiterry because of this. I'll try to fix these issues before this gets merged.

https://asciinema.org/a/ilOwudPADDXQBnmfqa57JaC4Y

What can go wrong

Thread Safety

stacker is now multi-threaded, so we need to be careful to ensure that everything is thread safe. Not sure if there are any static analysis tools in the Python world for this.

Rate limiting

Because we're executing more API calls in parallel, it's possible for stacker to hit rate limits more easily, although, I've yet to hit any rate limits in all of my tests.


Tested

  • Interactive mode
  • Tailing
  • Assuming role with MFA
  • Diff

TODO

  • Add a --j/jobs flag for specifying concurrency
  • Figure out how to show stack traces without getting overwritten by checkpointing
  • Fix jitter in checkpointing
  • Fix SIGINT catch breaking CTL-C from MFA prompt.
  • Better handling of retrieving stack status updates to avoid throttling on DescribeStacks.

@ejholmes ejholmes requested a review from phobologic June 21, 2017 01:22
@codecov

codecov Bot commented Jun 21, 2017

Copy link
Copy Markdown

Codecov Report

Merging #406 into master will decrease coverage by 1.57%.
The diff coverage is 86.17%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #406      +/-   ##
==========================================
- Coverage   87.85%   86.27%   -1.58%     
==========================================
  Files          86       89       +3     
  Lines        4428     4546     +118     
==========================================
+ Hits         3890     3922      +32     
- Misses        538      624      +86
Impacted Files Coverage Δ
stacker/tests/test_context.py 100% <ø> (ø) ⬆️
stacker/exceptions.py 90.9% <100%> (-1.49%) ⬇️
stacker/tests/test_dag.py 100% <100%> (ø)
stacker/context.py 98.48% <100%> (-0.07%) ⬇️
stacker/tests/actions/test_build.py 96.17% <100%> (-0.3%) ⬇️
stacker/tests/actions/test_destroy.py 100% <100%> (+1.13%) ⬆️
stacker/status.py 97.61% <100%> (+0.64%) ⬆️
stacker/commands/stacker/destroy.py 75% <25%> (ø) ⬆️
stacker/commands/stacker/base.py 76.82% <30.76%> (-8.68%) ⬇️
stacker/actions/diff.py 45.09% <40%> (+3.35%) ⬆️
... and 22 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 106dbc6...e16fa11. Read the comment docs.

Comment thread stacker/actions/base.py
def check_point_fn():
"""Adds a check_point function to each of the given steps."""

lock = threading.Lock()

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the jitter comes from the lock here. Checkpointing should be throttled, not locked, so that multiple status changes in short succession don't spam the terminal.

Comment thread stacker/dag/__init__.py
transposed.add_edge(edge, node)
return transposed

def walk(self, walk_func, graph=None):

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This probably deserves some special attention when reviewed. It's the core algorithm for implementing a parallel walk using threads that block on each other.

Comment thread stacker/dag/__init__.py
pass


class DAG(object):

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI, this implementation is pulled from https://github.com/thieman/py-dag with a few changes (adds walk and transpose methods). Since it's small, I just pulled it in directly so we don't need to add an extra dependency.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even though we plan to pull this in, do you think a PR to the origin would be worth while with the additional walk and transpose methods?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think transpose is generic enough to contribute back. walk is somewhat specific, being threaded and all (in reality, I should probably pull it off of DAG entirely and just make it a function that takes a DAG as an argument).


def semaphore(self, options):
if options.interactive:
return threading.Semaphore(1)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently, dropping down to interactive mode removes parallelism, which isn't actually necessary. It would be better to add locks around the change set prompts.

@ejholmes

Copy link
Copy Markdown
Contributor Author

One annoying thing about going to threading, is that if a thread throws an exception, the stack trace can get overwritten by checkpointing:

https://asciinema.org/a/EUd6jUK8ogTqjsm0CgjMgtXDB

Comment thread stacker/actions/build.py

def _generate_plan(self, tail=False):
plan_kwargs = {}
if tail:

@ejholmes ejholmes Jun 22, 2017

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not really sure what to do about the --tail flag. From playing with it on master, it doesn't seem to really work very well (it constantly gets overwritten by checkpoint'ing, see #386).

Since the DAG will likely go into 1.1, maybe we should pull --tail out?

If we want to keep it in, we can make it work, but we should probably fix it in 1.0.x first. Maybe if --tail is specified, disable the loop logger.

@danielkza

Copy link
Copy Markdown
Contributor

Apologies if it is too late given the seemly advanced stage of this PR, but were any other approaches considered other than explicitly building a DAG and walking it in multiple threads?

I've recently had a similar use case, and chose to implement it using Python 3's asyncio with "implicit" dependencies defined by waiting on futures corresponding to parents ' builds instead of explicitly walking the graph (which has the secondary advantage of making data propagation to children easier). Concurrency is limited by waiting on a semaphore before the first stage of an actual build (but only after waiting for dependencies).

While asyncio is not an option with Python 2, and probably a bit too large of a change for stacker as a whole, a very similar approach can be implemented with threads and events.

The interesting part of that implementation can be seen here (apologies for the less-than-prod-quality code).

@ejholmes

ejholmes commented Jul 10, 2017

Copy link
Copy Markdown
Contributor Author

@danielkza thanks for the feedback!

I think there's probably two things here that are orthogonal.

  1. Building a graph before execution.
  2. Whether or not to use threads for parallelization. The current implementation here uses threads to walk the graph, using an algorithm very similar to what you described above, but could be implemented using asyncio I think (if we ditched python2 support).

There's a couple of advantages that I see of building an explicit graph vs implicitly waiting on outputs:

  1. When we build a graph, it makes introspection of dependencies really easy. We can output a DOT formatted graph so you can see how all the stacks relate to each other.
  2. It makes the destroy action really simple, since we just need to take the graph and then transpose it.
  3. We can detect cyclic dependencies.

So, with that in place, then the question comes down to whether or not to use threading to walk the graph.

I'm definitely not married to the idea of using threads for parallelization here, since it's pretty resource inefficient (we're just waiting on I/O to complete). I'm not sure what other options there are for python2 that would give us the same level of performance though.

@danielkza

Copy link
Copy Markdown
Contributor

@ejholmes Yeah, I figure the explicit DAC has some clear advantages. But I still think using futures for "flow control" would be preferable to checkpointing and explicit state monitoring. They make waiting relationships much more clear, likely easier to debug an test, and can more easily work with a fixed pool of resources instead of creating an arbitrary number of threads.

I will play with the idea a bit during this week and see if it actually makes sense. At worse I'll learn why it doesn't work and maybe get some insights.

@phobologic

Copy link
Copy Markdown
Member

Looks like someone ported asyncio to 2.7 - though YMMV: https://pypi.python.org/pypi/trollius

@ejholmes ejholmes mentioned this pull request Jan 26, 2018
Merged
4 tasks
@ejholmes

Copy link
Copy Markdown
Contributor Author

Closing this in favor of #523.

@ejholmes ejholmes closed this Jan 26, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Cannot destroy individual stack with requires handle dependency trees within stacker's purview General slowness Allow for sub-tree provisioning

4 participants