Conversation
Codecov Report
@@ Coverage Diff @@
## master #406 +/- ##
==========================================
- Coverage 87.85% 86.27% -1.58%
==========================================
Files 86 89 +3
Lines 4428 4546 +118
==========================================
+ Hits 3890 3922 +32
- Misses 538 624 +86
Continue to review full report at Codecov.
|
| def check_point_fn(): | ||
| """Adds a check_point function to each of the given steps.""" | ||
|
|
||
| lock = threading.Lock() |
There was a problem hiding this comment.
I think the jitter comes from the lock here. Checkpointing should be throttled, not locked, so that multiple status changes in short succession don't spam the terminal.
| transposed.add_edge(edge, node) | ||
| return transposed | ||
|
|
||
| def walk(self, walk_func, graph=None): |
There was a problem hiding this comment.
This probably deserves some special attention when reviewed. It's the core algorithm for implementing a parallel walk using threads that block on each other.
| pass | ||
|
|
||
|
|
||
| class DAG(object): |
There was a problem hiding this comment.
FYI, this implementation is pulled from https://github.com/thieman/py-dag with a few changes (adds walk and transpose methods). Since it's small, I just pulled it in directly so we don't need to add an extra dependency.
There was a problem hiding this comment.
Even though we plan to pull this in, do you think a PR to the origin would be worth while with the additional walk and transpose methods?
There was a problem hiding this comment.
I think transpose is generic enough to contribute back. walk is somewhat specific, being threaded and all (in reality, I should probably pull it off of DAG entirely and just make it a function that takes a DAG as an argument).
|
|
||
| def semaphore(self, options): | ||
| if options.interactive: | ||
| return threading.Semaphore(1) |
There was a problem hiding this comment.
Currently, dropping down to interactive mode removes parallelism, which isn't actually necessary. It would be better to add locks around the change set prompts.
|
One annoying thing about going to threading, is that if a thread throws an exception, the stack trace can get overwritten by checkpointing: |
|
|
||
| def _generate_plan(self, tail=False): | ||
| plan_kwargs = {} | ||
| if tail: |
There was a problem hiding this comment.
I'm not really sure what to do about the --tail flag. From playing with it on master, it doesn't seem to really work very well (it constantly gets overwritten by checkpoint'ing, see #386).
Since the DAG will likely go into 1.1, maybe we should pull --tail out?
If we want to keep it in, we can make it work, but we should probably fix it in 1.0.x first. Maybe if --tail is specified, disable the loop logger.
|
Apologies if it is too late given the seemly advanced stage of this PR, but were any other approaches considered other than explicitly building a DAG and walking it in multiple threads? I've recently had a similar use case, and chose to implement it using Python 3's While The interesting part of that implementation can be seen here (apologies for the less-than-prod-quality code). |
|
@danielkza thanks for the feedback! I think there's probably two things here that are orthogonal.
There's a couple of advantages that I see of building an explicit graph vs implicitly waiting on outputs:
So, with that in place, then the question comes down to whether or not to use threading to walk the graph. I'm definitely not married to the idea of using threads for parallelization here, since it's pretty resource inefficient (we're just waiting on I/O to complete). I'm not sure what other options there are for python2 that would give us the same level of performance though. |
|
@ejholmes Yeah, I figure the explicit DAC has some clear advantages. But I still think using futures for "flow control" would be preferable to checkpointing and explicit state monitoring. They make waiting relationships much more clear, likely easier to debug an test, and can more easily work with a fixed pool of resources instead of creating an arbitrary number of threads. I will play with the idea a bit during this week and see if it actually makes sense. At worse I'll learn why it doesn't work and maybe get some insights. |
|
Looks like someone ported asyncio to 2.7 - though YMMV: https://pypi.python.org/pypi/trollius |
|
Closing this in favor of #523. |
This PR is an internal refactor to make stacker build a Directed Acyclic Graph when planning execution.
Fixes #300
Fixes #186
Fixes #279
Fixes #384
Why use a DAG?
Performance
The DAG simplifies parallel scheduling, so that we can execute mutually exclusive stacks at the same time.
In this PR, each stack is executed in it's own thread. The thread blocks until all of the stacks dependencies have been executed, and then immediately starts. This means that stacks at the same depth, won't block each other unnecessarily.
The worst case
stacker buildtime, is now the total time it takes for the longest critical chain (the longest chain of dependencies). Previously, the worst casestacker buildtime was a combination of the longest critical chain, in addition to the widest point in the graph.For a concrete example, using the
benchmark.yamlconfig, this PR dropped NOOP execution time from ~1min+ to ~15 seconds.In our internal infrastructure repo, it drops NOOP execution time from ~7 minutes down to ~2.5 minutes (with my shitty RTT in Thailand).
Automatic dependency resolution for
--stacksstacker buildsupports a--stacksflag, however, in practice, it's difficult to use, because it won't automatically resolve dependencies. The DAG makes this easy, so when you provide a--stacksflag now, any dependencies are automatically visited.Simplified destroy
stacker destroyis just astacker buildwith a transposed graph.Graceful SIGINT/SIGTERM
SIGINT and SIGTERM are handled more gracefully now. When received, stacker will wait until all current stack operations complete before shutting down.
Cyclic Dependency Checks
If there are any circular dependencies, it's detected and an error is shown. Previously, stacker would just hang forever.
Duplicate Node Detection
Previously, if you accidentally added a node with the same name, you wouldn't know about it, until it overwrote your existing stack with the new template (I've done this before with an RDS stack...). With the DAG, duplicate nodes are automatically detected.
Logging Improvements
The log output is updated whenever a step changes status, which fixes a number of issues where it would previously look like stacker wasn't doing anything (e.g. a noop run would never update the statuses).
Currently, logging is pretty jiterry because of this. I'll try to fix these issues before this gets merged.
https://asciinema.org/a/ilOwudPADDXQBnmfqa57JaC4Y
What can go wrong
Thread Safety
stacker is now multi-threaded, so we need to be careful to ensure that everything is thread safe. Not sure if there are any static analysis tools in the Python world for this.
Rate limiting
Because we're executing more API calls in parallel, it's possible for stacker to hit rate limits more easily, although, I've yet to hit any rate limits in all of my tests.
Tested
TODO
--j/jobsflag for specifying concurrency