Skip to content

Comparing array subset checks#125

Closed
gabteles wants to merge 2 commits into
fastruby:mainfrom
gabteles:master
Closed

Comparing array subset checks#125
gabteles wants to merge 2 commits into
fastruby:mainfrom
gabteles:master

Conversation

@gabteles

Copy link
Copy Markdown

No description provided.

@mblumtritt

mblumtritt commented May 16, 2017

Copy link
Copy Markdown

Your code above is massively dependent to used sample arrays. Using this samples

ARRAY1 = [*(400..500)].freeze
ARRAY2 = [*(1..500)].freeze

will change your results completely. I would prefer to use a more real-world sample for the test to avoid misleadings…

@Arcovion

Arcovion commented May 16, 2017

Copy link
Copy Markdown
Collaborator

Maybe (ARRAY1 & ARRAY2).size == ARRAY1.size is faster - also try the set library; convert both #to_set and check with #subset?

@gabteles

Copy link
Copy Markdown
Author

@mblumtritt What you suggest as a real world example? It would be nice if I generate a random-filled array of integers and let it static in the test or something like that?

@Arcovion Nice, I'll add both.

Thank you guys. (:

@mblumtritt

Copy link
Copy Markdown

@gabteles Do you mean something like this?

ARRAY1 = [*(400..500)].shuffle.freeze
ARRAY2 = [*(1..500)].shuffle.freeze

Then we have to other problem: how is the weight between the set and the subset 🤔 Or do you prefer a complete random set to compare?…

In "real world" I would check my chances to be faster with a complete different algorithm. For sample what about sorting first descending? This might be faster if both sets already nearly sorted…

I'm unsure to give a "general advice" for this subset problem. In "real world" it always depends… ;)

@gabteles

gabteles commented Jun 1, 2017

Copy link
Copy Markdown
Author

@mblumtritt Yeah, I agree that the weight between set/subset is a big problem to benchmark. I'm not sure how to handle it.

Before I meant something like this:

  • Generate random arrays (generated with small size to better readability in this example, test could have bigger arrays):
$ irb
2.4.1 :001 > 10.times.map { rand(100) }
 => [20, 83, 86, 53, 63, 52, 50, 43, 18, 94] 
2.4.1 :002 > 10.times.map { rand(100) }
 => [27, 93, 84, 86, 71, 89, 38, 48, 37, 22]
  • Statically use it in code:
ARRAY1 = [20, 83, 86, 53, 63, 52, 50, 43, 18, 94].freeze
ARRAY2 = [27, 93, 84, 86, 71, 89, 38, 48, 37, 22].freeze

Using it statically grants that we won't have performance improved/degraded by randomness effect of shuffle or rand in runtime. But it also suffers of the weight problem.

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds a new benchmark to compare different ways of checking whether one Ruby array is a subset of another, and documents the results in the main README alongside existing performance comparisons.

Changes:

  • Added a new benchmark script for array subset detection methods (code/array/subset-checking.rb).
  • Added a README section documenting the benchmark command and sample results (README.md).

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

File Description
README.md Adds a new documented benchmark section and sample output for array subset detection.
code/array/subset-checking.rb Introduces the benchmark implementation comparing multiple subset-check approaches.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

ARRAY2 = [*1..100]

def slow_set
ARRAY2.to_set.subset?(ARRAY1.to_set)
Comment on lines +11 to +31
def slow_interception
(ARRAY1 & ARRAY2) == ARRAY1
end

def slow_interception_size
(ARRAY1 & ARRAY2).size == ARRAY1.size
end

def slow_minus_empty
(ARRAY1 - ARRAY2).empty?
end

def fast
ARRAY1.all?{|element| ARRAY2.include?(element) }
end

Benchmark.ips do |x|
x.report("(a1 - a2).empty?") { slow_minus_empty }
x.report("(a1 & a2) == a1") { slow_interception }
x.report("Array#all?#include?") { fast }
x.report("Array#&#size") { slow_interception_size }
end

def fast
ARRAY1.all?{|element| ARRAY2.include?(element) }
Comment thread README.md
Comment on lines +392 to +396
##### Subset detection with various methods [code](code/array/subset-checking.rb)

```
$ ruby -v code/array/subset-checking.rb
ruby 2.4.1p111 (2017-03-22 revision 58053) [x86_64-linux]
@etagwerker

Copy link
Copy Markdown
Member

Thanks for this, @gabteles, and thanks to @mblumtritt and @Arcovion for the thoughtful review back in 2017. I revisited this on Ruby 4.0.0 to see whether it still holds up, and unfortunately the concern @mblumtritt raised is exactly what sinks it. Going to close it, but I want to lay out the reasoning so it's useful for anyone who finds this later.

1. The conclusion no longer holds on Ruby 4.0

Running the benchmark verbatim, the ranking has flipped and the gaps have collapsed:

Method Ruby 2.4.1 (this PR) Ruby 4.0.0
(a1 - a2).empty? 1.41x slower fastest
Array#&#size 1.66x slower 1.08x slower
Array#all?#include? fastest 1.09x slower
(a1 & a2) == a1 1.72x slower 1.15x slower
Set#subset? 6.73x slower 1.54x slower

Array#all?#include? went from clear winner to mid-pack, and the spread shrank from ~6.7x to ~1.5x.

2. The Set#subset? case has a bug

The arguments are reversed:

ARRAY2.to_set.subset?(ARRAY1.to_set)   # checks if the BIG array is a subset of the small one

It returns false while every other method returns true, so that row was never measuring the same thing. It also builds the sets inside the benchmark loop, so it's really timing to_set conversion rather than the subset check.

3. The real problem: there is no stable winner (the 2017 concern, confirmed)

@mblumtritt's point that the result is "massively dependent on the used sample arrays" is fully borne out on Ruby 4.0. The fastest method changes completely with the input:

  • True subset, 25/100 (this PR's case): (a1 - a2).empty? wins; all?#include? is mid-pack.
  • Not a subset, first element missing: all?#include? is ~6x faster than everything else, because all? short-circuits on the first miss.
  • Larger true subset, 250/1000: all?#include? is 44.9x slower. It's O(n·m), a Ruby-level block calling an O(m) include? per element, so it falls off a cliff as the arrays grow.
  • Set#subset? with the sets built once (outside the loop, args correct) is the fastest approach in every scenario, 6.5x–42x faster, because it's O(n) hash lookups and scales.

fast-ruby works best when it can give a clear "do X, not Y." This comparison has no robust winner, has a correctness bug, and as written penalizes the approach that's actually fastest at scale, so I'd rather not enshrine it as an idiom.

The genuinely useful takeaway: if you're doing subset checks (especially repeatedly or on large arrays), convert to Set once and use Set#subset?. (a1 - a2).empty? is the most consistent array-only option; all?#include? only wins when it can bail out early on a non-subset.

Closing for those reasons, but thank you all for the contribution and the discussion. It aged into a nice case study of why benchmarks need realistic, varied inputs.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants