Skip to content

Add support for delta callback API.#209

Open
dbaarda wants to merge 18 commits into
librsync:masterfrom
dbaarda:add/callback-api1
Open

Add support for delta callback API.#209
dbaarda wants to merge 18 commits into
librsync:masterfrom
dbaarda:add/callback-api1

Conversation

@dbaarda

@dbaarda dbaarda commented Jun 11, 2020

Copy link
Copy Markdown
Member

This is in a pretty rough initial state, but it does work and passes all
tests. It's far from optimized, and only changes deltas to the new API. A
bunch of stuff is yet to be deprecated away pending other job types using this
API too.

Add deltagen.[ch] which defines and implements the delta generator callback
API. The default rs_deltagen_t generates deltas identical to the current
release. Add some rs_send_t data writers for files, buffers, and jobs. Also
includes an unused rs_bufout_t data writer that buffers output.

Change job.h to add fields needed by the delta generator API. Also change all
match/miss length fields to int to reflect that the lengths are constrained
to less than int-sized lengths.

Change delta.c to use an rs_deltagen_t delta generator, fixing it to correctly
flush and not rely on iterator calls to invoke rs_tube_catchup() between
state-function calls. Also make it never accumulte more than
MAX_DATA_LEN=32768 of hits as well as misses before invoking the the callback
API, since we now keep and pass match data to the API as well.

Change emit.[ch] to emit cmds into buffers, not directly to the tube. We need
this so cmds can be sent to different delta-generator backends, not just to
the job stream via the tube.

In netint.[ch] add rs_put_netint() to write netints to a buffer, not directly
to the tube. This is now used by emit.c.

This is in a pretty rough initial state, but it does work and passes all
tests. It's far from optimized, and only changes deltas to the new API. A
bunch of stuff is yet to be deprecated away pending other job types using this
API too.

Add deltagen.[ch] which defines and implements the delta generator callback
API. The default rs_deltagen_t generates deltas identical to the current
release. Add some rs_send_t data writers for files, buffers, and jobs. Also
includes an unused rs_bufout_t data writer that buffers output.

Change job.h to add fields needed by the delta generator API. Also change all
match/miss length fields to `int` to reflect that the lengths are constrained
to less than int-sized lengths.

Change delta.c to use an rs_deltagen_t delta generator, fixing it to correctly
flush and not rely on iterator calls to invoke rs_tube_catchup() between
state-function calls. Also make it never accumulte more than
MAX_DATA_LEN=32768 of hits as well as misses before invoking the the callback
API, since we now keep and pass match data to the API as well.

Change emit.[ch] to emit cmds into buffers, not directly to the tube. We need
this so cmds can be sent to different delta-generator backends, not just to
the job stream via the tube.

In netint.[ch] add rs_put_netint() to write netints to a buffer, not directly
to the tube. This is now used by emit.c.
@dbaarda

dbaarda commented Jun 12, 2020

Copy link
Copy Markdown
Member Author

Some random thoughts on this.

There are some fundamental assumptions in the librsync implementation about input and output being byte-streams going to/from buffers_t stream objects that are incompatible with this. The delta callbacks API means the output type depends on the delta generator, and might not even be data; it could be a sequence of callbacks that trigger other actions.

The assumptions about buffers_t streaming input/output lead to the scoop (for input) and tube (for output). The tube is the assumed output for things like netint.[ch] rs_squirt_*() methods, and emit.[ch] rs_emit_*() methods. The tube also has support for directly copying from the scoop. The job.c rs_job_iter() iterator assumes output goes to the tube and handles flushing it before each iteration (in rs_job_work()).

The scoop is not too bad, but could use some improvements to reduce the amount of copying from input to the scoop. Delta currently requires contiguous data to scan through, so copies all input into the scoop before processing it. It would be possible to change this with a better scoop abstraction layer that could "skip" over the discontinuity between input buffers and the scoop buffer, so that only the tail end of the input buffer (up to 1 blocklen) would need to be copied into the scoop. However, it's not clear that the performance benefits of avoiding the copy would outweigh the performance cost of the abstraction layer to handle the input/scoop discontinuity.

The tube is however completely incompatible with the generator callback idea, and this pull request changes delta.c to not use it at all. Instead the deltagen_t generator uses a rs_sent_t() callback for writing the data. This approach could be more efficient than the tube's feature for copying directly from the scoop, allowing the scoop to be directly written to a file instead of copied to a buffer before being written to a file.

So;

  1. the scoop architecture is OK but ideally should be better abstracted and perhaps not assume an rs_buffers_t input, but provide an rs_send_t() callback method for filling it.

  2. The tube architecture is not OK, and should be replaced with using rs_send_t() callback for output. This would mean all the netint rs_squirt_*() would be obsolete.

dbaarda added 6 commits June 12, 2020 11:22
It was an interesting idea, but it turned out to not be useful.
This removes mention of the tube, which delta.c nolonger uses.
Removes more mentions of the tube, which delta.c doesn't use any more.
…a file.

This means `rs_delta_file()` bypasses the job stream output buffer and
directly writes to the output file using `rs_file_send()`.
This is a bit rough and not used yet.
@dbaarda

dbaarda commented Sep 3, 2021

Copy link
Copy Markdown
Member Author

Thinking aloud about API's.

The current "streaming API" is based on a design where you give the command input and output buffers to read/write data from. This copies many other "streaming API" designs like zlib,zstd,etc. The idea behind this API is it outsources buffer management to the client, freeing the library to just process input and generate output. This has several inefficiencies;

  1. Sometimes there is insufficient input data to make progress. This requires we either block until the user provides a larger/fuller input buffer, or we need to copy the input into an internal buffer so we can accumulate enough data to progress.
  2. If we don't copy into an internal buffer, there is always going to be a tail of data left in the input buffer, and the user is going to have to manage shuffling it to the front of a new buffer and appending more data.
  3. If we do copy into an internal buffer, requirements for contiguous data mean we may need to copy all the input data into our internal buffers for processing anyway (currently delta's do this).
  4. If we do keep an internal buffer, contiguous data requirements means the tail of data left in our internal buffer will also need to be shuffled back to the front.
  5. If input data is to be directly sent to output, we need to copy from the input (or internal) buffer into the output buffer.
  6. Sometimes there might be insufficient space in the output buffer to make progress, requiring us to block until the client provides sufficient output space.

Note that the actual details of how this is implemented can be different (buffer structures, callbacks, etc), but ultimately comes down to the same thing; buffers of data in and buffers of data out. What can be changed is who manages and provides the buffers.

For 1. and 2. I think keeping an internal buffer is the right solution. Not having one basically burdens the client with the task of implementing/managing one, and in practice we would need to provide a higher-level API that does this for them. Note that this completely undermines the original API objective of outsourcing buffer management to the client.

One solution to 3. is to have librsync provide the input buffer for the client to write data into. This way we tell the client where in our internal buffer to directly write the input data, avoiding any copying to internal buffers.

Removing the requirement for contiguous data also solves 3. and 4. This would allow use of circular buffers and/or processing a mixture of internal and external buffers without copying between them. However, this complicates the internal API significantly, with every *buf, len argument needing to be replaced with some kind of *dataobj, pos, len data object with an API that allows accessing/processing the data as some kind of sequence of non-contiguous data buffers. The overheads of this kind of API might actually be worse than just copying the data into contiguous buffers.

A solution for 5. is to have librsync provide the output buffer for the client to read data from. This way we can point directly at the input (or internal) buffer when input data goes directly to the output. Note that librsync is kinda unusual in that input data often does go directly to the output. Most other applications like compressors etc output is rarely/never copied directly from the input, and when/if we add compression to librsync this will also go away.

Currently I don't think 6. ever happens; as long as there is any space in the output buffer, progress can be made. I think this is also applies to nearly every other compressor/etc that uses this kind of API.

Finally, it would be good if this kind of API was used throughout, not just for external but also internal interfaces when chaining internal pieces like delta calculation, compression, etc together. Building pipelines should be easy and efficient, so the output API should be easy to plug directly into the input API without requiring extra data copies. This implies that if librsync provides the input buffer, it should require the client to provide the output buffer, or vice-versa. The alternative for pipelines would be to always provide both input and output buffers, and use "double adaptors" that copy output buffers to input buffers for pipelines. However, these pipeline copies could be worse than using internal copies and copy-less pipe-lining.

This is all leaning towards librsync should provide the input buffer, and the client should provide the output buffer. Note this is the opposite of what this PR currently implements.

This PR's current "push callback" approach has the clients pass input buffers to a write callback for processing, which triggers calls to another write callback that is passed the output buffers. This has a nice elegance to it, as you drive the whole pipeline by writing into it. This also gives nice efficient writes to output from the internal input buffers, but results in costly copies from external to internal input buffers. The wins from this approach go away when we add compression.

Flipping it around would mean you have a read call back for sucking output data into buffers, which triggers calls to another read callback to suck input data directly into the internal buffers.

Another example to look at is the ztd's "bufferless streaming API";

https://github.com/facebook/zstd/blob/6b4a3e019f8eeb3423065f7b24d790358e8cbc59/lib/zstd.h#L1848

This requires the user to provide both input and output buffers, which is the same as the normal streaming API, but it keeps no internal buffer and always consumes the whole input buffer, returning how much data was written to the output buffer. This has strict requirements on the sizing of input and output buffers. I'm not sure if this option is feasible or any sort of improvement for librsync.

@dbaarda

dbaarda commented Sep 3, 2021

Copy link
Copy Markdown
Member Author

More thoughts on API's;

The zstd "bufferless streaming API" works by effectively processing and flushing any partial blocks in the input buffer. This means the size of your input buffer is crucial for efficient compression, since fragmenting blocks wastes compression potential.

We could do the same thing with librsync, and "flush" at the end of every input buffer. This would mean that the last block of every input buffer would always be flushed as a "miss" or "match", missing out on any potential matches at offsets within that block or opportunities for extending that miss/match command with adjacent miss/match data. Provided the input buffers were large enough this might not be too bad.

As a hybrid that avoids this, we require the input buffer contains more than one whole block, and we always leave upto 1 block of tail data behind in the input buffer that needs to be shuffled to the start of the input buffer. This means we must flush the data before the last block, sacrificing extending the miss/match if it continues, but at least we don't sacrifice potential matches within the block. This is basically shifting the responsibility for shuffling data in the input buffer to the client, with a small cost in the encoding efficiency at input buffer boundaries.

Note that it also means the output result is dependent on the input buffer sizes used, not just the data contents.

@dbaarda

dbaarda commented Sep 9, 2021

Copy link
Copy Markdown
Member Author

There is another hybrid option between these; if the input buffer contains sufficient data to make progress use it and leave the tail behind for the client to shuffle into the next input buffer, otherwise start accumulating data into the internal buffer.

This means we can avoid copying into the internal buffer if the client gives large enough input buffers, but still work if the client gives small buffers. This is relatively easy to add to the current scoop implementation so I'm going to play with it in another PR.

@dbaarda

dbaarda commented Sep 16, 2021

Copy link
Copy Markdown
Member Author

In #234 I've implemented the idea of processing directly from the input stream buffer and only accumulating into the internal scoop buffer if there is not enough input data. This means we leave a "tail" of unprocessed data in the input buffer when we have directly processed the input data and made progress, but need more input to make further progress. We only start accumulating into the internal buffer if there is insufficient input data at the start of an iteration. This provides a measurable performance improvement of 5~20%.

One of the consequences is the callback api approach in this PR is going to trigger copies into the internal buffer before any attempt to flush or finalise the data. This is because the rs_send_t() callback either provides data OR sends a sync/done mark by setting len to a special negative value. This means that we need to finish sending the last tail of data before we can send the sync/done mark. Since this is always going to be the "tail" that can't be processed without more data, it WILL be copied into the internal buffer.

This might be a minor problem, since sending sync/done marks is probably rare, and after the sync is done future input will restart being processed from the input stream directly (since the internal buffer will be emptied by the sync). However, it is a little messy.

The alternative is to copy the zstd approach of having an additional argument to indicate if we want to flush/end after the input data is processed;

https://github.com/facebook/zstd/blob/6b4a3e019f8eeb3423065f7b24d790358e8cbc59/lib/zstd.h#L683

This enables us to provide data AND a sync/done mark at the same time, giving the processing the signal needed to process the tail of data without copying it into the internal buffer.

I'm going to play with extending this PR to do that after merging in #234.

@sourcefrog

Copy link
Copy Markdown
Contributor

That sounds cool! Let me know if you would like me to read this (or any PR) for a second pair of eyes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants