Performance improvement of IfcParse tokenizer by stgatilov · Pull Request #82 · IfcOpenShell/IfcOpenShell

stgatilov · 2016-06-02T08:53:55Z

In my opinion, most of the performance issues in IfcParse are due to too many dynamic memory allocations.
This PR is dedicated to removing memory allocations in tokenization process, at least for numbers and operators.

As a result, opening a 47Mb file in our IfcParse-based project now takes 9.5 seconds instead of 12.5 seconds. According to MSVC profiler, samples count of IfcParse.dll has reduced from 3600 to 2600. Built with MSVC2013 x64 with additional optimizations enabled.

In the new tokenization system, most of the information for each token is determined at the moment it is created, and it is saved within the Token struct itself. In particular:

startPos - index of first char in token (was saved before).
endPos - index of the after-the-last char of token (I thought it would be useful, but it is not actually used).
type - operator/keyword/integer/bool/whatever.
value - only for some types of tokens, stored in union.
For simple tokens, their values are determined on creation without any memory allocations. This allows to avoid costly creation of std::string-s in IfcSpfLexer::TokenString function.

There are several points I would like to discuss about these changes:

(1) According to STEP specification, EOL chars may appear in any tokens. So breaking a real number into two lines is considered a correct STEP. This is rather unpleasant for tokenization, because it is not possible to parse numbers with standard libraries "in-place" (unless all EOLs are removed even before tokenization).
In order to support such weird tokens, I copy each token into a 64-byte buffer on stack, removing all EOL in process (see function RemoveTokenSeparators). This allows to avoid dynamic allocation, but limits length of any integer or real number to 63 characters. I wonder if it is really a problem and if anyone knows a way to fix it (without dynamic allocation).
For example, the following valid numbers would screw tokenization process in the PR:

integer: 00000...[repeat 70 times]...0000078
real: 14.1039...[write 70 digits]...23569

(2) IfcParse library lacks an assert macro.
Throwing exception is the right thing to do on ill-formatted inputs. But sometimes it is desirable to check for conditions which must be true by construction, in order to simplify debugging process. And throwing exception in a small getter function is a waste of program time and space. See functions asInt, asFloat and other for example.

…on operator (removed those from derived classes).

…r setting elements.

…air.

…check stored type.

Stinkfist0 · 2016-06-02T09:09:25Z

 // Opens the file, gets the filesize and reads a chunk in memory
 //
 IfcSpfStream::IfcSpfStream(const std::string& fn)
-    : stream(0)


BTW the convention will be to use spaces instead of tabs, hence I've already started to use spaces in order to "keep the credit" of the line when all of these are converted at some point, #21

Stinkfist0 · 2016-06-02T09:11:41Z

Good stuff!

stgatilov · 2016-06-02T10:30:42Z

Thank you!
I hope someone familiar with the original code would find time to review the changes.

aothms · 2016-06-02T12:09:55Z

Indeed many thanks for this, especially considering we all know this is not the best maintained part of the code base. I will have time for a closer look probably tomorrow.

But indeed,

EOL chars may appear in any tokens

this is indeed the main reason why there are these std::string allocations in the parser code base now. First of all, my impression is that random new lines within tokens will break many of the parsers out there. Also, numbers of more than 63 digits I find a bit of a pathological example. But how about the other option you mention, stripping all carriage returns and newlines before tokenization, you should be able to do this in-place without too much overhead am I right? Another monstrous option of course is to provide drop in replacements for ato_() that skip over newline characters there, surely there must be reliable sources for this we can adapt. So to summarize:

Allow newline to break parser
Use buffer on stack (This PR)
Use dynamic buffer (master)
Dynamic static buffer (beware multi-threading).
Pre-process to remove newlines (beware BUF_SIZE macro)
Alternative ato_() implementations that ignore newlines
More...?

Any thoughts? We can defer a final decision to the 0.6 release and stick to this PR for now.

Regarding the exceptions: 5 years ago many IFC files were semantically and syntactically invalid in various ways and we felt that a very permissive parser, one that would just keep churning along until some exception was thrown, was most desirable. Still at this time, many IFC files have their subtle issues, such as issues with cardinalities. I am not entirely sure how this relates to your proposal of an assertion macro. I do agree that either a strict parsing mode or an earlier indication of whether the file conforms to at least the subset of the schema you are interested in would be valuable and would also keep the code (especially the geometry interpretation which is full of try catch blocks for this reason) cleaner.

stgatilov · 2016-06-02T12:37:20Z

Stripping all EOLs before parsing is trivial to do, but there is one problem.
It seems that there exists a streaming mode in IfcSpfStream, it can be enabled by defining a macro BUF_SIZE. And it is not possible to add input preprocessing with this feature enabled. Well, at least it is not easy to do: it is necessary to maintain correspondence between real positions in input file and virtual positions in the resulting preprocessed buffer.
Other from that, it should be simple and relatively fast. Moreover, it can perhaps simplify parsing even further, especially the TokenString function.
BTW, IfcPlusPlus strips all EOLs and comments before parsing.

I also feel that newlines inside numbers is never generated and not supported by many programs. But a standard is a standard, unfortunately. A very long integer is indeed a highly unlikely case, but a very long float seems to be possible.

As for writing custom atoi and atod equivalents, I was also considering this option. But it is quite a lot of nontrivial code to write and debug.

Another option is to create a single std::string object, and use it as a temporary location for all the tokens (and shrink only when finished). It can be trivially achieved by making a static local variable in IfcParse::GeneralTokenPtr function. However, this would immediately break multithreaded parsing. This temporary buffer can be placed into some object like IfcSpfStream (or IfcFile), then it would work safely under condition that no IfcSpfStream object is used by several threads simultaneously.

aothms · 2016-06-02T13:13:34Z

Yep, good point. This BUF_SIZE has not been used as the intermittent seeking was very detrimental to performance. However, with this pull request and the most prominent token values actually represented in memory, it might be worthwhile to reconsider this streaming parsing.

Of course there are good examples of atoi and atod, e.g [1] we don't necessarily need to reinvent the wheel. But I too am not very much in favor of this option.
[1] ftp://ftp.irisa.fr/pub/OpenBSD/src/sys/lib/libsa/strtol.c

Multi-threaded parsing seems quite far out to me still. I am also fine with static buffer that can grow. I am sure we can we can somehow instantiate different buffers per thread later if that ever becomes imminent.

stgatilov · 2016-06-03T03:27:36Z

I have implemented the last way: added a _tempString field to IfcSpfLexer, which is used in several cases as local storage. So numbers can have unlimited length now. It also allows to handle keywords without allocations.

P.S. I'm sorry for the EOL style: it seems that I have messed it in the first bunch of commits.

aothms · 2016-06-06T19:07:27Z

Thanks again. I think that's a good approach.

Now, I ran IfcConvert compiled with this PR and master and compared the perfomance. It's not really conclusive or anything as it is hard to eliminate e.g hard drive caching of the file. It does seem overall the performance increases nicely. IfcConvert is not the best executable to test this PR as most of the time is spent inside Open Cascade modules. Memory usages increases a bit too, which is not unexpected. I wonder if we can reduce it a bit though. One potential option would be to store aggregates not as separate tokens, but already aggregate them into vectors at the parsing phase.

With this python script I isolated a bit more on the performance of the parser

import sys, time, ifcopenshell
li = []
t0 = time.clock()
f = ifcopenshell.open(sys.argv[1])
t1 = time.clock()
for pt in f.by_type("IfcCartesianPoint"):
    li.append(pt.Coordinates)
t2 = time.clock()
print t2 - t1, t1 - t0

The differences are quite significant:

|        | t2-t1 | t1-t0 | 
| pr     | 6.7   | 12.4  | 
| master | 9.5   | 14.6  |

I'll be ready to merge it in soon.

aothms · 2016-06-09T12:56:57Z

With regards to the exception / assertion, I agree things were not ideal. However, by not throwing exception, invalid data will seep through the cracks now. For example:

#11=IFCGEOMETRICREPRESENTATIONCONTEXT('Plan','Model',3,.T.,#10,#6);
                                                        ^ Precision should be REAL

f.by_id(11).Precision will now happily return *(float*) (&0xcccccc01) (on a msvc debug build, otherwise 3 bytes of garbage). Therefore I vote for re-enabling the exceptions in the asFloat() and similar functions until we have something better, as that is what the rest of the codebase is currently expecting.

I agree a check on construction is preferable, for example in the IfcSchema::SchemaEntity() call. As that is were the untyped IfcParse instance is associated to a typed instance from the schema. If it fails that check, the instance would simply not be inserted in the maps of IfcFile. But that would be for a different PR I would say. The question then still remains what checks to do: attribute types, cardinalities or potentially even inverse cardinalities and where rules, the latter two being astronomically more complicated of course.

In case I don't hear back I will simply add the exceptions and otherwise merge as is.

stgatilov · 2016-06-09T15:52:06Z

I agree with you.
I was under impression that caller always checks the type with isFloat function before calling asFloat. But it turns out that it is not so. Committed a small fix.

As for reducing memory consumed, I think implementing special treatment for tessellation entities should improve it a lot (along with performance). I'm not sure: what's the best place to discuss this topic?

aothms · 2016-06-09T16:22:33Z

Ok, perfect, thanks.

Best place to discuss would be to open an issue here on github and label it improvement, that way we have a reference to the discussion when it gets implemented.

Add Log.h (in IfcViewerCore) — a tiny stream-style logger that backs fprintf(stderr,...), with overloads for the common primitives + char strings. Mimics qInfo()/qWarning()'s syntax surface enough that mass-replacing qInfo()→Log::info() and qWarning()→Log::warn() keeps existing call sites parsing unchanged; .noquote() / .nospace() exist as compat no-ops so chained qInfo().noquote()<<x<<y patterns survive. QString streaming is a transitional concern — the QString → std::string sweep (#80) hasn't landed yet, so ViewportWindow and friends still construct QStrings for log payloads. LogQt.h (in IfcViewer, not Core) adds the QString / QStringView operator<< overloads so those streaming sites work without source changes during the in-flight Qt removal. When #80 retires QString, LogQt.h drops out. ViewportWindow.cpp: 132 qInfo/qWarning callsites converted. The two printf-style qInfo("fmt %s", ...) callsites get fprintf with explicit [info]/[warn] prefixes to keep the output discoverable. Also de-Qt'd: AreaMeasurement.cpp — 1 qInfo("fmt", …) → fprintf SceneLoader.cpp — 4 qDebug + 1 qWarning printf-style → fprintf GeometryStreamer.cpp — 2 qDebug printf-style → fprintf ifcviewer-minimal/main.cpp — 2 qWarning << → Log::warn Drops <QDebug> from each. Closes #82. Builds: desktop / bonsai / web all green. Tests 100/100 pass.

Take QString out of ViewportWindow's outward-facing surface so it can eventually move into a Qt-free ViewportCore: void queueLoadSidecar(const QString&) → (const std::string&) uint32_t loadSidecar(const QString&) → (const std::string&) QString cameraString() const → std::string … void captureNextFrameToPng(const QString&, bool) → (const std::string&, bool) void setHudText(const QString&) → (const std::string&) Internal members also moved off QString: std::deque<QString> pending_sidecars_ → std::deque<std::string> QString pending_screenshot_path_ → std::string Implementation strategy: convert at the boundary where ViewportWindow still leans on Qt internals — `loadSidecar` bridges to QString once for QFile/QDir/QFileInfo path handling; the screenshot save path constructs a QString locally for QImage::save; the OverlayRenderer's HUD setter still takes QString so setHudText converts before calling through. Each of those bridges goes away when ViewportCore lands and OverlayRenderer / SceneLoader / SidecarBuilder get their own de-Qt sweeps. cameraString now produces its CSV via snprintf — no QString ever instantiated. Bonsai-side updates (compile-only): ifcviewer-minimal/main.cpp — queueLoadSidecar / captureNextFrameToPng callers add .toStdString() on the QString parser result modules/viewport/View.cpp — setHudText callers add .toStdString() to their `QString::arg(...)` formatter chains; two `QString()` empty sentinels become `std::string()` Measurement.cpp — same pattern, two setHudText sites ifcviewer/LengthMeasurement.cpp — same, three sites The cameraString string-streaming fix-up in ViewportWindow.cpp drops the temporary .toUtf8().constData() bridge from #82 — Log::Stream's std::string overload now handles it directly. Builds: desktop / bonsai / web all green. Tests 100/100. Closes #80.

stgatilov added 11 commits June 2, 2016 13:18

Applied MSVC tabify to IfcParse.cpp/h.

2b4d804

Now Argument base class has throwing implementation for each conversi…

6dd0c48

…on operator (removed those from derived classes).

Added NullArgument class for the case when null arguments appear afte…

c438273

…r setting elements.

Now operator tokens are handled just like any other token.

dc688bd

Refactoring: Token is now a struct with good fields, not a nameless p…

42e6eb3

…air.

Now end position of token is set at its construction.

6760beb

Now type of token is determined on construction, as* function simply …

b44547c

…check stored type.

Now int-s, float-s and some etc are parsed when token is created.

a5671f7

Implemented simple way to remove spaces in parsing.

d14ab10

Tokens are now created without allocations.

a1a5fd7

Call RemoveTokenSeparators once for every token.

9b9e253

Stinkfist0 reviewed Jun 2, 2016
View reviewed changes

stgatilov added 3 commits June 2, 2016 23:09

Added temporary std::string object in lexer (for optimization purposes).

9e5e95a

Reimplemented asString without allocations.

1787167

Parsing keywords without allocations.

b128b2e

aothms mentioned this pull request Jun 3, 2016

Fix #76 #79

Closed

Added checks for type to tokenFunc::as*** functions.

35123f3

aothms merged commit 8725c66 into IfcOpenShell:master Jun 9, 2016

stgatilov mentioned this pull request Jun 20, 2016

Logical vs Boolean: ".U." token breaks parsing #94

Closed

aothms mentioned this pull request Sep 26, 2016

build-all.sh generated larger executable and it cost much longer to convert to OBJ #137

Closed

This was referenced Jan 1, 2023

Brep solid from IfcFace #2669

Closed

About surface boolean #2668

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Performance improvement of IfcParse tokenizer#82

Performance improvement of IfcParse tokenizer#82
aothms merged 15 commits into
IfcOpenShell:masterfrom
stgatilov:master

stgatilov commented Jun 2, 2016

Uh oh!

Stinkfist0 Jun 2, 2016 •

edited

Loading

Uh oh!

Stinkfist0 commented Jun 2, 2016

Uh oh!

stgatilov commented Jun 2, 2016

Uh oh!

aothms commented Jun 2, 2016 •

edited

Loading

Uh oh!

stgatilov commented Jun 2, 2016

Uh oh!

aothms commented Jun 2, 2016

Uh oh!

stgatilov commented Jun 3, 2016

Uh oh!

aothms commented Jun 6, 2016

Uh oh!

aothms commented Jun 9, 2016

Uh oh!

stgatilov commented Jun 9, 2016

Uh oh!

aothms commented Jun 9, 2016

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Uh oh!

Conversation

stgatilov commented Jun 2, 2016

Uh oh!

Stinkfist0 Jun 2, 2016 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Stinkfist0 commented Jun 2, 2016

Uh oh!

stgatilov commented Jun 2, 2016

Uh oh!

aothms commented Jun 2, 2016 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

stgatilov commented Jun 2, 2016

Uh oh!

aothms commented Jun 2, 2016

Uh oh!

stgatilov commented Jun 3, 2016

Uh oh!

aothms commented Jun 6, 2016

Uh oh!

aothms commented Jun 9, 2016

Uh oh!

stgatilov commented Jun 9, 2016

Uh oh!

aothms commented Jun 9, 2016

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Stinkfist0 Jun 2, 2016 •

edited

Loading

aothms commented Jun 2, 2016 •

edited

Loading