Performance improvement of IfcParse tokenizer#82
Conversation
…on operator (removed those from derived classes).
…r setting elements.
…check stored type.
| // Opens the file, gets the filesize and reads a chunk in memory | ||
| // | ||
| IfcSpfStream::IfcSpfStream(const std::string& fn) | ||
| : stream(0) |
There was a problem hiding this comment.
BTW the convention will be to use spaces instead of tabs, hence I've already started to use spaces in order to "keep the credit" of the line when all of these are converted at some point, #21
|
Good stuff! |
|
Thank you! |
|
Indeed many thanks for this, especially considering we all know this is not the best maintained part of the code base. I will have time for a closer look probably tomorrow. But indeed,
this is indeed the main reason why there are these
Any thoughts? We can defer a final decision to the 0.6 release and stick to this PR for now. Regarding the exceptions: 5 years ago many IFC files were semantically and syntactically invalid in various ways and we felt that a very permissive parser, one that would just keep churning along until some exception was thrown, was most desirable. Still at this time, many IFC files have their subtle issues, such as issues with cardinalities. I am not entirely sure how this relates to your proposal of an assertion macro. I do agree that either a strict parsing mode or an earlier indication of whether the file conforms to at least the subset of the schema you are interested in would be valuable and would also keep the code (especially the geometry interpretation which is full of try catch blocks for this reason) cleaner. |
|
Stripping all EOLs before parsing is trivial to do, but there is one problem. I also feel that newlines inside numbers is never generated and not supported by many programs. But a standard is a standard, unfortunately. A very long integer is indeed a highly unlikely case, but a very long float seems to be possible. As for writing custom atoi and atod equivalents, I was also considering this option. But it is quite a lot of nontrivial code to write and debug. Another option is to create a single std::string object, and use it as a temporary location for all the tokens (and shrink only when finished). It can be trivially achieved by making a static local variable in IfcParse::GeneralTokenPtr function. However, this would immediately break multithreaded parsing. This temporary buffer can be placed into some object like IfcSpfStream (or IfcFile), then it would work safely under condition that no IfcSpfStream object is used by several threads simultaneously. |
|
Yep, good point. This Of course there are good examples of Multi-threaded parsing seems quite far out to me still. I am also fine with static buffer that can grow. I am sure we can we can somehow instantiate different buffers per thread later if that ever becomes imminent. |
|
I have implemented the last way: added a _tempString field to IfcSpfLexer, which is used in several cases as local storage. So numbers can have unlimited length now. It also allows to handle keywords without allocations. P.S. I'm sorry for the EOL style: it seems that I have messed it in the first bunch of commits. |
|
Thanks again. I think that's a good approach. Now, I ran IfcConvert compiled with this PR and master and compared the perfomance. It's not really conclusive or anything as it is hard to eliminate e.g hard drive caching of the file. It does seem overall the performance increases nicely. IfcConvert is not the best executable to test this PR as most of the time is spent inside Open Cascade modules. Memory usages increases a bit too, which is not unexpected. I wonder if we can reduce it a bit though. One potential option would be to store aggregates not as separate tokens, but already aggregate them into vectors at the parsing phase. With this python script I isolated a bit more on the performance of the parser The differences are quite significant: I'll be ready to merge it in soon. |
|
With regards to the exception / assertion, I agree things were not ideal. However, by not throwing exception, invalid data will seep through the cracks now. For example:
I agree a check on construction is preferable, for example in the In case I don't hear back I will simply add the exceptions and otherwise merge as is. |
|
I agree with you. As for reducing memory consumed, I think implementing special treatment for tessellation entities should improve it a lot (along with performance). I'm not sure: what's the best place to discuss this topic? |
|
Ok, perfect, thanks. Best place to discuss would be to open an issue here on github and label it improvement, that way we have a reference to the discussion when it gets implemented. |
Add Log.h (in IfcViewerCore) — a tiny stream-style logger that backs fprintf(stderr,...), with overloads for the common primitives + char strings. Mimics qInfo()/qWarning()'s syntax surface enough that mass-replacing qInfo()→Log::info() and qWarning()→Log::warn() keeps existing call sites parsing unchanged; .noquote() / .nospace() exist as compat no-ops so chained qInfo().noquote()<<x<<y patterns survive. QString streaming is a transitional concern — the QString → std::string sweep (#80) hasn't landed yet, so ViewportWindow and friends still construct QStrings for log payloads. LogQt.h (in IfcViewer, not Core) adds the QString / QStringView operator<< overloads so those streaming sites work without source changes during the in-flight Qt removal. When #80 retires QString, LogQt.h drops out. ViewportWindow.cpp: 132 qInfo/qWarning callsites converted. The two printf-style qInfo("fmt %s", ...) callsites get fprintf with explicit [info]/[warn] prefixes to keep the output discoverable. Also de-Qt'd: AreaMeasurement.cpp — 1 qInfo("fmt", …) → fprintf SceneLoader.cpp — 4 qDebug + 1 qWarning printf-style → fprintf GeometryStreamer.cpp — 2 qDebug printf-style → fprintf ifcviewer-minimal/main.cpp — 2 qWarning << → Log::warn Drops <QDebug> from each. Closes #82. Builds: desktop / bonsai / web all green. Tests 100/100 pass.
Take QString out of ViewportWindow's outward-facing surface so it can
eventually move into a Qt-free ViewportCore:
void queueLoadSidecar(const QString&) → (const std::string&)
uint32_t loadSidecar(const QString&) → (const std::string&)
QString cameraString() const → std::string …
void captureNextFrameToPng(const QString&, bool)
→ (const std::string&, bool)
void setHudText(const QString&) → (const std::string&)
Internal members also moved off QString:
std::deque<QString> pending_sidecars_ → std::deque<std::string>
QString pending_screenshot_path_ → std::string
Implementation strategy: convert at the boundary where ViewportWindow
still leans on Qt internals — `loadSidecar` bridges to QString once
for QFile/QDir/QFileInfo path handling; the screenshot save path
constructs a QString locally for QImage::save; the OverlayRenderer's
HUD setter still takes QString so setHudText converts before calling
through. Each of those bridges goes away when ViewportCore lands and
OverlayRenderer / SceneLoader / SidecarBuilder get their own de-Qt
sweeps. cameraString now produces its CSV via snprintf — no QString
ever instantiated.
Bonsai-side updates (compile-only):
ifcviewer-minimal/main.cpp — queueLoadSidecar / captureNextFrameToPng
callers add .toStdString() on the QString
parser result
modules/viewport/View.cpp — setHudText callers add .toStdString() to
their `QString::arg(...)` formatter chains;
two `QString()` empty sentinels become
`std::string()`
Measurement.cpp — same pattern, two setHudText sites
ifcviewer/LengthMeasurement.cpp — same, three sites
The cameraString string-streaming fix-up in ViewportWindow.cpp drops
the temporary .toUtf8().constData() bridge from #82 — Log::Stream's
std::string overload now handles it directly.
Builds: desktop / bonsai / web all green. Tests 100/100. Closes #80.

In my opinion, most of the performance issues in IfcParse are due to too many dynamic memory allocations.
This PR is dedicated to removing memory allocations in tokenization process, at least for numbers and operators.
As a result, opening a 47Mb file in our IfcParse-based project now takes 9.5 seconds instead of 12.5 seconds. According to MSVC profiler, samples count of IfcParse.dll has reduced from 3600 to 2600. Built with MSVC2013 x64 with additional optimizations enabled.
In the new tokenization system, most of the information for each token is determined at the moment it is created, and it is saved within the Token struct itself. In particular:
For simple tokens, their values are determined on creation without any memory allocations. This allows to avoid costly creation of std::string-s in IfcSpfLexer::TokenString function.
There are several points I would like to discuss about these changes:
(1) According to STEP specification, EOL chars may appear in any tokens. So breaking a real number into two lines is considered a correct STEP. This is rather unpleasant for tokenization, because it is not possible to parse numbers with standard libraries "in-place" (unless all EOLs are removed even before tokenization).
In order to support such weird tokens, I copy each token into a 64-byte buffer on stack, removing all EOL in process (see function RemoveTokenSeparators). This allows to avoid dynamic allocation, but limits length of any integer or real number to 63 characters. I wonder if it is really a problem and if anyone knows a way to fix it (without dynamic allocation).
For example, the following valid numbers would screw tokenization process in the PR:
(2) IfcParse library lacks an assert macro.
Throwing exception is the right thing to do on ill-formatted inputs. But sometimes it is desirable to check for conditions which must be true by construction, in order to simplify debugging process. And throwing exception in a small getter function is a waste of program time and space. See functions asInt, asFloat and other for example.