Skip to content

Performance improvement of IfcParse tokenizer#82

Merged
aothms merged 15 commits into
IfcOpenShell:masterfrom
stgatilov:master
Jun 9, 2016
Merged

Performance improvement of IfcParse tokenizer#82
aothms merged 15 commits into
IfcOpenShell:masterfrom
stgatilov:master

Conversation

@stgatilov

Copy link
Copy Markdown
Contributor

In my opinion, most of the performance issues in IfcParse are due to too many dynamic memory allocations.
This PR is dedicated to removing memory allocations in tokenization process, at least for numbers and operators.

As a result, opening a 47Mb file in our IfcParse-based project now takes 9.5 seconds instead of 12.5 seconds. According to MSVC profiler, samples count of IfcParse.dll has reduced from 3600 to 2600. Built with MSVC2013 x64 with additional optimizations enabled.

In the new tokenization system, most of the information for each token is determined at the moment it is created, and it is saved within the Token struct itself. In particular:

  1. startPos - index of first char in token (was saved before).
  2. endPos - index of the after-the-last char of token (I thought it would be useful, but it is not actually used).
  3. type - operator/keyword/integer/bool/whatever.
  4. value - only for some types of tokens, stored in union.
    For simple tokens, their values are determined on creation without any memory allocations. This allows to avoid costly creation of std::string-s in IfcSpfLexer::TokenString function.

There are several points I would like to discuss about these changes:

(1) According to STEP specification, EOL chars may appear in any tokens. So breaking a real number into two lines is considered a correct STEP. This is rather unpleasant for tokenization, because it is not possible to parse numbers with standard libraries "in-place" (unless all EOLs are removed even before tokenization).
In order to support such weird tokens, I copy each token into a 64-byte buffer on stack, removing all EOL in process (see function RemoveTokenSeparators). This allows to avoid dynamic allocation, but limits length of any integer or real number to 63 characters. I wonder if it is really a problem and if anyone knows a way to fix it (without dynamic allocation).
For example, the following valid numbers would screw tokenization process in the PR:

  • integer: 00000...[repeat 70 times]...0000078
  • real: 14.1039...[write 70 digits]...23569

(2) IfcParse library lacks an assert macro.
Throwing exception is the right thing to do on ill-formatted inputs. But sometimes it is desirable to check for conditions which must be true by construction, in order to simplify debugging process. And throwing exception in a small getter function is a waste of program time and space. See functions asInt, asFloat and other for example.

Comment thread src/ifcparse/IfcParse.cpp
// Opens the file, gets the filesize and reads a chunk in memory
//
IfcSpfStream::IfcSpfStream(const std::string& fn)
: stream(0)

@Stinkfist0 Stinkfist0 Jun 2, 2016

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW the convention will be to use spaces instead of tabs, hence I've already started to use spaces in order to "keep the credit" of the line when all of these are converted at some point, #21

@Stinkfist0

Copy link
Copy Markdown
Contributor

Good stuff!

@stgatilov

Copy link
Copy Markdown
Contributor Author

Thank you!
I hope someone familiar with the original code would find time to review the changes.

@aothms

aothms commented Jun 2, 2016

Copy link
Copy Markdown
Member

Indeed many thanks for this, especially considering we all know this is not the best maintained part of the code base. I will have time for a closer look probably tomorrow.

But indeed,

EOL chars may appear in any tokens

this is indeed the main reason why there are these std::string allocations in the parser code base now. First of all, my impression is that random new lines within tokens will break many of the parsers out there. Also, numbers of more than 63 digits I find a bit of a pathological example. But how about the other option you mention, stripping all carriage returns and newlines before tokenization, you should be able to do this in-place without too much overhead am I right? Another monstrous option of course is to provide drop in replacements for ato_() that skip over newline characters there, surely there must be reliable sources for this we can adapt. So to summarize:

  • Allow newline to break parser
  • Use buffer on stack (This PR)
  • Use dynamic buffer (master)
  • Dynamic static buffer (beware multi-threading).
  • Pre-process to remove newlines (beware BUF_SIZE macro)
  • Alternative ato_() implementations that ignore newlines
  • More...?

Any thoughts? We can defer a final decision to the 0.6 release and stick to this PR for now.

Regarding the exceptions: 5 years ago many IFC files were semantically and syntactically invalid in various ways and we felt that a very permissive parser, one that would just keep churning along until some exception was thrown, was most desirable. Still at this time, many IFC files have their subtle issues, such as issues with cardinalities. I am not entirely sure how this relates to your proposal of an assertion macro. I do agree that either a strict parsing mode or an earlier indication of whether the file conforms to at least the subset of the schema you are interested in would be valuable and would also keep the code (especially the geometry interpretation which is full of try catch blocks for this reason) cleaner.

@stgatilov

Copy link
Copy Markdown
Contributor Author

Stripping all EOLs before parsing is trivial to do, but there is one problem.
It seems that there exists a streaming mode in IfcSpfStream, it can be enabled by defining a macro BUF_SIZE. And it is not possible to add input preprocessing with this feature enabled. Well, at least it is not easy to do: it is necessary to maintain correspondence between real positions in input file and virtual positions in the resulting preprocessed buffer.
Other from that, it should be simple and relatively fast. Moreover, it can perhaps simplify parsing even further, especially the TokenString function.
BTW, IfcPlusPlus strips all EOLs and comments before parsing.

I also feel that newlines inside numbers is never generated and not supported by many programs. But a standard is a standard, unfortunately. A very long integer is indeed a highly unlikely case, but a very long float seems to be possible.

As for writing custom atoi and atod equivalents, I was also considering this option. But it is quite a lot of nontrivial code to write and debug.

Another option is to create a single std::string object, and use it as a temporary location for all the tokens (and shrink only when finished). It can be trivially achieved by making a static local variable in IfcParse::GeneralTokenPtr function. However, this would immediately break multithreaded parsing. This temporary buffer can be placed into some object like IfcSpfStream (or IfcFile), then it would work safely under condition that no IfcSpfStream object is used by several threads simultaneously.

@aothms

aothms commented Jun 2, 2016

Copy link
Copy Markdown
Member

Yep, good point. This BUF_SIZE has not been used as the intermittent seeking was very detrimental to performance. However, with this pull request and the most prominent token values actually represented in memory, it might be worthwhile to reconsider this streaming parsing.

Of course there are good examples of atoi and atod, e.g [1] we don't necessarily need to reinvent the wheel. But I too am not very much in favor of this option.
[1] ftp://ftp.irisa.fr/pub/OpenBSD/src/sys/lib/libsa/strtol.c

Multi-threaded parsing seems quite far out to me still. I am also fine with static buffer that can grow. I am sure we can we can somehow instantiate different buffers per thread later if that ever becomes imminent.

@stgatilov

Copy link
Copy Markdown
Contributor Author

I have implemented the last way: added a _tempString field to IfcSpfLexer, which is used in several cases as local storage. So numbers can have unlimited length now. It also allows to handle keywords without allocations.

P.S. I'm sorry for the EOL style: it seems that I have messed it in the first bunch of commits.

@aothms aothms mentioned this pull request Jun 3, 2016
@aothms

aothms commented Jun 6, 2016

Copy link
Copy Markdown
Member

Thanks again. I think that's a good approach.

Now, I ran IfcConvert compiled with this PR and master and compared the perfomance. It's not really conclusive or anything as it is hard to eliminate e.g hard drive caching of the file. It does seem overall the performance increases nicely. IfcConvert is not the best executable to test this PR as most of the time is spent inside Open Cascade modules. Memory usages increases a bit too, which is not unexpected. I wonder if we can reduce it a bit though. One potential option would be to store aggregates not as separate tokens, but already aggregate them into vectors at the parsing phase.

clinic_mep_20110906_optimized ifc

With this python script I isolated a bit more on the performance of the parser

import sys, time, ifcopenshell
li = []
t0 = time.clock()
f = ifcopenshell.open(sys.argv[1])
t1 = time.clock()
for pt in f.by_type("IfcCartesianPoint"):
    li.append(pt.Coordinates)
t2 = time.clock()
print t2 - t1, t1 - t0

The differences are quite significant:

|        | t2-t1 | t1-t0 | 
| pr     | 6.7   | 12.4  | 
| master | 9.5   | 14.6  |

I'll be ready to merge it in soon.

@aothms

aothms commented Jun 9, 2016

Copy link
Copy Markdown
Member

With regards to the exception / assertion, I agree things were not ideal. However, by not throwing exception, invalid data will seep through the cracks now. For example:

#11=IFCGEOMETRICREPRESENTATIONCONTEXT('Plan','Model',3,.T.,#10,#6);
                                                        ^ Precision should be REAL

f.by_id(11).Precision will now happily return *(float*) (&0xcccccc01) (on a msvc debug build, otherwise 3 bytes of garbage). Therefore I vote for re-enabling the exceptions in the asFloat() and similar functions until we have something better, as that is what the rest of the codebase is currently expecting.

I agree a check on construction is preferable, for example in the IfcSchema::SchemaEntity() call. As that is were the untyped IfcParse instance is associated to a typed instance from the schema. If it fails that check, the instance would simply not be inserted in the maps of IfcFile. But that would be for a different PR I would say. The question then still remains what checks to do: attribute types, cardinalities or potentially even inverse cardinalities and where rules, the latter two being astronomically more complicated of course.

In case I don't hear back I will simply add the exceptions and otherwise merge as is.

@stgatilov

Copy link
Copy Markdown
Contributor Author

I agree with you.
I was under impression that caller always checks the type with isFloat function before calling asFloat. But it turns out that it is not so. Committed a small fix.

As for reducing memory consumed, I think implementing special treatment for tessellation entities should improve it a lot (along with performance). I'm not sure: what's the best place to discuss this topic?

@aothms

aothms commented Jun 9, 2016

Copy link
Copy Markdown
Member

Ok, perfect, thanks.

Best place to discuss would be to open an issue here on github and label it improvement, that way we have a reference to the discussion when it gets implemented.

This was referenced Jan 1, 2023
Moult added a commit that referenced this pull request Jun 5, 2026
Add Log.h (in IfcViewerCore) — a tiny stream-style logger that backs
fprintf(stderr,...), with overloads for the common primitives + char
strings. Mimics qInfo()/qWarning()'s syntax surface enough that
mass-replacing qInfo()→Log::info() and qWarning()→Log::warn() keeps
existing call sites parsing unchanged; .noquote() / .nospace() exist
as compat no-ops so chained qInfo().noquote()<<x<<y patterns survive.

QString streaming is a transitional concern — the QString → std::string
sweep (#80) hasn't landed yet, so ViewportWindow and friends still
construct QStrings for log payloads. LogQt.h (in IfcViewer, not Core)
adds the QString / QStringView operator<< overloads so those streaming
sites work without source changes during the in-flight Qt removal.
When #80 retires QString, LogQt.h drops out.

ViewportWindow.cpp: 132 qInfo/qWarning callsites converted. The two
printf-style qInfo("fmt %s", ...) callsites get fprintf with explicit
[info]/[warn] prefixes to keep the output discoverable.

Also de-Qt'd:
  AreaMeasurement.cpp  — 1 qInfo("fmt", …) → fprintf
  SceneLoader.cpp      — 4 qDebug + 1 qWarning printf-style → fprintf
  GeometryStreamer.cpp — 2 qDebug printf-style → fprintf
  ifcviewer-minimal/main.cpp — 2 qWarning << → Log::warn

Drops <QDebug> from each. Closes #82.

Builds: desktop / bonsai / web all green. Tests 100/100 pass.
Moult added a commit that referenced this pull request Jun 5, 2026
Take QString out of ViewportWindow's outward-facing surface so it can
eventually move into a Qt-free ViewportCore:

  void     queueLoadSidecar(const QString&)    →  (const std::string&)
  uint32_t loadSidecar(const QString&)         →  (const std::string&)
  QString  cameraString() const                →  std::string …
  void     captureNextFrameToPng(const QString&, bool)
                                               →  (const std::string&, bool)
  void     setHudText(const QString&)          →  (const std::string&)

Internal members also moved off QString:
  std::deque<QString> pending_sidecars_     →  std::deque<std::string>
  QString             pending_screenshot_path_ →  std::string

Implementation strategy: convert at the boundary where ViewportWindow
still leans on Qt internals — `loadSidecar` bridges to QString once
for QFile/QDir/QFileInfo path handling; the screenshot save path
constructs a QString locally for QImage::save; the OverlayRenderer's
HUD setter still takes QString so setHudText converts before calling
through. Each of those bridges goes away when ViewportCore lands and
OverlayRenderer / SceneLoader / SidecarBuilder get their own de-Qt
sweeps. cameraString now produces its CSV via snprintf — no QString
ever instantiated.

Bonsai-side updates (compile-only):
  ifcviewer-minimal/main.cpp — queueLoadSidecar / captureNextFrameToPng
                              callers add .toStdString() on the QString
                              parser result
  modules/viewport/View.cpp  — setHudText callers add .toStdString() to
                              their `QString::arg(...)` formatter chains;
                              two `QString()` empty sentinels become
                              `std::string()`
  Measurement.cpp            — same pattern, two setHudText sites
  ifcviewer/LengthMeasurement.cpp — same, three sites

The cameraString string-streaming fix-up in ViewportWindow.cpp drops
the temporary .toUtf8().constData() bridge from #82 — Log::Stream's
std::string overload now handles it directly.

Builds: desktop / bonsai / web all green. Tests 100/100. Closes #80.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants