Tag Parsing C++


Hello, my name is Thierry Miceli and I am a developer on the Visual C++ Compiler Front End team. Although our team is mostly known for writing and maintaining the part of the C++ compiler that analyzes your source code and builds an internal representation from it, a great deal of our effort in the last few years has been directed toward servicing the IDE and improving the IntelliSense experience (refreshers here, here, and here).

Today, I am going to write about a new parser that has been specifically created to provide a fast and scalable way to extract information from C++ source code. This parser is one of our new additions to Visual Studio 2010 and we call it the “tag parser”.

The tag parser is used in Visual Studio 2010 to populate the SQL database that supersedes the NCB file. All of the browsing features of VC++ rely in some way on results provided by the tag parser. These include Class View, Call Hierarchy, Go To Definition/Declaration, Get All References, Quick Search, the Navigation Bar and VCCodeModel.

A Fuzzy Parser

It is a fuzzy parser, which means that instead of trying to strictly recognize and validate the full C++ syntax (we have an excellent compiler front end to do that), it lazily matches an input stream of tokens against a set of patterns. This parser doesn’t populate a symbol table during parsing, has no notion of types apart from built-in ones, and doesn’t build a full macro context. Its unit of translation is a single file (i.e. it doesn’t follow #include directives). Nevertheless, the parser is able to deal with all of C++, C++/CLI and IDL.
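To make the idea of pattern matching without a symbol table concrete, here is a deliberately crude sketch. It is not the tag parser's implementation; the function names (Tokenize, MatchingClose, FindFunctionNames) are hypothetical, and the only pattern it knows is "identifier, parenthesized list, then ';' or '{'". Note that it never asks what UnknownType or OtherUnknown are.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

static bool IsIdentStart(char c) {
    return std::isalpha(static_cast<unsigned char>(c)) || c == '_';
}
static bool IsIdentChar(char c) {
    return std::isalnum(static_cast<unsigned char>(c)) || c == '_';
}

// Split source text into identifier tokens and single-character tokens.
std::vector<std::string> Tokenize(const std::string& src) {
    std::vector<std::string> tokens;
    std::size_t i = 0;
    while (i < src.size()) {
        if (std::isspace(static_cast<unsigned char>(src[i]))) { ++i; continue; }
        if (IsIdentStart(src[i])) {
            std::size_t start = i;
            while (i < src.size() && IsIdentChar(src[i])) ++i;
            tokens.push_back(src.substr(start, i - start));
        } else {
            tokens.push_back(std::string(1, src[i]));
            ++i;
        }
    }
    return tokens;
}

// Index of the closer matching the opener at 'open', or t.size() if unclosed.
static std::size_t MatchingClose(const std::vector<std::string>& t, std::size_t open,
                                 const std::string& opener, const std::string& closer) {
    int depth = 0;
    for (std::size_t j = open; j < t.size(); ++j) {
        if (t[j] == opener) ++depth;
        else if (t[j] == closer && --depth == 0) return j;
    }
    return t.size();
}

// Report names matching: identifier '(' ... ')' followed by ';' or '{'.
// No types are resolved, so unknown type names do not prevent a match.
std::vector<std::string> FindFunctionNames(const std::string& src) {
    const std::vector<std::string> tokens = Tokenize(src);
    std::vector<std::string> names;
    std::size_t i = 0;
    while (i < tokens.size()) {
        if (IsIdentStart(tokens[i][0]) && i + 1 < tokens.size() && tokens[i + 1] == "(") {
            std::size_t close = MatchingClose(tokens, i + 1, "(", ")");
            if (close + 1 < tokens.size() &&
                (tokens[close + 1] == ";" || tokens[close + 1] == "{")) {
                names.push_back(tokens[i]);
                // Skip a function body so calls inside it are not reported.
                i = (tokens[close + 1] == "{")
                        ? MatchingClose(tokens, close + 1, "{", "}") + 1
                        : close + 2;
                continue;
            }
        }
        ++i;
    }
    return names;
}
```

Given "UnknownType Frobnicate(OtherUnknown x); int Add(int a, int b) { return Helper(a) + b; }", this sketch reports Frobnicate and Add. A real fuzzy parser needs far more patterns (initializers, templates, operators and so on); the point here is only that a declaration can be located without knowing what its identifiers mean.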

High level of tolerance to incomplete code and errors.

The tag parser doesn’t try to make sense of every symbol or identifier in the source code. It is satisfied with being able to recognize the different parts of a declaration and their positions in the source file. If a name in the type specification of a declaration cannot be resolved by our C++ compiler, this does not prevent the tag parser from recognizing the declaration, and it will still show up in Class View.

The tag parser is somewhat analogous to a human reader of the source code who looks at a single declaration without knowing much about the rest of the project. He may not know what most identifiers actually represent, but he can tell with a high level of confidence what the declaration is and locate its subparts.

In addition to the tolerance of ‘semantically’ incorrect code, which is a property of fuzzy parsers, the tag parser has heuristic-based error recovery for the most common causes of erroneous code during editing. For example, it will try to detect incomplete declarations or unclosed function bodies.

Dealing with preprocessor conditional directives.

The tag parser’s main role is to extract information from the source code that is then consumed by the IDE browsing features. Because browsing features closely relate to the editing experience, it is more useful for the tag parser to generate a structured representation of the full source code as it appears in the editor than a representation of the code that would get compiled under a specific project configuration.

The tag parser deals with preprocessor conditional directives (#if, #ifdef, #ifndef, #else, #elif, #endif) in a special way. It incorporates the full code in each of the branches of preprocessor conditional directives but still only parses complete declarations. For example, when each branch of a conditional contains its own function declaration, both the inactive and active branches are parsed and Class View shows both function declarations.
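The screenshot that originally illustrated this case is no longer part of the post. As a stand-in, a minimal case might look like the following; TAGDEMO_VERBOSE and both function names are hypothetical and exist only for this illustration.

```cpp
// TAGDEMO_VERBOSE is a hypothetical macro that is left undefined here.
#ifdef TAGDEMO_VERBOSE
int VerboseLogLevel();   // inactive branch: the compiler never sees this
#else
int QuietLogLevel();     // active branch
#endif

// Definition for the branch that is active in this sketch.
int QuietLogLevel() { return 0; }
```

The compiler only sees QuietLogLevel, but according to the behavior described above, both VerboseLogLevel and QuietLogLevel would be reported by the tag parser and appear in Class View.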

The tag parser is also able to deal with more complex cases where a declaration is interrupted by one or more preprocessor conditional directives. For example, when the header of a function differs between the two branches of a conditional, both of the declarations that the branches can produce are parsed and reported.
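Again, the original illustration is gone, so here is a hypothetical example of a declaration that is interrupted by conditional directives. TAGDEMO_WIDE and CountChars are names made up for this sketch.

```cpp
// TAGDEMO_WIDE is a hypothetical macro that is left undefined here.
// The function header is split across the two branches; the body is shared.
#ifdef TAGDEMO_WIDE
int CountChars(const wchar_t* text)
#else
int CountChars(const char* text)
#endif
{
    int n = 0;
    while (text[n]) ++n;   // count characters up to the terminator
    return n;
}
```

A compiler sees exactly one of the two headers. Per the description above, the tag parser would report both CountChars(const wchar_t*) and CountChars(const char*), since each branch combined with the shared body forms a complete declaration.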

Faster and scalable

Tag parsing scales because it is incremental – it doesn’t need to re-parse hundreds (or thousands) of compilation units after a header file is changed, as is often the case in an actual build.  It is also faster than a full compiler (despite its heuristics) because it is not burdened by macro expansion and full semantic resolution.  Thus it is well suited to capture real-time information for even the largest projects.

No built-in semantic resolution

Since the tag parser operates strictly on a per-file basis, certain semantic resolutions are left to its clients. For example, since function declarations and definitions typically appear in separate files, the tag parser reports a function declaration and its definition separately without any binding information. Therefore Class View has to match a function declaration with its definition so that they appear as a single entry in the Class View tree.
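As a rough sketch of what such client-side matching could look like, the code below groups unbound records by qualified name and signature. The TagRecord and BoundEntry shapes are assumptions made for this illustration; the post does not describe the actual database schema or how Class View performs the match.

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical record shape, invented for this sketch.
struct TagRecord {
    std::string qualified_name;  // e.g. "Widget::Draw"
    std::string signature;       // e.g. "(int)"
    std::string file;            // file in which the record was found
    bool is_definition;          // false for a plain declaration
};

// One entry as a client such as a class browser might want to display it.
struct BoundEntry {
    std::string declaration_file;
    std::string definition_file;
};

// Pair each declaration with its definition using name + signature as the key.
std::map<std::string, BoundEntry> BindDeclarations(const std::vector<TagRecord>& records) {
    std::map<std::string, BoundEntry> entries;
    for (const TagRecord& r : records) {
        BoundEntry& entry = entries[r.qualified_name + r.signature];
        if (r.is_definition) {
            entry.definition_file = r.file;
        } else {
            entry.declaration_file = r.file;
        }
    }
    return entries;
}
```

A declaration of Widget::Draw(int) found in widget.h and a definition found in widget.cpp would end up in a single entry, while a function with no definition keeps an empty definition_file.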

The tag parser is lightweight, and this comes with some responsibilities on the side of the consumers of the parser results. The good thing here is that clients only incur the cost of building the semantic knowledge that they need, and they can now dig into the data with SQL.

Hint Files

We tried to make the tag parser as standalone as possible. It doesn’t need to know about any kind of project configuration (include paths, compiler switches, etc.). In many cases the tag parser could be invoked with a source file name as its only argument and it would do an excellent job of extracting detailed information about the code in this file. The only caveat is preprocessor macros that interfere with the C++ syntax so badly that fuzzy parsing and error recovery heuristics cannot make sense of the code. One example of such a macro is STDMETHOD. When expanded, it generates a member function signature from something like:

STDMETHOD(OnDocWindowActivate)(BOOL fActivate)

 

You’ll have a hard time guessing what the above line means if you don’t know what STDMETHOD is. Since the tag parser doesn’t follow #include directives and doesn’t perform SQL lookups into the symbol database*, it cannot discover macro definitions by itself. Nevertheless, its macro state can be preconfigured with what we call a ‘hint file’. A hint file simply contains the definitions of macros that are needed for the tag parser to correctly recognize your source code in the presence of macros that fundamentally interfere with the C++ syntax.
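To show why a single macro definition is enough to make that line readable, here is a self-contained sketch. The HRESULT, BOOL and STDMETHODCALLTYPE lines are stand-ins so the example compiles without the Windows SDK, the STDMETHOD definition is a simplified form rather than a copy of the SDK header, and DocWindowSink is a hypothetical class.

```cpp
// Stand-ins so the sketch compiles without the Windows SDK (assumptions).
typedef long HRESULT;
typedef int BOOL;
#define STDMETHODCALLTYPE

// The kind of line a hint file would carry: a simplified STDMETHOD definition.
#define STDMETHOD(method) virtual HRESULT STDMETHODCALLTYPE method

// Hypothetical class using the macro exactly as in the line quoted above.
struct DocWindowSink {
    virtual ~DocWindowSink() {}
    // With the macro known, this reads as:
    //   virtual HRESULT OnDocWindowActivate(BOOL fActivate)
    STDMETHOD(OnDocWindowActivate)(BOOL fActivate) { return fActivate ? 0 : 1; }
};
```

Without the #define, "STDMETHOD(OnDocWindowActivate)(BOOL fActivate)" looks like two parenthesized groups after an unknown name; with it, the line is an ordinary virtual member function declaration.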

If you have Beta1 installed, you will find a “cpp.hint” file in your Visual Studio 2010 install directory under vc\vcpackages. This is the hint file for the VC and SDK library headers. Very often the tag parser will do just fine with only this preset hint file. Nevertheless, if your code or some third-party library code you are using contains macros that tamper with the C++ syntax, you may need to set up your own hint file. The IDE will look for files named “cpp.hint” in the directory where your source files are located and in all the parent directories up to the root directory or until a file named “cpp.stop” is found. All the hint files that are found will be preprocessed to build the macro context before your files actually get parsed. I won’t go into more detail about hint files for now, but feel free to ask questions and, by the way, they will be thoroughly documented on MSDN.

Don’t worry too much if this machinery seems complex. Most of the time you won’t have to define your own hint files, or you’ll just need to drop a “cpp.hint” file with a few macro definitions in your project or solution directory.

In the future we are planning to work on tools that will help you decide where hint files are needed and possibly generate them for you. And we will also work on making the tag parser act smarter in the presence of macros so that fewer hints need to be added to a hint file.

*In theory the tag parser could query the database for macro definitions when additional information is needed to recognize or disambiguate a declaration, but a reliable implementation of symbol lookups (even if it were only for macros) would push the tag parser in the opposite direction from being lightweight, standalone, incremental and independent of project configurations.

This post was updated on March 25 2013 to remove broken links to missing images, and references in the blog post to those images.

  • Two questions:

    1) Is the tag parser used to provide information for the listed tools all the time, or only until the full, faithful parsing pass completes?

    2) Is it actually possible to use the tag parser as a standalone console application (i.e. run it on a file, and then work with produced tag info in some way)?

  • 1) The tag parser stores the information it extracts from the source code in a SQL database.

    When source files are modified, these files are reparsed and the database is updated (this is done in real time). The tools fetch the information they need from the database.

    2) The tag parser is used internally for applications other than Visual Studio. We are considering publishing it externally but we cannot promise anything for now. The browsing database is a regular SQL CE database though. For some applications it may be preferable to access the database directly.

  • No intellisense for C++/CLI? What is Microsoft doing?

  • I may be misunderstanding the post, but my fear is that code browsing will still be inaccurate with this new functionality.

  • uus831: They are prioritizing. Not enough time to please everyone.

    Perhaps you can join the Microsoft team and implement it for C++/CLI? :D

  • It is true that tag parsing gives up some amount of correctness for speed; but this inaccurate (actually, better to call it "incomplete" or "unverified") information can subsequently be validated by the same intellisense parser that provides Quick Info or Parameter Help.

    As a case-study, we can examine the browse case of "goto definition." First, the browse database (previously populated with the tag parser) is queried for all possible matching declarations, and then the intellisense parser is invoked on these potential matches to find the one that exactly matches the original request.

    This combination of two parsers enables us to not only provide results as fast as in previous releases (regardless of solution size), but with far greater accuracy.

    Andy Rich

    Visual C++ QA

  • @Aleria "No time to please everyone?"

    They had intellisense before in VS2005/2008, and they completely removed it! It seems Microsoft is killing C++/CLI!

  • @uus831: scan the vcblog to find a statement that quite some code at Microsoft is in C++/CLI as they need lots of native/managed glue. So Microsoft has enough people to speak up for C++/CLI internally. No need to fear its death.

  • If you're a C++/CLI user, don't migrate your project to VS 2010 until Intellisense is implemented. 10 may be the new 6, but C++/CLI is the new Vista :-D.

  • @uus831, IntelliSense has a completely new engine in VS2010. It gives significantly better results for native code (it can handle Boost.Lambda, and that's very impressive!), but consequently they had no time to add support for C++/CLI to it.

    And yes, there's quite a bit of C++/CLI code within Microsoft - not just glue, but it's the only way to unit test C++ code with MSTest currently...

  • About intellisense support for C++/CLI, this topic has already been discussed in this post:

    http://blogs.msdn.com/vcblog/archive/2009/05/27/rebuilding-intellisense.aspx

    Please make sure to read Boris Jabes's answer: http://blogs.msdn.com/vcblog/archive/2009/05/27/rebuilding-intellisense.aspx#9656003

    We definitely didn't drop the ball on C++/CLI. We completely rebuilt intellisense and haven't had the time to add support for C++/CLI.

    Thanks,

    Thierry

  • I guess I will stick with Visual Assist!

  • Visual Assist guys will also get some benefit from this new tag parser, I hope.

  • Will this solve the problem (that I am seeing all the time with native C++ code in VS2008) where the intellisense and "goto identifier" don't always update and end up pointing at old or non-existent locations? (In some cases deleting all the temporary files, including the ncb file, and doing a full rebuild is the only way to solve it.)

  • Jonathan,

    The generation of the data used for browsing and intellisense is incremental most of the time. So the delay between your last edits and the moment when the data will be up to date should be minimal.

    You'll probably also find intellisense to be more robust in general. This is a complete rearchitecture, so you'll see a lot of improvements.

    Note that you may find the new "Navigate To" feature handier than "Goto Definition/Declaration"; it is really fast and shows a preview.

    Thanks,

    Thierry Miceli
