Hartmut Maennel's Blog

  • A LINQ provider for RDF files - part 2

    For the simple Rdf queries like

    IQueryable<string> q = from x in rdf
                           from y in rdf
                           where rdf.A(germany, hasAdminDiv, x)
                              && rdf.A(x, isOfType, germanState)
                              && rdf.A(x, hasName, y)
                           select y.Val + "   [" + x.Val + "]";

     

    which we are going to support here, there is a “normal form” consisting of

    -         a set of variables, which denote resources or values in an RDF document – in the example above this is {x,y}.

    -         a set of constraint triples (subj, pred, obj) where subj, pred, obj are either variables or constants. This is the query condition – in the above example it is
    {(germany,hasAdminDiv,x),(x,isOfType,germanState),(x,hasName,y) }

    -         a “projection function” using these variables which denotes the value which we associate with each “row” – in the above example this is
    (x,y) => y.Val + "   [" + x.Val + "]"

     

    To execute such a query means finding all possible assignments of resources / values to the variables such that all resulting triples are in the axioms of the RDF file, and then applying the projection function to get a set of objects of a certain type (the return type of the projection function – in the above example this is string).
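    As an illustration of this execution model, here is a minimal Python sketch (not the C# code of the attached solution). It stores the normal form as a list of variables, a list of constraint triples and a projection function, and solves it by brute force. The toy triples, the Var class and the solve function are assumptions made up for this example; a real solver would use the constraints to prune the search instead of trying every assignment.

```python
from itertools import product

# Hypothetical toy data: a handful of (subject, predicate, object) triples.
TRIPLES = {
    ("germany", "hasAdminDiv", "by"),
    ("germany", "hasAdminDiv", "he"),
    ("by", "isOfType", "germanState"),
    ("he", "isOfType", "germanState"),
    ("by", "hasName", "Bayern"),
    ("he", "hasName", "Hessen"),
    ("france", "hasAdminDiv", "idf"),
    ("idf", "hasName", "Ile-de-France"),
}


class Var:
    """A query variable; anything that is not a Var counts as a constant."""

    def __init__(self, name):
        self.name = name


def solve(variables, constraints, projection, triples):
    """Try every assignment of values to the variables (brute force)."""
    # All values occurring in any position of any triple, in a fixed order.
    values = sorted({term for t in triples for term in t})
    results = []
    for combo in product(values, repeat=len(variables)):
        binding = dict(zip(variables, combo))

        def resolve(term):
            return binding[term] if isinstance(term, Var) else term

        # Keep the assignment only if every constraint becomes a known triple.
        if all(tuple(resolve(t) for t in c) in triples for c in constraints):
            results.append(projection(*combo))
    return results


x, y = Var("x"), Var("y")
rows = solve(
    [x, y],
    [("germany", "hasAdminDiv", x), (x, "isOfType", "germanState"), (x, "hasName", y)],
    lambda xv, yv: yv + "   [" + xv + "]",
    TRIPLES,
)
```

    With the toy data above, rows contains one string per German state, and the French division is filtered out because no triple says it is an administrative division of germany.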

     

    The compiler will treat the above query expression as syntactic sugar for an expression like:

     

    rdf.SelectMany(x => rdf.Where(y => Cond(x,y))
                           .Select(y => f(x,y))
                   )

    where Cond(x,y) is the condition involving rdf.A and f(x,y) is the function that assigns a string to each pair (x,y) of values in the Rdf document.
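    To see that the two shapes really describe the same computation, here is a small Python sketch over plain in-memory lists. The helpers select_many, where and select are hypothetical stand-ins for the LINQ operators, and the sample values, cond and f are made up for the example; nothing here touches IQueryable or expression trees.

```python
def select_many(source, f):
    """Flat-map: apply f to each element and concatenate the resulting sequences."""
    return [item for x in source for item in f(x)]


def where(source, pred):
    """Keep the elements for which pred holds."""
    return [x for x in source if pred(x)]


def select(source, f):
    """Apply f to every element."""
    return [f(x) for x in source]


values = [1, 2, 3]
cond = lambda x, y: x < y
f = lambda x, y: f"{x}<{y}"

# Query-expression style: two "from" clauses, one "where", one "select".
sugar = [f(x, y) for x in values for y in values if cond(x, y)]

# The desugared shape: SelectMany over an inner Where followed by Select.
desugared = select_many(
    values,
    lambda x: select(where(values, lambda y: cond(x, y)), lambda y: f(x, y)),
)
```

    Both sugar and desugared evaluate to the same list, in the same order.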

     

    The same query could be written in different forms. For example, replacing an expression

      rdf.Where(y => Cond1(x,y) && Cond2(x,y))

    by

      rdf.Where(y => Cond1(x,y)).Where(z => Cond2(x,z))

    should lead to the same normal form.

     

    So how do we get LINQ to translate these expressions to the above normal form?

     

    To get LINQ started, our Rdf type has to implement an IQueryable<T> interface, as System.Data.DLinq.Table<T> does. When we query a database table without conditions, we get the set of all rows in the table. The analogous notion for an RDF file (or several RDF files, or any set of Rdf triples) is the set of all “Values” in the RDF document, so we implement the interface IQueryable<Value> on Rdf.

     

    “Value” is the common base type of Literal (meaning a string occurring in an object position in an axiom) and Resource (given by a URI occurring in any position in any axiom).

    Since we usually do not really want to retrieve all values occurring in a document, it does not matter too much what exactly we get when we foreach over a document (e.g. all values or only the resources?). What is more important is the IQueryable part, since it means that the query operators Where, Select and SelectMany are now defined for Rdf.

     

    The basic observation is that we can now give the normal form of the query corresponding to an Rdf object (variables: {x}, constraints: {}, projection: x => x), and we can recursively determine the normal form of any query that is constructed out of these with the operators Where, Select and SelectMany.
    There is some fine print:

    1)      Variables and variable names:
    In rdf.Where(y => Cond1(x,y)).Where(z => Cond2(x,z)) the names y and z correspond to the same variable (which runs over the rdf at the beginning of this expression). We have to be careful to distinguish between variables (that the solver will assign to values) and named references to these variables (like “y” and “z” above).

    2)      Variables can be defined outside of a (sub)expression:
    In rdf.Where(y => Cond1(x,y)) the variable x is defined in an enclosing scope. When we translate a (sub)expression, we always have to give the list of variables in the enclosing scope as a parameter.

    3)      Some restrictions apply:
    - We only deal with Where, Select and SelectMany when they are applied to an Rdf query with the identity projection function, i.e. the output is given by a variable and is a sequence of objects of type Value (and not, e.g., a sequence of strings).
    - The conditions in the Where clause are only of the form Rdf.A(?,?,?) (or, as in the example above, several of these combined with &&), the predicate is always given as a constant, and at least one of the entries is a variable.


    With these caveats, here is what this recursive algorithm does:

    -         Where:
    Source.Where(v => Cond(v)):
    Translate the query expression Source. Assume the output of Source is a variable. Make the name v point to the same variable, translate the condition and add the result to the list of constraints.
    The output variable of the new query expression is the same as for Source.

    -         SelectMany:
    Source.SelectMany(v => Seq(v)):
    Translate the query expression Source. Assume the output of Source is given by a variable. Make the name v point to the same variable. Add the variables and constraints of Source and Seq together. The projection function of the result is the projection function of Seq.

    -         Select:
    Source.Select(v=>f(v)):
    Translate the query expression Source. Assume the output of Source is given by a variable. Make the name v point to the same variable. Determine all parameters occurring in f, build a Lambda expression (v1,v2,..,vn) => f(v1,v2,…,vn) and compile it. This is the projection function of the result. The variables and constraints of the result are the same as from source.
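    The three rules can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions, not the code of the attached solution: the query is given as a hand-built tree of hypothetical Rdf, Where, SelectMany and Select nodes instead of a .Net expression tree, conditions are already lists of triples, and Select is told explicitly which names its function uses (so nothing needs to be compiled). Name objects are the named references from fine print 1, Var objects are the solver variables, and scope is the list of enclosing variables from fine print 2.

```python
from dataclasses import dataclass
from itertools import count
from typing import Callable, Optional


@dataclass(frozen=True)
class Var:           # a solver variable
    id: int


@dataclass(frozen=True)
class Name:          # a named reference to a variable, e.g. the "y" in y => ...
    text: str


class Rdf:           # the document itself: all of its values
    pass


@dataclass
class Where:
    source: object
    param: str
    conds: list      # triples whose entries are Name objects or constants


@dataclass
class SelectMany:
    source: object
    param: str
    seq: object


@dataclass
class Select:
    source: object
    param: str
    args: list       # names of the parameters occurring in func
    func: Callable


@dataclass
class NormalForm:
    variables: list
    constraints: list
    output: Optional[Var]                # set while the projection is the identity
    projection: Optional[tuple] = None   # (argument variables, function)


def translate(expr, scope, fresh):
    """Return the normal form of expr; scope maps enclosing names to variables."""
    if isinstance(expr, Rdf):
        v = Var(next(fresh))
        return NormalForm([v], [], v)
    src = translate(expr.source, scope, fresh)
    assert src.output is not None, "Source must still have the identity projection"
    # Make the lambda parameter name point to the output variable of Source.
    scope = {**scope, expr.param: src.output}
    if isinstance(expr, Where):
        bound = [tuple(scope[t.text] if isinstance(t, Name) else t for t in c)
                 for c in expr.conds]
        return NormalForm(src.variables, src.constraints + bound, src.output)
    if isinstance(expr, SelectMany):
        seq = translate(expr.seq, scope, fresh)
        return NormalForm(src.variables + seq.variables,
                          src.constraints + seq.constraints,
                          seq.output, seq.projection)
    if isinstance(expr, Select):
        return NormalForm(src.variables, src.constraints, None,
                          ([scope[n] for n in expr.args], expr.func))
    raise TypeError(f"unsupported operator: {expr!r}")


f = lambda x, y: y + "   [" + x + "]"
c1 = ("germany", "hasAdminDiv", Name("x"))
c2 = (Name("x"), "isOfType", "germanState")

one_where = SelectMany(Rdf(), "x", Select(
    Where(Rdf(), "y", [c1, c2, (Name("x"), "hasName", Name("y"))]),
    "y", ["x", "y"], f))
two_wheres = SelectMany(Rdf(), "x", Select(
    Where(Where(Rdf(), "y", [c1, c2]), "z", [(Name("x"), "hasName", Name("z"))]),
    "z", ["x", "z"], f))

nf1 = translate(one_where, {}, count())
nf2 = translate(two_wheres, {}, count())
```

    Here nf1 and nf2 have the same variables and the same constraints, although the second query splits the condition over two Where calls and uses the name z for the variable that the first query calls y.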

     

    I attach a VS2005 solution which implements this algorithm. It assumes the May LINQ CTP is installed.

    It contains four projects:

    -         LinqToRdf is the main project which implements this algorithm
    It uses an ITripleProvider object which enumerates triples, and an ISolver object that implements a solution algorithm: it takes “local information” about the possible ways to complete a triple when the predicate and maybe one of subject and object are given, and computes all possible solutions of a given query (given as a set of query triples).

    -         RdfXmlReader is an implementation of ITripleProvider which reads in an RdfXml file. It uses Drive (see the previous blog entry); you have to modify the reference to Drive.dll in this project to point to your copy of Drive.dll.

    -         SimpleSolver implements a simple algorithm to solve an Rdf query in the above normal form.

    -         Demo uses these assemblies to read in the RDF files containing information about Germany and France and list all “administrative divisions” of Germany and France.

     

    As always, this sample code is the product of Weekend Evening Rapid Prototyping, it is provided as-is and does not come with any warranty.
    You can copy, modify, and use the code for commercial and non-commercial purposes.

    To build the RdfXmlReader project, you need to download Drive.dll from http://www.driverdf.org/, see there for legal restrictions which may apply to this DLL.

  • A LINQ provider for RDF files

    The next provider I plan to upload here allows querying an RDF file. From the provider writer’s perspective there is a fundamental difference from the previous “Web page” provider: this provider uses IQueryable instead of IEnumerable and transforms .Net expression trees into query objects that can be executed to perform the query.

     

    “RDF” stands for “Resource Description Framework”: RDF files express properties of and relations between “resources” like web pages, articles, or in fact any objects that can be characterized by a URI – and since you can define URIs for everything, RDF can describe anything.

    You can find more material about RDF at http://planetrdf.com/guide/, and you can see examples of RDF files describing

    -         parts of the above Web site at http://planetrdf.com/guide/rss.rdf (RSS 1.0 is an RDF format)

    -         recordings of the Kronos Quartet at
    http://musicbrainz.org/mm-2.1/artist/f5586dfa-7031-4af0-8042-19b6a1170389/6  

    -         geographical / political / statistical data about Germany at
    http://www.daml.org/2003/09/factbook/gm

     

    The basic data structure used by an RDF file is a list of “(subject, predicate, object)” triples, where the subject and predicate have to be “resources”, while the object can be either a resource or a literal value (which we will just interpret as a string).


    This RDF structure can be stored in different ways; the most common is an XML format. There is an open source RDF parser called “Drive” (written in C#) which reads in such an RDFXML file and exposes the content as triples; you can download it at http://www.driverdf.org/.

    To see how the above RDF file stores the names of the states of Germany, the LINQ provider will allow you to use the following query:

     

    ...
    RdfXmlDoc rdfGermany = new RdfXmlDoc(fileGermany);
    ISolver solver = new SimpleSolver.Solver();
    Rdf rdf = new Rdf(solver);
    rdf.LoadRdf(rdfGermany);

    string nsRdf = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";
    ...
    string isOfType = nsRdf + "type";
    string germany = nsCountries + "GM";
    string hasAdminDiv = nsFactbookOnt + "administrativeDivision";
    string hasName = nsFactbookOnt + "name";
    string germanState = rdfGermany.BaseNamespace + "#State";

    IQueryable<string> q = from x in rdf
                           from y in rdf
                           where rdf.A(germany, hasAdminDiv, x)
                              && rdf.A(x, isOfType, germanState)
                              && rdf.A(x, hasName, y)
                           select y.Val + "   [" + x.Val + "]";

     

    RDF and the additional layers on top of RDF (RDFS, OWL, OWL Rules) specify how additional triples can be inferred from the assertions in the RDF (OWL, …) files. This example does not use such inference: the rdf.A(…) expressions only consider the assertions in the RDF file itself.

     

    In the next blog entry I will describe the translation from LINQ queries to “RdfExpression”s which can be executed to efficiently search the Rdf structure for the required matches.

     

    (To be continued…)

  • A LINQ provider for Web queries

    To start a series of "LINQ provider" posts, today I upload a provider sample that in some sense treats the Internet as a database: For a SQL Server database, you can make tables in a database accessible to LINQ by writing classes with attributes that define how objects of these classes are retrieved from rows in tables. LINQ can then use these classes to issue queries against the database. Similarly, this provider allows adding attributes to classes to specify how such objects are retrieved from Web pages, and you can then issue LINQ queries against them.

    The project "WebLinq" in the attached solution contains this provider - it is not very sophisticated, it just contains three files:
    - WebLinqAttributes.cs contains the attributes that are recognized
    - WebContext.cs is the class your WebLinq enabled classes inherit from
    - Utils.cs contains helper functions to GET / POST to a web site and to find substrings in a text.

    The project "WebSources" defines some classes for 
    - Searching for articles in the CiteSeer web sites (see below)
    - Searching for articles in the MSDN web sites
    - Translating words / sentences
    - Integrating functions of one variable
    - Looking up the current values of stocks from the company symbol

    The project "SimpleDemos" uses these two DLLs to demonstrate the last three classes.

    The project "TestWebLinq" demonstrates the access to the CiteSeer web sites.

    CiteSeer is a database of computer science articles; you can search for articles by keywords, and obtain information about articles, and often even retrieve them directly from the Web site.
    To use the CiteSeer demo, enter for example "Support Vector Machines" in the text box labeled "Search terms", and click on the "Retrieve" button. It will take a while to visit the web pages which list available articles, visit the web page for each article, retrieve the information from this article, and access another web page for details, but then you should see a list of paragraphs which contain
    - Author's name(s)
    - Title and year
    - Some three lines of introduction
    - URL for this article
    - URL for downloading the article as pdf file
    - Information about the rights for this article

    If you are only interested in new articles, try entering 2002 in the "Publication year >=" text field and click again on "Retrieve" (currently I get 3 results back).

    Here is how the corresponding query looks in the code:

    var doc = new GoogleCiteSeer(searchTerms, 0);
    var query = from art in doc.Articles
                where art.details.Document != null
                   && art.details.Document.bibtex != null
                   && art.details.Document.bibtex.year >= minYear
                select art.details;

    Here is an example for a class that defines how to read the "BibTeX" part of the Web page with details for an article:

    public class CsBibTex {
      [StartPart("author = \"")] [EndPart("\"")] public string author;
      [StartPart("title = \"")]  [EndPart("\"")] public string title;
      [StartPart("year = ")]     [EndPart(",")]  public int year;
    }
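    How WebLinq interprets these attributes is not shown in this post. A plausible reading is "take the text between the start marker and the next end marker and convert it to the field type". The following Python sketch illustrates that reading only; extract_between, BIBTEX_FIELDS, parse_bibtex and the sample text are all made up for the example.

```python
def extract_between(text, start, end):
    """Return the text between the first `start` and the next `end`, or None."""
    i = text.find(start)
    if i < 0:
        return None
    i += len(start)
    j = text.find(end, i)
    if j < 0:
        return None
    return text[i:j]


# Hypothetical field table mirroring the attributes on CsBibTex:
# field name -> (start marker, end marker, converter)
BIBTEX_FIELDS = {
    "author": ('author = "', '"', str),
    "title": ('title = "', '"', str),
    "year": ("year = ", ",", int),
}


def parse_bibtex(text):
    """Fill one record by applying every (start, end, converter) rule to text."""
    record = {}
    for name, (start, end, convert) in BIBTEX_FIELDS.items():
        raw = extract_between(text, start, end)
        record[name] = convert(raw.strip()) if raw is not None else None
    return record


# Made-up sample input, not taken from CiteSeer.
sample = '''@inproceedings{example,
  author = "A. Author and B. Writer",
  title = "An Example Title",
  year = 1999,
}'''
record = parse_bibtex(sample)
```

    A field whose markers are not found is left as None, which would match the null checks in the query above.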

    This sample code is provided as-is and does not come with any warranty.
    You can modify and use the code for commercial and non-commercial purposes.

  • Workaround: Smart Tags in C# IDE do not work correctly in LINQ preview

    In the LINQ preview (CTP May 2006), the Smart Tags functionality in the C# IDE does not work correctly. In particular, I miss the "Resolve" feature that puts in the necessary "using" statements automatically. But the good news is that it is easy to get this functionality back:

    1. Start up RegEdit.exe
     

    2. Open HKEY_LOCAL_MACHINE\Software\Microsoft\VisualStudio\8.0\Packages\{A066E284-DCAB-11D2-B551-00C04F68D4DB}\SatelliteDLL

     

    3. Edit the "Path" value and change it from "C:\Program Files\Microsoft Visual Studio 8\VC#\VCSPackages\1033\" to "C:\Program Files\Microsoft Visual Studio 8\VC#\VCSPackages\"

     

    4. Then open a console window, go to the directory where your Visual Studio devenv.exe is located (e.g. C:\Program Files\Microsoft Visual Studio 8\Common7\IDE) and run
     
     devenv /setup /resetuserdata /resetsettings
    (be patient, this takes a while).

     

    See http://forums.microsoft.com/MSDN/ShowPost.aspx?PostID=419975&SiteID=1

  • Linq CTP May 2006 is released

  • Difference between SQL and .Net Framework built in functions

    A common problem when using different programming languages like SQL on the server and C# or VB on the client is that certain functions are almost the same, but not completely. A good example is SQL Round vs. CLR Math.Round.

    For example, rounding to the next integer would round 2.1 and 2.499 to 2, 2.501 and 2.987 to 3, but different implementations do different things with 2.5:

    On SQL Server, Round always rounds a trailing 5 away from zero: for positive numbers the result is greater, while for negative numbers only the absolute value goes up, so the result is actually lower.

    In the CLR Math class, Math.Round uses “Banker’s rounding”, which means that a trailing 5 is rounded either up or down so that the last digit of the result is even.
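    The two tie-breaking rules can be compared side by side without a SQL Server or a CLR at hand. The sketch below uses Python's decimal module as a stand-in: ROUND_HALF_UP breaks ties away from zero (the SQL behavior described above) and ROUND_HALF_EVEN is Banker's rounding. The function names and sample values are chosen for this example only.

```python
from decimal import Decimal, ROUND_HALF_EVEN, ROUND_HALF_UP


def round_half_away(text):
    """Round to an integer, ties going away from zero (SQL ROUND style)."""
    return int(Decimal(text).quantize(Decimal("1"), rounding=ROUND_HALF_UP))


def round_half_even(text):
    """Round to an integer, ties going to the even neighbor (Banker's rounding)."""
    return int(Decimal(text).quantize(Decimal("1"), rounding=ROUND_HALF_EVEN))


samples = ["2.1", "2.499", "2.5", "2.501", "3.5", "-2.5"]
table = [(s, round_half_away(s), round_half_even(s)) for s in samples]

for value, away, even in table:
    print(f"{value:>6}  away from zero: {away:>2}  to even: {even:>2}")
```

    The two columns differ only for 2.5 and -2.5; 3.5 is a tie as well, but both rules send it to 4.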

     

    DLinq sits between the managed languages and SQL: it allows users to write an expression, e.g. in C#, which is then executed as a SQL expression.
    Now DLinq has a problem: Should the semantics be those of SQL or those of C#? This was debated in our team a while ago, and we will probably debate it again before we release the next CTP.

    Here are some options:

    1) Our current solution translates to the SQL built-in function if its meaning is “reasonably close” to that of the .Net Framework function. So Math.Round translates to SQL’s Round function.

    2) We could translate Math.Round(x) to some SQL expression in x that behaves in the same way as Math.Round(x) on the client.

     

    3) We could have 1 as the default behavior, but add libraries of additional functions that behave the same on SQL and .Net, one that does the Banker’s rounding in both cases, one that does the SQL “always rounding up” in both cases.

     

    One reason we went with 1 instead of 2 is that the performance is much better than for a special expression that would replace the simple ROUND function.

    Since one could argue either that developers expect Math.Round to translate to SQL’s ROUND or that they expect it to behave like the CLR’s Math.Round on the client, we went for the simple and efficient solution.
    One argument against 3 is that it only solves the problem for users who know the problem and know where to look for these functions.
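    To give an idea of what the “special expression” in option 2 would have to do, here is a Python sketch that builds Banker's rounding out of a round primitive that breaks ties away from zero. It is not the expression DLinq generates and it is not SQL; the function names are made up, and Python's decimal module only stands in for the server-side ROUND.

```python
from decimal import Decimal, ROUND_HALF_EVEN, ROUND_HALF_UP

HALF = Decimal("0.5")


def sql_style_round(x):
    """Stand-in for the simple primitive: ties go away from zero."""
    return x.quantize(Decimal("1"), rounding=ROUND_HALF_UP)


def bankers_from_sql_round(x):
    """Banker's rounding expressed through sql_style_round plus a correction."""
    r = sql_style_round(x)
    is_tie = abs(r - x) == HALF
    if is_tie and r % 2 != 0:
        # The tie was pushed away from zero onto an odd integer:
        # step back toward zero to reach the even neighbor.
        r -= 1 if r > 0 else -1
    return r


checks = [Decimal(s) for s in
          ["-3.5", "-2.5", "-1.5", "-0.5", "0.5", "1.5", "2.5", "3.5", "2.499", "2.501", "7"]]
results = [bankers_from_sql_round(x) for x in checks]
expected = [x.quantize(Decimal("1"), rounding=ROUND_HALF_EVEN) for x in checks]
```

    The extra tie test and parity test on every value are the kind of overhead that makes option 2 slower than calling ROUND directly.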

     

    What are your thoughts? Would you care at all? Do you strongly prefer another solution? Would you need additional libraries?

  • DLinq Providers

    Over the last few weeks I have been working on DLinq "providers". The idea is to use DLinq to connect to databases other than SQL Server, so we may need to use other mechanisms to connect to the database, generate slightly different SQL, etc. As a first experiment I wrote a basic Jet provider (to access Access databases). It has not been decided whether and how this experiment will be made public, but I would be interested to know what customers would want.
    Obviously it would be interesting to look at other databases, and as a private hobbyist I probably would just download one of the freely available databases and try it, but being an employee of a software company, clicking "I Agree" on the "End User License Agreement" of another software company is a non-trivial operation... So besides Access I was looking at other more exotic possibilities, and one thing that occasionally comes up in customer interactions is Analysis Server. Of course, this has a different flavor than Access since it has more fundamental differences to SQL Server, but it is an interesting idea.
    I should make explicit that there are currently no plans for adding such a provider in any form, since right now other things (like accessing other databases) are much more important. But I am still curious how customers would see this: Is that something you would use? How big would the benefit be for you?

  • Bio

    I joined Microsoft Germany in 1998, and moved to the Visual Studio Group in Redmond in 2000. For the last two years I have been working in the C# product unit, currently in the DLinq team, and I plan to blog about DLinq from time to time. The current web site for LINQ and DLinq is here: http://msdn.microsoft.com/netframework/future/linq/

     

    Before joining Microsoft I studied mathematics in Bonn, Germany and Cambridge, UK and worked as a mathematician at

    Max-Planck-Institut für Mathematik, Bonn, Germany,

    Institute for Advanced Study, Princeton, NJ,

    Katholische Universität Eichstätt, Germany.

    As a hobby, I am still interested in some mathematical aspects of Computer Science, and may also blog about that.


© 2009 Microsoft Corporation. All rights reserved.