Trying to refactor an XSLT processing job to work better.

On openSUSE 13.1 I am trying to do some GIS work with a large file, france-latest.osm.bz2, which I downloaded from here: http://download.geofabrik.de/europe.html

What do I do with that file france-latest.osm.bz2? Decompress it to standard output:

    bzcat france-latest.osm.bz2

What is the aim? I want to extract everything that belongs to the POI type restaurant, i.e.
lon/lat
name
address
and so on.

I have the following up and running:

the package perl-XML-Twig, which provides the xml_split command

xml_split is a command available on openSUSE for splitting XML files (it is part of the package perl-XML-Twig). Now we run the following command (I hope we have enough hard disk space, since the output is roughly 20 GB):

    bzcat france-latest.osm.bz2 | xml_split -s 100M -b france -n 3 -

This results in a bunch of roughly 100 MB XML files: france-001.xml, france-002.xml, and so on. Note that xml_split wraps each chunk in an xml_split:root element, so that is the root element the XSLT has to match. We then have the stylesheet below, and of course we need a bash loop to process the individual files and collect all the results together (a sketch of such a loop follows the stylesheet).


<xsl:stylesheet version="1.0"
        xmlns:xml_split="http://xmltwig.com/xml_split"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

    <xsl:output method="text" encoding="UTF-8"/>

    <xsl:template match="/">
        <!-- select the matching tag children, then step back up with .. to read the node -->
        <xsl:for-each select="xml_split:root/node/tag[@k='amenity' and @v='restaurant']">
            <xsl:value-of select="../@id"/>
            <xsl:text>&#9;</xsl:text>
            <xsl:value-of select="../@lat"/>
            <xsl:text>&#9;</xsl:text>
            <xsl:value-of select="../@lon"/>
            <xsl:text>&#9;</xsl:text>
            <xsl:for-each select="../tag[@k='name']">
                <xsl:value-of select="@v"/>
            </xsl:for-each>
            <xsl:text>&#10;</xsl:text>
        </xsl:for-each>
    </xsl:template>

</xsl:stylesheet>
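
Roughly, the collecting loop could look like this; a sketch, assuming the stylesheet above is saved as restaurants.xsl and xsltproc (from libxslt) is the processor (any XSLT 1.0 processor would do):

    # sketch: run the stylesheet over every chunk and collect the output
    # (assumes the stylesheet above is saved as restaurants.xsl and
    #  xsltproc from libxslt is installed; any XSLT 1.0 processor works)
    for f in france-*.xml; do
        xsltproc restaurants.xsl "$f"
    done > restaurants.tsv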



Question: what do I need to get all of the aimed-for data out of the dataset, i.e.

lon/lat
name
address
and so on?

Below is a data chunk out of the XML file that we are parsing:


    <node id="52768810" lat="48.2044749" lon="11.3249434" version="7" changeset="9490517" user="wheelmap_visitor" uid="290680" timestamp="2011-10-07T20:24:46Z">
        <tag k="addr:city" v="Olching" />
        <tag k="addr:country" v="DE" />
        <tag k="addr:housenumber" v="72" />
        <tag k="addr:postcode" v="82140" />
        <tag k="addr:street" v="Hauptstraße" />
        <tag k="amenity" v="restaurant" />
        <tag k="cuisine" v="mexican" />
        <tag k="email" v="info@cantina-olching.de" />
        <tag k="name" v="La Cantina" />
        <tag k="opening_hours" v="Mo-Su 17:00-01:00" />
        <tag k="phone" v="+49 (8142) 444393" />
        <tag k="website" v="http://www.cantina-olching.com/" />
        <tag k="wheelchair" v="no" />
    </node>

Well, how do I get all of that data out of the file with the XSLT processing described above?
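
For illustration, the address could be pulled the same way as the name; this is only a sketch of extra lines for the inner part of the for-each, with the addr:* keys taken from the sample node above:

<!-- sketch: extra columns inside the same for-each, using the addr:* keys
     from the sample node above (extend with more keys as needed) -->
<xsl:text>&#9;</xsl:text>
<xsl:value-of select="../tag[@k='addr:street']/@v"/>
<xsl:text> </xsl:text>
<xsl:value-of select="../tag[@k='addr:housenumber']/@v"/>
<xsl:text>&#9;</xsl:text>
<xsl:value-of select="../tag[@k='addr:postcode']/@v"/>
<xsl:text> </xsl:text>
<xsl:value-of select="../tag[@k='addr:city']/@v"/>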

Hi,

For both of your refactoring questions:

If I can suggest another, more contemporary approach to what you are trying to do:
What you're doing now is a fairly traditional approach to data processing, and you're doing everything by hand.

You might find it interesting to search for and read about Big Data analysis.
For several years now, and ever accelerating over the past couple, there have been some fantastic tools that do the type of thing you're trying to do, but take it several steps further. Instead of focusing simply on prepping the data, which is what you're doing now (with the objective of later importing it into some app which you haven't identified), the current leading-edge approach is to start with the overall solution and then configure components to perform the various aspects of it. The result is a true end-to-end, homogeneous understanding of what needs to be done to take raw data, parse and transform it as needed for the later stages, and include the actual analysis or search.

So, when you start looking at these kinds of solutions, you'll generally find that over 90% of them are based on the Hadoop/Solr/Pig/Hive software stack (with possible substitutions for each of those individual pieces of software). The underlying magic is provided by Lucene, but that is very hard to work with directly, which is why the applications I listed are used.

Personally, I'm using an alternative stack which does much the same as the traditional stack but was completely re-engineered from the ground up about 4 years ago (and is still evolving today). Like the traditional stack, the Elasticsearch/Logstash stack I'm using (with multiple options) standardizes on JSON for communication between all the pieces of software, and JSON is also typically the basis for creating queries and returning results. Since JSON is used elsewhere and a lot of other software understands it, this is a quick-to-learn and very flexible architecture.

So, getting back a bit to what you're currently doing:
Instead of bothering with the details of parsing and transforming (manipulating the data, perhaps adding tags or ids, separating fields, and more), I use Logstash, which basically applies grok patterns to the data to separate each entry into individual fields, then applies transforms, then exports to some other app. If a pattern doesn't already exist (this is important! If your data is in a standard format, someone else may already have done the work for you!), you can create your own and maybe even publish it as a plugin.
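
As a taste of what that looks like, the following one-liner feeds a line through a grok filter and prints the parsed event (a sketch: the pattern is illustrative, and the path assumes a Logstash 1.4 tarball extracted to ./logstash-1.4.0):

    # sketch: parse a line with grok and print the structured result
    # (pattern is illustrative; path assumes an extracted Logstash 1.4 tarball)
    echo '192.168.0.1 GET /index.html' | ./logstash-1.4.0/bin/logstash -e '
      input { stdin { } }
      filter { grok { match => [ "message", "%{IP:client} %{WORD:method} %{URIPATHPARAM:request}" ] } }
      output { stdout { codec => rubydebug } }'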

Typically, the JSON-formatted data is then streamed directly into Elasticsearch, which is an unstructured database (I highly recommend you read up on why this is so much more powerful than using an RDBMS), or first into a queuing app like Redis in front of ES.
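
To make that concrete: once a record like the restaurant node from your question has been turned into JSON, getting it into ES is a single HTTP call (a sketch; the index and type names osm/restaurant are made up):

    # sketch: index one restaurant as a JSON document
    # (index and type names "osm"/"restaurant" are made up)
    curl -XPUT 'http://localhost:9200/osm/restaurant/52768810' -d '{
      "name": "La Cantina",
      "lat": 48.2044749,
      "lon": 11.3249434,
      "cuisine": "mexican",
      "addr": { "street": "Hauptstraße", "housenumber": "72", "postcode": "82140", "city": "Olching" }
    }'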

Once the data is in ES, you can use graphical tools like Kibana or Graphite to query it, or use the Query DSL (again, based on JSON) to query from a command line (which can be a curl command or a command line embedded in a web page).
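
For example, a Query DSL search from the command line could look like this (again a sketch against the made-up osm index):

    # sketch: find all mexican restaurants in the made-up "osm" index
    curl -XGET 'http://localhost:9200/osm/_search' -d '{
      "query": { "match": { "cuisine": "mexican" } }
    }'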

Anyway,
That's a very brief thumbnail of how most of the leading-edge solutions are being built.
I haven't updated my pages for the latest implementations of ES and Logstash, but here are the starting points I wrote up. I recommend a quick read even if the info isn't fully current:
http://en.opensuse.org/User:Tsu2/elasticsearch_1.0
http://en.opensuse.org/User:Tsu2/elasticsearch_logstash_official_repos

A brief thumbnail of current issues off the top of my head…

  • Don't try to install/run Logstash from the repos at the moment. Instead, download the tar, extract it (to any location), and run it; a sketch follows this list. They just tried to create an installable package instead of a standalone binary, and it won't run on a lot of distros, including openSUSE (but the extracted tar should be completely self-contained, with all dependencies).
  • Do install Elasticsearch 1.0 from the repos as I’ve described. Works great.
  • The official tutorials/documentation for ES and Logstash are, as usual, practically unintelligible to the brand-new user. Try to get what you can from them, but maybe you can find a third-party tutorial (ES 1.0 and LS 1.4 are very new, with major changes, so YMMV). In particular, skim my first link, where I try to cover a number of notable ideas that are not obvious in the old Logstash .90 tutorials (the LS 1.4 tutorials go in a completely different direction, introducing completely new concepts).
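
For the first bullet, the tarball route is roughly the following (a sketch; the version number and download URL are placeholders, check the Logstash download page for the current ones):

    # sketch: run Logstash from the extracted tarball instead of a repo package
    # (version and URL are placeholders; check the Logstash download page)
    wget https://download.elasticsearch.org/logstash/logstash/logstash-1.4.0.tar.gz
    tar xzf logstash-1.4.0.tar.gz
    logstash-1.4.0/bin/logstash -e 'input { stdin { } } output { stdout { } }'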

You can PM me or post in these forums if you have questions about this software, or post to the Google Group for that software.

IMO and HTH,
TSU