Perl and XML parsing


myrcutio
Posts: 44
Joined: Wed Dec 30, 2009 7:28 pm UTC

Postby myrcutio » Tue Jan 03, 2012 7:36 pm UTC

This is a question that I think probably has an elegant solution that I'm overlooking. I'm trying to write a script that reads a large XML file containing multiple <entry> elements and writes the values in each entry to a row in an SQL database. I got close with the XML::Simple module, but it does some confusing things with arrays and hashes when multiple elements share the same tag.

Here's the basic code that I'm starting with:

Code:

use XML::Simple;
use Data::Dumper;

$xml = XML::Simple->new (ForceArray => 1);
$data = $xml->XMLin("blog.xml");

print "$data->{entry}->[0]->{id}\n";


Here's a simplified blog.xml file I'm testing with:

Code:

<?xml version='1.0'?>
<feed>
        <id>11235813</id>
        <author>
                 <name>myrcutio</name>
                 <email>foo@fum.com</email>
        </author>
        <entry>
                <id>2</id>
                <postdate>1854-12-09T09:50:27</postdate>
                <title>Balaclava Sundae</title>
                <content>Storm'd at with shot and shell</content>
        </entry>
        <entry>
                <id>1</id>
                <postdate>1854-12-09T08:47:35</postdate>
                <title>Crimean Pentathlon</title>
                <content>Boldly they rode and well</content>
        </entry>
</feed>


The XML I'll actually be parsing is much larger, a few years' worth of blog entries, but the structure is the same.

I can read in the base elements just fine using $data->{id}; or $data->{author}->{name}; but it seems to nest all the <entry> elements in a single nested hash. I don't really understand what it's doing at that point, it seems like it should just make a hash array with each array element one of the <entry> objects, and I do get an array when I try print $data->{entry};
but if I try the following it only returns a single hash:

Code:

@entries = $data->{entry};
print @entries;


What should I do?
"Lightning must have hit it, and now it won't work in anything but Windows 95."

Faustus runs afoul of Microsoft.

Carnildo
Posts: 2023
Joined: Fri Jul 18, 2008 8:43 am UTC

Re: Perl and XML parsing

Postby Carnildo » Wed Jan 04, 2012 4:38 am UTC

myrcutio wrote:I can read in the base elements just fine using $data->{id}; or $data->{author}->{name}; but it seems to nest all the <entry> elements in a single nested hash. I don't really understand what it's doing at that point, it seems like it should just make a hash array with each array element one of the <entry> objects, and I do get an array when I try print $data->{entry};
but if I try the following it only returns a single hash:

What should I do?


How familiar are you with Perl references and complex data structures? Using Data::Dumper, I get the following data structure from your example file:

Code:

$VAR1 = {
          'entry' => [
                     {
                       'content' => 'Storm\'d at with shot and shell',
                       'postdate' => [
                                     '1854-12-09T09:50:27'
                                   ],
                       'title' => [
                                  'Balaclava Sundae'
                                ],
                       'id' => [
                               '2'
                             ]
                     },
                     {
                       'content' => 'Boldly they rode and well',
                       'postdate' => [
                                     '1854-12-09T08:47:35'
                                   ],
                       'title' => [
                                  'Crimean Pentathlon'
                                ],
                       'id' => [
                               '1'
                             ]
                     }
                   ],
          'author' => [
                      {
                        'email' => [
                                   'foo@fum.com'
                                 ],
                        'name' => [
                                  'myrcutio'
                                ]
                      }
                    ],
          'id' => [
                  '11235813'
                ]
        };

$data->{entry} is a reference to an anonymous array of anonymous hash references. You can access the "Balaclava Sundae" entry as $data->{entry}->[0], and the "Crimean Pentathlon" entry as $data->{entry}->[1]. You can also perform the normal array operations on it if you dereference the array reference first (e.g. "grep {$_->{title}->[0] eq 'Crimean Pentathlon'} @{$data->{entry}}").
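
A minimal sketch of that dereferencing, assuming the ForceArray => 1 structure from the dump above. The structure is built by hand here (and trimmed to a few fields) so the example runs without XML::Simple or blog.xml; the variable names are only for illustration.

```perl
use strict;
use warnings;

# Hand-built copy of part of the structure in the dump above.
my $data = {
    entry => [
        { id => ['2'], title => ['Balaclava Sundae'],   content => "Storm'd at with shot and shell" },
        { id => ['1'], title => ['Crimean Pentathlon'], content => 'Boldly they rode and well' },
    ],
};

# $data->{entry} is an array *reference*; @{ ... } turns it back into a list.
my @entries = @{ $data->{entry} };
print scalar(@entries), " entries\n";

# Each element is a hash reference, and with ForceArray => 1 most fields
# are themselves one-element array references, hence the trailing ->[0].
for my $entry (@entries) {
    print "$entry->{id}->[0]: $entry->{title}->[0]\n";
}

# grep works on the dereferenced list as usual.
my @matches = grep { $_->{title}->[0] eq 'Crimean Pentathlon' } @entries;
print "matched id $matches[0]->{id}->[0]\n";
```

Writing "@entries = $data->{entry};" without the @{ ... } stores the single array reference as the only element of @entries, which is why printing it shows one item.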

As a side note, using "ForceArray => 1" is making things harder than they need to be, by forcing all XML elements into array references. You'll make things easier by replacing that "1" with an anonymous array of the specific element names that you want to force into arrays (eg. "ForceArray => ['entry']").

Another thing to consider is that if your XML file is stateful (ie. it indicates changing authorship by interleaving "author" and "entry" elements), XML::Simple is the wrong tool for the job. You'll need to use a stream parser instead.

myrcutio
Posts: 44
Joined: Wed Dec 30, 2009 7:28 pm UTC

Re: Perl and XML parsing

Postby myrcutio » Wed Jan 04, 2012 7:31 am UTC

I got a different result when I ran Data::Dumper:

Code:

$VAR1 = {
          'entry' => {
                     '1' => {
                            'postdate' => '1854-12-09T08:47:35',
                            'content' => 'Boldly they rode and well',
                            'title' => 'Crimean Pentathlon'
                          },
                     '2' => {
                            'postdate' => '1854-12-09T09:50:27',
                            'content' => 'Storm\'d at with shot and shell',
                            'title' => 'Balaclava Sundae'
                          }
                   },
          'author' => {
                      'email' => 'foo@fum.com',
                      'name' => 'myrcutio'
                    },
          'id' => '11235813'
        };

It seems to be detecting the id tag and grouping the entire element under it, rather than creating an array. I reworked my code and got to this point:

Code:

use XML::Simple;
use warnings;

$xml = XML::Simple->new;
$data = $xml->XMLin("test.xml");

foreach (%{$data->{entry}}){
        print "\n$_->{title}\n";
}

but it gives this output:

Code:

Use of uninitialized value in string at test.pl line 9.

Crimean Pentathlon
Use of uninitialized value in string at test.pl line 9.

Balaclava Sundae

I can add an if(exists $_->{title}) conditional around it to get it working, but I'm hoping there's a prettier solution. As for the XML being stateful (and I'm grossly unfamiliar with the term), I'm inclined to say it's not, since I only intended this script for a one-off migration and the author is the same for all entries.

Carnildo
Posts: 2023
Joined: Fri Jul 18, 2008 8:43 am UTC

Re: Perl and XML parsing

Postby Carnildo » Thu Jan 05, 2012 3:50 am UTC

myrcutio wrote:It seems to be detecting the id tag and grouping the entire element under it, rather than creating an array.

You should be able to control this using the "KeyAttr" parameter of XML::Simple->new().
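
A minimal sketch of what that might look like, assuming XML::Simple is installed and using a trimmed copy of the sample file passed in as a string. An empty KeyAttr list turns off the folding on <id>, and ForceArray => ['entry'] (suggested earlier in the thread) keeps <entry> an array reference even when the file holds only one entry.

```perl
use strict;
use warnings;
use XML::Simple;

# Trimmed copy of the sample blog.xml from the first post.
my $xml_text = <<'XML';
<feed>
  <id>11235813</id>
  <entry>
    <id>2</id>
    <title>Balaclava Sundae</title>
  </entry>
  <entry>
    <id>1</id>
    <title>Crimean Pentathlon</title>
  </entry>
</feed>
XML

my $xs = XML::Simple->new(
    KeyAttr    => [],          # do not fold <entry> elements into a hash keyed on <id>
    ForceArray => ['entry'],   # always an array ref, even with a single <entry>
);
my $data = $xs->XMLin($xml_text);

# Entries arrive as an array reference of plain hashes, in document order.
for my $entry (@{ $data->{entry} }) {
    print "$entry->{id}: $entry->{title}\n";
}
```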

phlip
Restorer of Worlds
Posts: 7569
Joined: Sat Sep 23, 2006 3:56 am UTC
Location: Australia

Re: Perl and XML parsing

Postby phlip » Mon Jan 09, 2012 1:28 am UTC

Carnildo's suggestion will help you out with a real solution, but I thought I'd still explain what this is doing:
myrcutio wrote:

Code:

foreach (%{$data->{entry}}){

You're passing foreach a hash where it expects a list. In list context, a Perl hash flattens into a list of alternating keys and values. So if you do something like:

Code:

%a = ('key1' => 'value1', 'key2' => 'value2');
print "$_\n" for (%a);
it will output something like:

Code:

key1
value1
key2
value2
(possibly in another order, when I try it I get key2,value2,key1,value1, but always alternating key and associated value.)

If you want to list through all the keys, or all the values, you can use "for (keys %hash)" or "for (values %hash)". Or you can follow Carnildo's suggestions and not get a hash in the first place.
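
Applied to the folded structure from the earlier dump, this might look like the sketch below. The data is a hand-built, trimmed copy so the example runs on its own; it is meant to show why every other iteration of the original loop hit an undefined value.

```perl
use strict;
use warnings;

# Hand-built copy of the folded structure, keyed on each entry's <id>.
my $data = {
    entry => {
        1 => { title => 'Crimean Pentathlon' },
        2 => { title => 'Balaclava Sundae' },
    },
};

# In list context the hash flattens to (id, hashref, id, hashref).
my @flat = %{ $data->{entry} };
for my $item (@flat) {
    if (ref $item eq 'HASH') {
        print "value: $item->{title}\n";
    }
    else {
        print "key:   $item\n";   # a plain string, so there is no title to look up
    }
}

# values() yields only the hash references, skipping the keys.
my @titles = sort map { $_->{title} } values %{ $data->{entry} };
print "$_\n" for @titles;
```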

Code:

enum ಠ_ಠ {°□°╰=1, °Д°╰, ಠ益ಠ╰};
void ┻━┻︵​╰(ಠ_ಠ ⚠) {exit((int)⚠);}
[he/him/his]

shawnhcorey
Posts: 42
Joined: Sun Jan 08, 2012 2:08 pm UTC

Re: Perl and XML parsing

Postby shawnhcorey » Mon Jan 09, 2012 3:59 pm UTC

myrcutio wrote:What should I do?


Switch to XML::Twig. XML::Simple does not preserve the structure of an XML document; XML::Twig does.
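
A rough sketch of the XML::Twig approach, assuming the module is installed and using a trimmed copy of the sample file. The @rows array is a hypothetical stand-in for the database insert. A handler runs once for each complete <entry>, and purge discards what has already been processed, which may help with a file covering several years of posts.

```perl
use strict;
use warnings;
use XML::Twig;

# Trimmed copy of the sample blog.xml from the first post.
my $xml_text = <<'XML';
<feed>
  <id>11235813</id>
  <entry>
    <id>2</id>
    <title>Balaclava Sundae</title>
    <content>Storm'd at with shot and shell</content>
  </entry>
  <entry>
    <id>1</id>
    <title>Crimean Pentathlon</title>
    <content>Boldly they rode and well</content>
  </entry>
</feed>
XML

my @rows;   # stand-in for $sth->execute(...)
my $twig = XML::Twig->new(
    twig_handlers => {
        # Called once per complete <entry> element, in document order.
        entry => sub {
            my ($t, $entry) = @_;
            push @rows, [
                $entry->first_child_text('id'),
                $entry->first_child_text('title'),
                $entry->first_child_text('content'),
            ];
            $t->purge;   # release the parts of the document already handled
        },
    },
);
$twig->parse($xml_text);   # use parsefile('blog.xml') for a file on disk

print join(' | ', @$_), "\n" for @rows;
```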

myrcutio
Posts: 44
Joined: Wed Dec 30, 2009 7:28 pm UTC

Re: Perl and XML parsing

Postby myrcutio » Wed Jan 11, 2012 11:11 pm UTC

It's working well at the moment, but I couldn't get the syntax right for "for (keys %data)". I tried a few variations like %$data and {$data}, but I eventually gave up, since for (@{$data->{entry}}) seems to work fine. I'm probably getting my arrays and hashes mixed up somewhere, but since it works I'll leave it alone for now. Here's the working version, in case anyone is interested.

Code:

#!/usr/bin/env perl
use DBI;
use XML::Simple;
use warnings;

$host = 'DBI:mysql:testschema;host=www.foobar.com';
$tbl = "testupload";
$usr = "xxxx";
$pwd = "xxxxxxxx";

$dbh = DBI->connect($host, $usr, $pwd, { RaiseError => 1 });
$sth = $dbh->prepare("INSERT into $tbl(blog_title, blog_content, blog_posteddate) values( ?, ?, ?)");

$file = "test.xml";
$xml = XML::Simple->new(KeyAttr => "");
$data = $xml->XMLin($file);

for (@{$data->{entry}}){
    if($_->{category}->{term} =~ m/post/ ){
        $title = $_->{title}->{content};
        $content = $_->{content}->{content};
        $published = $_->{published};

        $sth->execute($title, $content, $published);
    }
}


Thanks for all the advice. I'll try to build it with XML::Twig next time, which may be fairly soon. I have another related script I'm working on to parse the result of a Google Books API query, and I can't quite figure out how to get the string that's returned into a usable data structure. Here's what it looks like: https://www.googleapis.com/books/v1/volumes?q=9780765320520

It looks really similar to the data dump I got with the XML parser, but I can't quite get it to the point where I can reference a specific field (I'd like to avoid ugly regex). My first few attempts were to use split(/":/, $apistring) with a few variations of the split, but I suspect this isn't the correct method. Any advice?
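
The string that URL returns is JSON, so a JSON decoder is probably a better fit than split. A minimal sketch using the core JSON::PP module follows. The sample string is a hypothetical, heavily trimmed response with placeholder values, and fetching the real one (for example with HTTP::Tiny or LWP) is not shown.

```perl
use strict;
use warnings;
use JSON::PP;   # core module since Perl 5.14

# Hypothetical, trimmed response; the real one has many more fields
# and these values are placeholders.
my $apistring = <<'JSON';
{
  "kind": "books#volumes",
  "totalItems": 1,
  "items": [
    { "id": "example-id",
      "volumeInfo": { "title": "Example Title", "authors": ["Example Author"] } }
  ]
}
JSON

# decode_json turns the text into nested hash and array references,
# much like the structure XML::Simple produced.
my $result = decode_json($apistring);

for my $item (@{ $result->{items} || [] }) {
    my $info = $item->{volumeInfo};
    print "$info->{title} by ", join(', ', @{ $info->{authors} || [] }), "\n";
}
```

Once decoded, a specific field is reached the same way as in the XML case, for example $result->{items}->[0]->{volumeInfo}->{title}.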

phlip
Restorer of Worlds
Posts: 7569
Joined: Sat Sep 23, 2006 3:56 am UTC
Location: Australia

Re: Perl and XML parsing

Postby phlip » Thu Jan 12, 2012 1:06 am UTC

myrcutio wrote:It's working well at the moment, but I couldn't get the syntax right for "for (keys %data)". I tried a few different variations like %$data and {$data} but I eventually gave up ( for (@{$data->{entry}}) seems to work fine) I'm probably getting my arrays and hashes mixed up somewhere, but since it works I'll leave it alone for now.

I think what you'd want is something like:

Code:

for (keys %{$data->{'entry'}})
{
  $title = $data->{'entry'}->{$_}->{'title'}->{'content'};
  # etc
}
If you don't need the ids, then you can shortcut this to:

Code:

for (values %{$data->{'entry'}})
{
  $title = $_->{'title'}->{'content'};
  # etc
}
