Regular expressions, HTML and turning breaks into paragraphs

Althain
Posts: 6
Joined: Mon Apr 05, 2010 1:58 am UTC

Regular expressions, HTML and turning breaks into paragraphs

Postby Althain » Thu May 10, 2012 2:02 pm UTC

Some background: There is a fair amount of fanfiction out there I enjoy reading. What I don't always have is an internet connection or the desire to read this fanfiction on the original site - I much prefer converting it to EPUB and reading it on my tablet instead. Unfortunately, both EPUBs and EPUB readers are very finicky things, especially when it comes to trying to get consistent and comfortable formatting. I like being able to override the internal CSS in order to format all the ebooks I read the same way, but for that I need consistent structure in the text.

So, the problem: The particular HTML I'm trying to parse uses line breaks instead of paragraphs (and empty divs for indentation). Sometimes there will be no space between paragraphs:

Code: Select all

This is paragraph 1.<br />
This is paragraph 2, <i>with some formatting</i>.<br />
This is the paragraph just before a scene break.<br />
<br />
This is the start of a new paragraph with different break tags.
<br>

Sometimes there will be a blank line between them:

Code: Select all

This is paragraph 1.<br />
<br />
This is paragraph 2, <i>with some formatting</i>.<br />
<br />
This is the paragraph just before a scene break.<br />
<br />
<br />
This is the start of a new paragraph with different break tags.
<br>
<br>

And in both cases they may use extra breaks as scene/section dividers.

I would like to put each paragraph in paragraph tags. For the moment I'm sufficiently happy to use hard breaks for scene/section dividers. (Of course, some authors use asterisks or, if they're being particularly kind, horizontal rules.)

Code: Select all

<p>This is paragraph 1.</p>
<p>This is paragraph 2, <i>with some formatting</i>.</p>
<p>This is the paragraph just before a scene break.</p>
<br />
<p>This is the start of a new paragraph with different break tags.</p>


Now, I'm well aware that regular expressions can be a nightmare for parsing HTML, and normally (indeed, throughout the rest of the application) I use the Beautiful Soup module. However, just for the moment at least, I'm trying to solve this single particular problem with regex.

Here's the Python regex so far:

Code: Select all

(?P<paragraph>.*?)\s*?(?:<[/\s]*br[/\s]*>[\s]*)+

(?P<paragraph>.*?)            Matches the entire paragraph in a group named 'paragraph'. Non-greedy, so this shouldn't capture any <br>.
\s*?                          Strips excess whitespace from the end of the paragraph. Non-greedy as written, but it still has to consume all the whitespace before the break tag for the match to succeed.
(?:<[/\s]*br[/\s]*>[\s]*)+    Within the (non-capturing) group, matches any of the variations of <br> I can think of, followed by whitespace. Matches one or more.

I then use the findall and sub methods to replace it with <p> tags around the 'paragraph' group.
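For reference, a minimal sketch of that substitution step. The post does not show the actual call, so the use of re.sub with a \g<paragraph> backreference is an assumption, and the sample string is simply the first example above typed out:

```python
import re

# The pattern from the post, unchanged.
PATTERN = re.compile(r"(?P<paragraph>.*?)\s*?(?:<[/\s]*br[/\s]*>[\s]*)+")

# First example from the post (single breaks, one extra break as a divider).
sample = (
    "This is paragraph 1.<br />\n"
    "This is paragraph 2, <i>with some formatting</i>.<br />\n"
    "This is the paragraph just before a scene break.<br />\n"
    "<br />\n"
    "This is the start of a new paragraph with different break tags.\n"
    "<br>\n"
)

# Replace each "paragraph + run of breaks" with the paragraph wrapped in <p>.
result = PATTERN.sub(r"<p>\g<paragraph></p>\n", sample)
print(result)
```

Every run of break tags is consumed by the trailing group, however long it is, so nothing in the replacement records how many breaks there were.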

Now, this actually works pretty well on all the test and real world cases I've tried. The only issue, as you can probably see, is that this obliterates all <br> tags, even the ones I want to use as scene dividers. I think what I want to do is use a negative lookahead or lookbehind assertion to ensure that no <br> tags appear in the 'paragraph' group, and change the + on the <br> group to {n,n}, where n is the number of breaks delimiting a paragraph (which would of course have to be determined programmatically for the general case). However, I've gotten quite frustrated trying to construct the expression, as even my conceptual tests aren't working correctly.
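One possible way to determine n programmatically, as a rough sketch only: count the break tags in each run and take the most common run length as the paragraph delimiter. The function name guess_breaks_per_paragraph and the "most common run wins" heuristic are assumptions for illustration, not something from the project:

```python
import re
from collections import Counter

# A run of one or more break tags, each optionally followed by whitespace.
RUN = re.compile(r"(?:<[/\s]*br[/\s]*>\s*)+")
# A single break tag, used to count the tags inside a run.
TAG = re.compile(r"<[/\s]*br[/\s]*>")

def guess_breaks_per_paragraph(html):
    """Guess how many <br> tags normally separate two paragraphs.

    Heuristic (an assumption): ordinary paragraph breaks outnumber
    scene dividers, so the most frequent run length is the delimiter.
    """
    lengths = [len(TAG.findall(m.group(0))) for m in RUN.finditer(html)]
    if not lengths:
        return 1  # no breaks at all; fall back to a single break
    return Counter(lengths).most_common(1)[0][0]
```

This would misfire on a text that is mostly scene dividers, so it is only a starting point.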

Any advice on how to solve this problem, with or without regex (but still necessarily in Python, or Calibre at worst), would be greatly appreciated.

diabolo
Posts: 72
Joined: Fri Aug 08, 2008 4:17 pm UTC
Location: france

Re: Regular expressions, HTML and turning breaks into paragr

Postby diabolo » Thu May 10, 2012 2:35 pm UTC

Althain wrote:Now, this actually works pretty well on all the test and real world cases I've tried. The only issue, as you can probably see, is that this obliterates all <br> tags, even the ones I want to use as scene dividers.

Have you tried handling the scene dividers in a first pass (converting multiple <br>s, asterisks, ... to <hr>) before dealing with the paragraphs?
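A rough sketch of what such a two-pass approach might look like, handling only runs of <br> (asterisks and the like are left out). The function names and the breaks_per_paragraph parameter are made up for illustration; the idea is that any run longer than the normal paragraph separator becomes an <hr /> first, and only then are the remaining pieces wrapped:

```python
import re

# One break tag in any of its variations, plus trailing whitespace.
BR = r"<[/\s]*br[/\s]*>\s*"

def mark_dividers(html, breaks_per_paragraph=1):
    """Pass 1: turn runs of more than the usual number of breaks into <hr />."""
    run = re.compile("(?:" + BR + "){" + str(breaks_per_paragraph + 1) + ",}")
    return run.sub("<hr />\n", html)

def wrap_paragraphs(html):
    """Pass 2: split on the remaining breaks and wrap each piece in <p>."""
    out = []
    # The capturing group keeps <hr /> in the result; break tags are dropped
    # (re.split yields None for the group when a break tag matched instead).
    for piece in re.split("(<hr />)|" + BR, html):
        if piece is None or not piece.strip():
            continue
        piece = piece.strip()
        out.append(piece if piece == "<hr />" else "<p>" + piece + "</p>")
    return "\n".join(out)
```

Usage would be something like wrap_paragraphs(mark_dividers(text, 1)) for the single-break style and mark_dividers(text, 2) for the double-break style.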

Token
Posts: 1481
Joined: Fri Dec 01, 2006 5:07 pm UTC
Location: London

Re: Regular expressions, HTML and turning breaks into paragr

Postby Token » Sat May 12, 2012 11:14 am UTC

Regular expressions are a powerful tool, but to use them properly you need to have a good understanding of both their strengths and their limitations. It's well known that they are not capable of parsing arbitrary HTML (the "regular" in "regular expression" has a technical meaning, and HTML is not a regular language). That's not the same thing as "regular expressions are useless in any situation involving HTML", though, and I think you've correctly identified that this is a problem that responds well to attack using regular expressions.

However, you're falling into the same general trap that a lot of regular expression users do, which is trying to make your regular expression do too much. Regular expressions are incredibly useful for finding substrings of a larger string that match a pattern that allows for some level of inconsistency. As soon as you try to make them do more than that, you're going to run into difficulty, because even when they are capable of solving the problem, they tend to do so in a complicated way that will therefore be difficult to (a) find in the first place, (b) debug, or (c) modify or extend.

Regular expressions will do a wonderful job of finding the break tags in your text. What they will not do very well is to determine where the section dividers go. I mean, you could almost certainly get them to, but you have no reason to when you have a better language at your disposal (Python) to do it instead. Here is how I would recommend you approach the problem:

Use regular expressions to locate break tags in the text, and split it at those points. You will then have a sequence of strings, some of which will be non-empty and non-whitespace (defining your paragraph contents), and some of which will be empty or whitespace (defining your section dividers according to how many there are in a row). Iterate through this sequence and build up your output HTML step-by-step. This will be a lot simpler and clearer than a pure regular expression solution.
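A minimal sketch of the split-and-iterate approach described above. The function name and the breaks_per_paragraph parameter are assumptions for illustration, and the divider is emitted as a single hard break, as in the desired output from the first post:

```python
import re

# A single break tag in any of its variations (no whitespace consumed).
BR = re.compile(r"<[/\s]*br[/\s]*>")

def breaks_to_paragraphs(html, breaks_per_paragraph=1):
    """Split on break tags, then rebuild the HTML piece by piece."""
    out = []
    breaks = 0  # break tags seen since the last paragraph was emitted
    for i, piece in enumerate(BR.split(html)):
        if i > 0:
            breaks += 1  # exactly one tag sits between consecutive pieces
        text = piece.strip()
        if not text:
            continue  # empty or whitespace-only: just part of a run of breaks
        # More breaks than a normal paragraph separator means a scene divider.
        if out and breaks > breaks_per_paragraph:
            out.append("<br />")
        out.append("<p>" + text + "</p>")
        breaks = 0
    return "\n".join(out)
```

The regular expression only finds the tags; deciding what a run of them means is ordinary Python, which makes the divider rule easy to change later.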

D-503
Posts: 84
Joined: Sun Apr 15, 2012 11:35 pm UTC

Re: Regular expressions, HTML and turning breaks into paragr

Postby D-503 » Sun May 13, 2012 12:51 am UTC

Here's an idea: there's a website called ScraperWiki (scraperwiki.com) where you can develop and share Python/Ruby/PHP scrapers. A scraper is a program that converts unstructured documents (generally HTML) into structured (usually tabular) data.
Converting HTML to eBook formats would be an atypical use case, but I think an interesting one if you could make it work. Also, you might be able to learn some nice parsing techniques by looking at the other scrapers.

Althain
Posts: 6
Joined: Mon Apr 05, 2010 1:58 am UTC

Re: Regular expressions, HTML and turning breaks into paragr

Postby Althain » Sun May 13, 2012 4:45 am UTC

I think the trap I fell into was having mostly solved the problem with a single regex and then hoping I could squeeze in that little extra functionality as well. For what it's worth I kind of managed to hack together an expression that worked, but it is ugly and terrible and all sorts of bad things.

Token wrote:Use regular expressions to locate break tags in the text, and split it at those points. You will then have a sequence of strings, some of which will be non-empty and non-whitespace (defining your paragraph contents), and some of which will be empty or whitespace (defining your section dividers according to how many there are in a row). Iterate through this sequence and build up your output HTML step-by-step. This will be a lot simpler and clearer than a pure regular expression solution.

This is one of those solutions that seems blindingly obvious once it's been pointed out. I'll go ahead and implement this instead.

Thank you all for your suggestions.

If anyone's curious, this is part of a larger project that I'm a (small) part of, FanFictionDownLoader. We have CLI, Web and Calibre plugin versions if you want to try it out.

