python lxml etree xpath - get information from a non-static tree
I am currently moving a German free-hosted forum into phpBB and have
chosen to do so by parsing the forum's HTML files with lxml etree and
Python.
So far, the project is going really well and is almost finished.
However, as usual, a few snags have been thrown my way.
But first, what I do:
- wget the HTML files from http://www.razyboard.com/system/user_Fruchtweinkeller.html
- process the files with tagsoup -nons
- run my script
The script will:
- use etree.parse on the XML
- select all posts with the XPath
  '/html/body/table[tr/td[@width="20%"]]/tr[position()>1]'
- loop through all matches (for p in allposts) and get the content of
  each post with etree.tostring(p.find("td[2]/p"), pretty_print=True)
- run the content of the post through a function that cleans it up
  (text.replace etc.)
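To make the steps above concrete, here is a minimal sketch of the pipeline on an inline stand-in page (the HTML snippet is hypothetical, modelled on the structure the XPath expects — a table whose first column is 20% wide, with a header row followed by post rows):

```python
from lxml import etree

# Hypothetical stand-in for one tagsoup-cleaned forum page.
html = """<html><body>
  <table>
    <tr><td width="20%">Author</td><td>Header</td></tr>
    <tr><td width="20%">user1</td><td><p>First post text</p></td></tr>
    <tr><td width="20%">user2</td><td><p>Second post text</p></td></tr>
  </table>
</body></html>"""

tree = etree.fromstring(html)

# Same XPath as above: every row after the header in any table
# that has a 20%-wide cell.
allposts = tree.xpath('/html/body/table[tr/td[@width="20%"]]/tr[position()>1]')

for p in allposts:
    body = p.find("td[2]/p")  # the post body paragraph
    if body is not None:
        print(etree.tostring(body, pretty_print=True).decode())
```

With a real file you would use etree.parse(filename) instead of etree.fromstring; the XPath and the loop are unchanged.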
Most of the snags I encountered I could deal with. For example, the
forum generates HTML files that sometimes insert an i, b or u element
into the path, so occasionally I get td[2]/i/p instead.
The one snag I cannot see an easy solution for so far, however, is that
if certain codes have been used in the post (like left/center/right
alignment), that content is not inside my XPath match but follows it.
I have semi-solved this by:
- getting td[2]/p
- trying to get a following td[2]/div
- trying to get a following td[2]/ul[*]
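That semi-solution boils down to probing a fixed list of child paths inside the cell and collecting whatever matches. A sketch (the cell markup is hypothetical, mimicking a post whose centred text and list sit next to the p rather than inside it):

```python
from lxml import etree

# Hypothetical post cell: body paragraph followed by centred content
# and a list that are siblings of the <p>, not children of it.
cell = etree.fromstring(
    '<td><p>Intro text</p><div align="center">Centred part</div>'
    '<ul><li>item 1</li><li>item 2</li></ul></td>'
)

# Probe every child path the current semi-solution checks for
# (relative to the cell; "i/p" covers the stray-<i> variant).
parts = []
for path in ("p", "i/p", "div", "ul"):
    parts.extend(cell.findall(path))

for el in parts:
    print(etree.tostring(el).decode())
```

The weakness is exactly the one described below: any sibling tag not on the probe list is silently skipped.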
here is an example:
http://www.razyboard.com/system/morethread-die-rezeptarchive-fruchtweinkeller-545668-1485255-0.html
But this is a crappy solution. First of all, I might miss content I
have not yet seen that uses a similar "after the p" structure. Even
worse, after the div or ul there may be more text in the post without
any particular tags, which will definitely be overlooked.
And I am not writing a script only to then manually check 18,000 files
and 130k posts to see whether they were imported correctly.
One theoretical solution I can think of would be to get the whole td[2]
and discard everything before the p (which I do not really know how to
do) and everything after the text "Signatur". Since not every post has
a signature, this may cause trouble.
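One way to sketch that idea: take all children of the cell and slice the list between the first p and the element whose text is "Signatur". The markup and the "Signatur" marker are assumptions from the description above; the slice falls back to the end of the cell when no signature is present, which covers posts without one:

```python
from lxml import etree

# Hypothetical cell: leading junk, post body, then a signature
# introduced by the literal text "Signatur".
cell = etree.fromstring(
    '<td><span>posted 12:34</span><p>Post body</p>'
    '<div>more body text</div><b>Signatur</b><p>my sig</p></td>'
)

children = list(cell)

# Start at the first <p>; if none is found, fall back to the start.
start = next((i for i, el in enumerate(children) if el.tag == "p"), 0)
# Stop at the "Signatur" marker; if absent, keep everything to the end.
stop = next((i for i, el in enumerate(children)
             if (el.text or "").strip() == "Signatur"), len(children))

post = children[start:stop]
print([etree.tostring(el).decode() for el in post])
```

This keeps untagged-but-wrapped content between body and signature regardless of its tag name, which is the gap the div/ul probing misses.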
Worst of all, however, will be the data I get. With my current system I
get fairly clean text without too many tags to remove. If I grab the
whole td[2], that will no longer be the case, and it would require an
extensive rewrite of my cleantext function (for which I may or may not
be knowledgeable enough).
Note: I have rudimentary coding experience in C, Pascal (yeah, I'm old)
and Bash. This is my first time with Python and XPath. I have learned a
lot, but I will admit much of it was trial and error and/or Google...
I have tried looking into lxml.html instead of lxml.etree and could not
see any advantage in switching over.
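For what it's worth, one concrete difference is that lxml.html elements offer text_content(), which flattens an element to plain text regardless of stray i/b/div wrappers — potentially useful for the cleanup step. A small sketch on hypothetical markup:

```python
import lxml.html

# Hypothetical post fragment with the stray inline wrappers and the
# centred sibling content described in the question.
cell = lxml.html.fromstring(
    '<div><i><p>Post body</p></i>'
    '<div align="center">one centred line</div></div>'
)

# text_content() gathers all descendant text, ignoring the tag soup.
text = cell.text_content()
print(text)
```

Whether this beats the current tostring-plus-replace cleanup depends on how much of the original formatting needs to survive into phpBB, since text_content() throws all markup away.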
Now, I can hope that someone more experienced with lxml, XML and XPath
reads this, looks at the post(s) of the forum and immediately sees a
good solution for me; that would be very welcome.
Since this is not likely, the other avenue would be to give me hints
and push me in the right direction on how to get the whole td[2],
remove everything I do not need (top and bottom), and end up with text
as clean as what I get from just td[2]/p.
Thanks in advance!