Python hacking – make ElementTree support line numbers

An easy way to parse XML in Python is xml.etree.ElementTree, which parses an XML document into a tree structure where each node is an Element object. Within a few lines of code, one can extract all the XML information, including tags and attributes. ElementTree looks perfect until you want the line number of a certain tag block within the original XML file. In this post, we will hack into the Python source, add line number support to ElementTree, rebuild Python, and demonstrate our awesome hack with a simple test script, just 4 FUN! K.R.K.C.

0. Question: can we get the line number using xml.etree.ElementTree?

http://stackoverflow.com/questions/6949395/is-there-a-way-to-get-a-line-number-from-an-elementtree-element

1. Analysis

https://docs.python.org/2/library/xml.etree.elementtree.html gives the detailed usage of ElementTree. Apparently, it does not support line numbers. On one hand, this is reasonable: why would we need line numbers once the XML is parsed into a tree structure? On the other hand, while the tree structure preserves most of the information from the XML file, it loses the line number of each tag block within the original file, which may be useful for debugging.

https://docs.python.org/2/library/pyexpat.html#module-xml.parsers.expat shows one way to get the line number using xmlparser.CurrentLineNumber. However, we do not want to deal with the low-level parser; we want the line number from ElementTree itself. Can we do that? Yes, hack Python!
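For reference, here is roughly what the low-level route looks like. This is only a minimal sketch: element_start_lines and SAMPLE are names made up for illustration, and the only expat features used are ParserCreate(), StartElementHandler, Parse() and CurrentLineNumber.

```python
import xml.parsers.expat


def element_start_lines(xml_text):
    """Return a list of (line_number, tag) pairs, one per start tag."""
    found = []
    parser = xml.parsers.expat.ParserCreate()

    def on_start(tag, attrs):
        # CurrentLineNumber is 1-based and refers to the event being handled.
        found.append((parser.CurrentLineNumber, tag))

    parser.StartElementHandler = on_start
    parser.Parse(xml_text, True)  # True: this is the final chunk of data
    return found


# Hypothetical sample data, not the file used later in this post.
SAMPLE = "<data>\n  <country name='A'/>\n  <country name='B'/>\n</data>"

for line, tag in element_start_lines(SAMPLE):
    print("%d %s" % (line, tag))
```

It works, but we get a flat list of events instead of a tree, which is exactly the bookkeeping ElementTree normally does for us.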

2. R.t.D.C.

http://hg.python.org/cpython/file/f2e6c33ce3e9/Lib/xml/etree/ElementTree.py provides the ElementTree implementation shipped with Python 2.7.8. To parse an XML file, we only need to call ElementTree.parse('filename') within the application. Let us find that parse() method. Because we do not specify a parser, ElementTree.parse() creates a default parser for us:

parser = XMLParser(target=TreeBuilder())

Then let us jump into the XMLParser class within the same source file. Looking at the constructor, we find that the XML parser we are going to use is expat, a C-based XML parser library (pyexpat is just a Python wrapper around expat).

parser = expat.ParserCreate(encoding, "}")
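A quick aside on that second argument: it is the namespace_separator parameter of ParserCreate(). The sketch below (names and ns_xml are made up for illustration) shows what expat reports when the separator is "}", which is presumably why ElementTree only has to prepend a "{" to get its familiar {uri}tag names.

```python
import xml.parsers.expat

# Same call shape as in XMLParser: encoding first, then the namespace separator.
parser = xml.parsers.expat.ParserCreate(None, "}")

names = []
parser.StartElementHandler = lambda name, attrs: names.append(name)

# Hypothetical document with a default namespace.
ns_xml = '<a xmlns="urn:x"><b/></a>'
parser.Parse(ns_xml, True)

for name in names:
    print(name)
```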

Once the parser is created, all the remaining work is setting callbacks on the expat parser. These callbacks tell expat how to process the parsed XML information. There is one callback we are especially interested in.

parser.StartElementHandler = self._start

This callback tells expat how to handle each new element/node parsed from the XML data. The word "start" here stands for the starting point of the element/node. Let us go to the XMLParser._start() method, which calls the target.start() method.

return self.target.start(tag, attrib)

What is the target.start() method? Remember the target argument we passed into the XMLParser constructor? That is it. The target is an instance of the TreeBuilder class within the same source file. Let us jump into the TreeBuilder.start() method.

self._last = elem = self._factory(tag, attrs)

Checking the constructor of TreeBuilder, we find that _factory() defaults to the Element class. Now the story is pretty clear: XMLParser parses the XML data and saves the information as Element objects in a tree structure. The application can then extract this information from the Element objects directly. So a reasonable place to put the line number is the Element class too. Cool, we are done! Wait… but where does the line number come from? Remember the parser we created in the XMLParser constructor? It is kept in a member called self._parser, which is exactly the xmlparser object, which means its CurrentLineNumber can be read right inside XMLParser._start() (leave me a msg if it is not clear…)
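To watch that call chain from the outside, we can hand XMLParser a target of our own. This is just a sketch on a stock Python: RecordingTarget is a made-up class that stands in for TreeBuilder and only logs what start() receives. Note that start() gets a tag and an attribute dict and nothing else, so an unmodified target has no way to learn the line number.

```python
import xml.etree.ElementTree as ET


class RecordingTarget(object):
    """Hypothetical stand-in for TreeBuilder that only logs start() calls."""

    def __init__(self):
        self.calls = []

    def start(self, tag, attrib):
        # Same arguments TreeBuilder.start() receives: no line number here.
        self.calls.append((tag, dict(attrib)))

    def end(self, tag):
        pass

    def data(self, text):
        pass

    def close(self):
        # The return value of close() is what XMLParser.close() hands back.
        return self.calls


parser = ET.XMLParser(target=RecordingTarget())
parser.feed("<data>\n  <country name='Panama'/>\n</data>")
calls = parser.close()

for tag, attrib in calls:
    print("%s %r" % (tag, attrib))
```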

3. Hack

3.1. Download the Python source file

https://www.python.org/download/
https://www.python.org/ftp/python/2.7.8/Python-2.7.8.tgz (I am a fan of Python 2.X)

3.2. Hack the ElementTree

Open Python-2.7.8/Lib/xml/etree/ElementTree.py, where we will:

add a line number to the Element class and update the constructor and the other methods that create a new Element instance;
update the TreeBuilder.start() method to take an extra line number argument and call our new Element constructor with it;
update the XMLParser._start() method to inject the line number and call our updated TreeBuilder.start() method to create a new Element with a line number.
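The real patch is in the pyet repository linked at the end of this post. What follows is not that patch, only a standalone sketch of the same three ideas that runs on an unmodified Python by driving expat directly. LineElement, LineTreeBuilder, LineParser and parse_string are hypothetical names; lineNum matches the attribute the test script reads. Text and tail handling are left out to keep it short.

```python
import xml.etree.ElementTree as ET
import xml.parsers.expat


class LineElement(ET.Element):
    """Idea 1 (sketch): an Element that also carries its source line."""
    lineNum = None


class LineTreeBuilder(object):
    """Idea 2 (sketch): start() takes an extra lineNum argument."""

    def __init__(self):
        self._stack = []   # currently open elements, innermost last
        self.root = None

    def start(self, tag, attrs, lineNum):
        elem = LineElement(tag, attrs)
        elem.lineNum = lineNum
        if self._stack:
            self._stack[-1].append(elem)
        else:
            self.root = elem
        self._stack.append(elem)
        return elem

    def end(self, tag):
        return self._stack.pop()


class LineParser(object):
    """Idea 3 (sketch): _start() injects the expat line number."""

    def __init__(self, target):
        self.target = target
        self._parser = xml.parsers.expat.ParserCreate()
        self._parser.StartElementHandler = self._start
        self._parser.EndElementHandler = self.target.end

    def _start(self, tag, attrs):
        # The xmlparser object knows which line the current start tag is on.
        return self.target.start(tag, attrs, self._parser.CurrentLineNumber)

    def parse_string(self, text):
        self._parser.Parse(text, True)
        return self.target.root


# Hypothetical sample data with a blank line to make the numbers interesting.
SAMPLE = "<data>\n  <country name='A'/>\n\n  <country name='B'/>\n</data>"
root = LineParser(LineTreeBuilder()).parse_string(SAMPLE)

for e in root:
    print("%d %s %r" % (e.lineNum, e.tag, e.attrib))
```

The subclass is only there because this sketch cannot touch the real Element class; in the hacked source the line number goes into Element itself, so every existing ElementTree call keeps working.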

4. Test

Configure and make your new Python. Then use our new Python to test a simple ElementTree application. Here it goes~

[root@daveti test]# cat pyxml.py
import xml.etree.ElementTree as ET

tree = ET.parse('country_data.xml')
root = tree.getroot()

for e in root:
    print e.lineNum, e.tag, e.attrib
[root@daveti test]# /root/python/Python-2.7.8/python ./pyxml.py
3 country {'name': 'Liechtenstein'}
10 country {'name': 'Singapore'}
16 country {'name': 'Panama'}
[root@daveti test]#

5. Code – pyet

All the code is available on my GitHub and is licensed under GPLv2.
https://github.com/daveti/pyet

About daveti

Interested in kernel hacking, compilers, machine learning and guitars.

