davejingtian.org | Still got dreams; still got hopes… | Page 6

Bye Eugene and Hi Gainesville – Yet another new start in life

Posted on December 14, 2014 by daveti

I came across with the picture below and I decided to use it for this post, with the proud of being a DUCK. FYI, except this picture, this post is NOT related with any technologies or football. Save your time if you have got more fun to enjoy!

Life changes all the time, whether from good sides or bad sides. After 2 years as a Ph.D. student of CIS department at the University of Oregon (UO), I have just arrived at Gainesville, continuing my Ph.D. program of CISE department at the University of Florida (UF). Again, I have no idea if this will be good or bad. And eventually, there is no such thing which is designed to be good or bad by God, according to Quantum Physics. Depending on your ‘interference’, it will ‘display’ either good or bad.

Things have been changed a lot in my 2014, such as my age. I am not a Duck anymore. My father is in his 60th and looks more healthy than myself. I am married. My brother cried when I left for the U.S.. Fubao (our dog) starts to miss me. After sleeping in the couch of my friend’s apartment in Eugene for a week, I have just found another comfy couch in my friend’s house in Gainesville. I am still looking for a new place to live in Gainesville.

What is meaning of life? What do we want during the short time of life and is it worthy? Actually, I have stopped thinking about answers to these questions. Looking back, I have realized all I need to do is moving forward. There is alway something, which seems blur while so clear, directing my life. Unlike the Blues, it is both hard to feel and play. This sounds unfair given the short time of our life. On the other hand, this may be the only way to prove that we once truly lived.

To see a world in a grain of sand
And a heaven in a wild flower,
Hold infinity in the palm of your hand
And eternity in an hour.

— William Blake

一花一世界，
一沙一天国，
君掌盛无边，
刹那含永劫。

Posted in Life Prose | Tagged UF, UO | Leave a comment

Python Internals – Integer object pool (PyIntObject)

Posted on December 11, 2014 by daveti

Starting from this post, I will try to make a series of blogs on Python Internals, where Python object mechanism, Python bytecode (pyc) and Python VM are gonna be discussed. We will also talk about the limited resource online about Python internals. However, this series of posts owe its inspiring thoughts to the book “Python Source Code Analysis” by Robert Chen. Happy hacking!

0. Quiz

Let us start our journey from an ‘everyone-loved’ quiz. Relax. It is simple to read but tricky to think. There are 2 code snippets down there. Which one would you think runs faster?

# Small Integers
for j in range(0, 100000):
	for i in range(0, 256):
		pass

# Big Integers
for j in range(0, 100000):
	for i in range(1000, 1256):
		pass

Actually, I can make it even more ‘fun’ by adding one more code snippet down there. Now which one is the fastest? And the slowest?

# Bigger Integers
for j in range(0, 100000):
	for i in range(10000, 10256):
		pass

1. Answer

Theoretically, the speed comparison of these 3 snippets running follows this way: smaller > big == bigger. If this is a little bit confusing and surprising, please read on:)

2. PyIntObject

Like everyone knows, everything in Python is an object (PyObject) and Integer is not an exception. When we write down something like ‘a = 5’, Python tries to construct an Integer object (PyIntObject, one can think of it as a container) whose value is 5 internally. This construction usually involves ‘new’ in C++ and ‘malloc’ in C. As Integer in Python is immutable, almost all the integers have to have their own memory allocation even holding the same value (we will talk about why it is ‘almost’ here soon). Apparently, it is not a good design to call ‘malloc’ each time when an integer comes up and Python has been complained about its performance for a while. So how would we (Python) handle this case internally?

Like everyone knows (again?), the answer is integer object pool. This pool holds numbers of PyIntObjects which has been allocated already but not initialized (no valid value or type). Whenever a new integer shows up, a free and existing PyIntObject from the pool will be filled up and initialized without bothering the ‘malloc’. SoThe block_list is the pool and the free_list is the list of available PyIntObjects within the pool.

/* daveti: Python-2.7.8/Objects/intobject.c */
#define BLOCK_SIZE      1000    /* 1K less typical malloc overhead */
#define BHEAD_SIZE      8       /* Enough for a 64-bit pointer */
#define N_INTOBJECTS    ((BLOCK_SIZE - BHEAD_SIZE) / sizeof(PyIntObject))

struct _intblock {
    struct _intblock *next;
    PyIntObject objects[N_INTOBJECTS];
};

typedef struct _intblock PyIntBlock;

static PyIntBlock *block_list = NULL;
static PyIntObject *free_list = NULL;

But still, this does not explain the answer to the quiz. Let us move forward. Even though there is only one integer object pool. This pool is divided into 2 parts, small integers and big integers. Big integers are exactly the case mentioned above, where PyIntObject is used as a container for the integer and even integers having the same value have their own PyIntObject. However, small integers have a different story. Though in the same integer pool, small integers have been initialized with the small-integer values already. Following small integers will not trigger a new PyIntObject in the pool but refer to the same PyIntObject with the same value. This is a little bit confusing. Let us have an example here: for ‘a = 5’, there exists a PyIntObject with the value of 5 in the pool. So ‘a’ will ‘point’ to that PyIntObject. When ‘b = 5’ shows up, ‘b’ also ‘point’s to the same PyIntObject like ‘a’ does. However, for ‘a = 10000’, it is a big integer. Python will try to find an available PyIntObject from the pool and initialize it with the value of 10000. When ‘b = 10000’ comes up, a same procedure has to repeat again.

The difference between the small integers and the big ones also rises 2 questions – what is changed within the PyIntObject for small integers with the same value and what are small integers? As no new PyIntObject is requested from the pool, small integers with the same value, ‘point’ing to the same PyIntObject, would only change the reference count (refcount) within the PyIntObject. Unlike the big integers, small integers stay in the memory during the life of Python VM/interpreter, with the refcount never being 0. The definition of small integers are defined as below:

/* daveti: Python-2.7.8/Objects/intobject.c */
#ifndef NSMALLPOSINTS
#define NSMALLPOSINTS           257
#endif
#ifndef NSMALLNEGINTS
#define NSMALLNEGINTS           5
#endif
#if NSMALLNEGINTS + NSMALLPOSINTS > 0
/* References to small integers are saved in this array so that they
   can be shared.
   The integers that are saved are those in the range
   -NSMALLNEGINTS (inclusive) to NSMALLPOSINTS (not inclusive).
*/
static PyIntObject *small_ints[NSMALLNEGINTS + NSMALLPOSINTS];

During the initialization of Python, small integers, which are also built up the integer object pool, are initialized with the range [-5, 257) and ready to use.

/* daveti: Python-2.7.8/Objects/intobject.c */
int
_PyInt_Init(void)
{
    PyIntObject *v;
    int ival;
#if NSMALLNEGINTS + NSMALLPOSINTS > 0
    //daveti: init small/big int pool!
    for (ival = -NSMALLNEGINTS; ival < NSMALLPOSINTS; ival++) {
          if (!free_list && (free_list = fill_free_list()) == NULL)
                    return 0;
        /* PyObject_New is inlined */
        v = free_list;
        free_list = (PyIntObject *)Py_TYPE(v);
        PyObject_INIT(v, &PyInt_Type);
        v->ob_ival = ival;
        //daveti: small int pool is based on block_list (big int pool) as well!
        small_ints[ival + NSMALLNEGINTS] = v;
    }
#endif
    return 1;
}

When an integer from your program comes to play, Python would try to determine if this is a small integer, which means the corresponding PyIntObject is already there for use. Otherwise, it should be a big integer, which means we have to find a free PyIntObject from the pool and initialize it with the corresponding value.

/* daveti: Python-2.7.8/Objects/intobject.c */
PyObject *
PyInt_FromLong(long ival)
{
    register PyIntObject *v;
    //daveti: try to use small int pool
#if NSMALLNEGINTS + NSMALLPOSINTS > 0
    if (-NSMALLNEGINTS <= ival && ival < NSMALLPOSINTS) {
        v = small_ints[ival + NSMALLNEGINTS];
        Py_INCREF(v);
#ifdef COUNT_ALLOCS
        if (ival >= 0)
            quick_int_allocs++;
        else
            quick_neg_int_allocs++;
#endif
        return (PyObject *) v;
    }
#endif
    if (free_list == NULL) {
        if ((free_list = fill_free_list()) == NULL)
            return NULL;
    }
    //daveti: use big int pool/block!
    /* Inline PyObject_New */
    v = free_list;
    //daveti: free_list gives the first available PyIntObject
    //then we use Py_TYPE(v) to update the free_list - to the next available position
    //Isn't this smart:)
    free_list = (PyIntObject *)Py_TYPE(v);
    //daveti: have to init it anyway because of missing invalid ob_type
    PyObject_INIT(v, &PyInt_Type);
    v->ob_ival = ival;
    return (PyObject *) v;
}

Now let us go back to the quiz. The outside loop among those code snippets is the same. The only difference is the inner loop, which has different ranges. For the small case, the range is [0, 256), which would be treated as small integers by Python, which means Python would do nothing but update the reference count within the existing PyIntObject during each loop. For the big case, the range is [1000, 1256), which would be treated as big integers by Python, which means Python would try to find a free PyIntObject from the pool and initialize it with the corresponding value during each loop. The bigger case is the same as the big one.

3. Tests

To prove our analysis, we need to make sure there is no difference between the small case and the big case, from the Python bytecode level:

Daves-MacBook-Pro-2:blog daveti$ python
Python 2.7.6 (default, Sep 9 2014, 15:04:36)
[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.39)] on darwin
Type “help”, “copyright”, “credits” or “license” for more information.
>>> def foos():
… for j in range(0, 100000):
… for i in range(0, 256):
… pass
…
>>> def foob():
… for j in range(0, 100000):
… for i in range(1000, 1256):
… pass
…
>>> import dis
>>> dis.dis(foos)
2 0 SETUP_LOOP 49 (to 52)
3 LOAD_GLOBAL 0 (range)
6 LOAD_CONST 1 (0)
9 LOAD_CONST 2 (100000)
12 CALL_FUNCTION 2
15 GET_ITER
>> 16 FOR_ITER 32 (to 51)
19 STORE_FAST 0 (j)

3 22 SETUP_LOOP 23 (to 48)
25 LOAD_GLOBAL 0 (range)
28 LOAD_CONST 1 (0)
31 LOAD_CONST 3 (256)
34 CALL_FUNCTION 2
37 GET_ITER
>> 38 FOR_ITER 6 (to 47)
41 STORE_FAST 1 (i)

4 44 JUMP_ABSOLUTE 38
>> 47 POP_BLOCK
>> 48 JUMP_ABSOLUTE 16
>> 51 POP_BLOCK
>> 52 LOAD_CONST 0 (None)
55 RETURN_VALUE
>>> dis.dis(foob)
2 0 SETUP_LOOP 49 (to 52)
3 LOAD_GLOBAL 0 (range)
6 LOAD_CONST 1 (0)
9 LOAD_CONST 2 (100000)
12 CALL_FUNCTION 2
15 GET_ITER
>> 16 FOR_ITER 32 (to 51)
19 STORE_FAST 0 (j)

3 22 SETUP_LOOP 23 (to 48)
25 LOAD_GLOBAL 0 (range)
28 LOAD_CONST 3 (1000)
31 LOAD_CONST 4 (1256)
34 CALL_FUNCTION 2
37 GET_ITER
>> 38 FOR_ITER 6 (to 47)
41 STORE_FAST 1 (i)

4 44 JUMP_ABSOLUTE 38
>> 47 POP_BLOCK
>> 48 JUMP_ABSOLUTE 16
>> 51 POP_BLOCK
>> 52 LOAD_CONST 0 (None)
55 RETURN_VALUE
>>>

Then let us go and profile them:

Daves-MacBook-Pro-2:python daveti$ cat small_int.py
# Small Integers
for j in range(0, 100000):
for i in range(0, 256):
pass
Daves-MacBook-Pro-2:python daveti$ cat big_int.py
# Big Integers
for j in range(0, 100000):
for i in range(1000, 1256):
pass
Daves-MacBook-Pro-2:python daveti$ python -m cProfile small_int.py
100003 function calls in 1.347 seconds

Ordered by: standard name

ncalls tottime percall cumtime percall filename:lineno(function)
1 1.209 1.209 1.347 1.347 small_int.py:2()
1 0.000 0.000 0.000 0.000 {method ‘disable’ of ‘_lsprof.Profiler’ objects}
100001 0.137 0.000 0.137 0.000 {range}

Daves-MacBook-Pro-2:python daveti$ python -m cProfile big_int.py
100003 function calls in 1.513 seconds

Ordered by: standard name

ncalls tottime percall cumtime percall filename:lineno(function)
1 1.327 1.327 1.513 1.513 big_int.py:2()
1 0.000 0.000 0.000 0.000 {method ‘disable’ of ‘_lsprof.Profiler’ objects}
100001 0.186 0.000 0.186 0.000 {range}

4. Thoughts

Till now, I hope you had enjoyed this post and been looking forward to my next writing:) But there is something more before we end this post. We now are clear with the idea of Integer object pool, its mechanism and implementation in Python and we now know answer to the quiz. But the key point of the quiz is not just that simple. It is telling us the performance of your Python program, if using a lot of integers, from either the view of time or space, depends on the value of these integers! Isn’t this surprising! Moving forward, to gain the best performance, we can change the range of small integers of Python. Similarly, programs, all using small integers, would have the best performance. So, either hack the Python compiler and your program, your choice:)

Posted in Programming, Stuff about Compiler | Tagged PyIntObject, Python | Leave a comment

Python Hacking – urlopen timeout issue

Posted on November 14, 2014 by daveti

Recent playing with Python urllib2 reveals an interesting fact that the timeout parameter of urlopen() does not work sometimes. The interesting issue has successfully pushed me deep into the Python source code for debugging. The final debugging, without surprise, shows a bug of Python socket module implementation. This post shares the whole procedure of the issue, the debugging, the fix and the discussion. May it help your Python hacking:)

0. Issue

After Python 2.6, timeout parameter has been introduced into urllib2 urlopen(). You can use urlopen() in this way – urlObj = urllib2.urlopen(‘an_url’, timeout=20). Ideally, this will timeout the under-layer socket if it takes too long, such as below:

Traceback (most recent call last):
File “”, line 1, in
File “/Users/daveti/Python-2.7.8/Lib/urllib2.py”, line 127, in urlopen
return _opener.open(url, data, timeout)
File “/Users/daveti/Python-2.7.8/Lib/urllib2.py”, line 404, in open
response = self._open(req, data)
File “/Users/daveti/Python-2.7.8/Lib/urllib2.py”, line 422, in _open
‘_open’, req)
File “/Users/daveti/Python-2.7.8/Lib/urllib2.py”, line 382, in _call_chain
result = func(*args)
File “/Users/daveti/Python-2.7.8/Lib/urllib2.py”, line 1217, in http_open
return self.do_open(httplib.HTTPConnection, req)
File “/Users/daveti/Python-2.7.8/Lib/urllib2.py”, line 1190, in do_open
r = h.getresponse(buffering=True)
File “/Users/daveti/Python-2.7.8/Lib/httplib.py”, line 1076, in getresponse
response.begin()
File “/Users/daveti/Python-2.7.8/Lib/httplib.py”, line 413, in begin
version, status, reason = self._read_status()
File “/Users/daveti/Python-2.7.8/Lib/httplib.py”, line 369, in _read_status
line = self.fp.readline(_MAXLINE + 1)
File “/Users/daveti/Python-2.7.8/Lib/socket.py”, line 487, in readline
data = self._sock.recv(self._rbufsize)
socket.timeout: timed out
>>> >>>

However, I have encountered an URL which invalidates the timeout parameter of urlopen, which had got stuck seemed forever. Fortunately, Ctrl-C still worked like a charm:

>>> obj=urllib2.urlopen(url, timeout=20)
^CTraceback (most recent call last):
File “”, line 1, in
File “/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py”, line 127, in urlopen
return _opener.open(url, data, timeout)
File “/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py”, line 410, in open
response = meth(req, response)
File “/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py”, line 523, in http_response
‘http’, request, response, code, msg, hdrs)
File “/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py”, line 442, in error
result = self._call_chain(*args)
File “/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py”, line 382, in _call_chain
result = func(*args)
File “/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py”, line 629, in http_error_302
return self.parent.open(new, timeout=req.timeout)
File “/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py”, line 404, in open
response = self._open(req, data)
File “/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py”, line 422, in _open
‘_open’, req)
File “/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py”, line 382, in _call_chain
result = func(*args)
File “/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py”, line 1222, in https_open
return self.do_open(httplib.HTTPSConnection, req)
File “/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py”, line 1187, in do_open
r = h.getresponse(buffering=True)
File “/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py”, line 1045, in getresponse
response.begin()
File “/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py”, line 441, in begin
self.msg = HTTPMessage(self.fp, 0)
File “/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/mimetools.py”, line 25, in __init__
rfc822.Message.__init__(self, fp, seekable)
File “/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/rfc822.py”, line 108, in __init__
self.readheaders()
File “/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py”, line 280, in readheaders
line = self.fp.readline(_MAXLINE + 1)
File “/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/socket.py”, line 476, in readline
data = self._sock.recv(self._rbufsize)
File “/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/ssl.py”, line 241, in recv
return self.read(buflen)
File “/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/ssl.py”, line 160, in read
return self._sslobj.read(len)
KeyboardInterrupt
>>>

The last 2 traced modules are ssl and socket. Then we probably could make an argument that urlopen timeout may not work for SSL connections. So it looks like either the ssl module or the socket module does not have the timeout information. Let’s go deeper by adding some print stuffs into these modules to make it clear.

1. Python Hacking

Download the Python 2.7.8 source code from https://www.python.org/. Building Python follows the traditional GNU 3 steps – configure/make/make install. We do not want to install this hacked version to mess around with the official one. So it is just configure/make. As this issue is related with ssl module, please make sure you have OpenSSL installed on your machine. If the build complains failure about _ssl module, go ahead to hack into setup.py, which is used to setup the building prerequisites for different modules, such as the header directory and library directory. If you have to build Python using Mac OS Yosemite, like the poor myself, go ahead to pull my setup.py from the code repo (listed at the very end), which should save some time.

The last trace is _sslobj.read(), which is implemented under Modules/_ssl.c. The read() function eventually is calling the real OpenSSL wrapper, PySSL_SSLread(). Reading thru this function, we will see timeout control there. Hmm, ssl module looks like innocent. So let’s go upper, checking the socket module which is under Lib/socket.py. Jumping directly to the line 476, we will find the traced call. Looking around, we will also find the traced call is within a ‘while True‘ loop. Yep, timeout control within ‘while True’ loop? So……

What happened when urlopen got stuck my here is: _sock.recv() tried to receive something and PySSL_SSLread() returned nothing without timeout. Then _sock.recv() was triggered again as well as PySSL_SSLread()……Thanks to the ‘while True‘, the code kept running.

2. Fix

Now we have got the root cause – within the ‘while True’ loop, the socket layer expects to receive something or any exceptions, such as timeout, from the SSL socket layer. Unfortunately, neither anything nor errors was reported to the socket to break the loop in our case. The fix for this is straightforward – adding timeout checking within this ‘while True’. All the debugging code and the fix are available at the puuth code repo at the bottom.

3. Discussion

So we have fixed the urlopen timeout issue, right? Well, for my case, yes; for yours, maybe not. When you check the last traced call within the httplib module, Lib/httplib.py, you will find the issued call, self.fp.readline(_MAXLINE + 1), is also within a ‘while True‘ loop…Well, we can consider adding timeout here as well. Unfortunately, the socket object is converted into file object (self.fp). So there is no graceful way to have the timeout information within this class.

Actually, I am not going to blame these ‘while True’ loops considering all the ‘while 1’ I have done to play with the socket. It is a common practice to have this loop making sure you have all the things received or sent. And eventually, Python is doing pretty well on the timeout control. Back to the system programming, we have different timeout for different socket operations, such as connect(), recv(), send() and etc. setsockopt() is used to determine the desired timeout value for the desired function call. In Python, the timeout from urlopen() is passed from the urllib2 to the C implementation wrappers, covering all the socket operations. This is amazing and convenient indeed. While ‘while True’ loop counts on the under-layer to have some input or timeout, there was neither input nor exceptions. I can live with the fact that the socket layer returning nothing. But why not timeout?

When timeout is passed from urlopen(), this value is also saved into the socket object, which is seen by the under layers as well. For each socket operation, the time consumption is computed and the compared to the timeout value to see if it should raise a timeout error. You see? Python checks the timeout for each socket operation using the UNCHANGED timeout saved in the socket object. Even if each socket operation is not timed out, given the ‘while True’ loop, we may go far beyond the original timeout epoch! As we do need to save the timeout value from the user and keep it unchanged for use by different socket operations. For the same socket operation, such as recv() within a ‘while True’ loop, we may need another timeout, which decreases after each recv(), one solution, in my opinion, would be helpful in the long term.

4. Code

puuth: https://github.com/daveti/puuth

Posted in Programming, Stuff about Compiler | Tagged puuth, Python, socket, SSL, timeout, urllib2, urlopen | 5 Comments

CCS’14 – Securing SSL Certificate Verification through Dynamic Linking

Posted on November 6, 2014 by daveti

Our paper ‘Securing SSL Certificate Verification through Dynamic Linking‘ is ready for download from CCS’14. This post gives a brief introduction of main idea of this paper – certShim, and then concentrates on bytecode instrumenting in JVM to fix some security issues mentioned in the TMDC paper ‘The most dangerous code in the world‘. Happy hacking!

0. certShim

The TMDC paper has revealed some crucial breakage of the SSL certificate verification procedure because of the badly designed SSL libraries. To conquer these issues, the solution of certShim is dynamic linking, simple yet powerful. Under the certShim environment, when the SSL library entry function is called, instead of falling into the original SSL library implementation, the code flow will be redirected to the certShim SSL implementation, which is also built upon the original SSL libraries but provides the ability to enforce the correct SSL certification verification procedure even the application code fails to do so. For sure, there are more stuffs in certShim. Please check the paper.

1. Java

certShim uses dynamic linking, which means it only works for C/C++ SSL implementations, such as OpenSSL/GnuTLS. The other big part of applications is Java and we also want to support it. Section 3.5 Java Instrumentation of the paper gives a solutions. First, we do not want to change the application code; second, we do not want to change the JDK code as well, which contains the default SSL implementation. Unfortunately, there is no such a thing in Java called dynamic linking. Then the solution is straightforward – bytecode instrumenting.

2. Bytecode Instrumentation

Thanks to the powerful JVM interface, there is a JVM option which is used to do the bytecode injection for the running classes – Java agent (http://docs.oracle.com/javase/6/docs/api/java/lang/instrument/package-summary.html). Then another problem comes along – we do not want to write bytecode. So what? Luckily, we have got another powerful tool called Javassist (http://www.javassist.org/), which is used to translate the Java code into the bytecode. Note that we do not provide details on how to play with Java agent or Javassist. Instead, please check the csa (certShimAgent) source code below.

3. Bug & Fix

The TMDC paper shows an issue where SSL hostname verification is ignored if SSLSocketFactory is used to create a SSL socket because this method leaves ‘algorithm’ parameter NULL used by hostname verification functions. In OpenJDK 6, this function is checkIdentity(); in OpenJDK 7, the functions are checkTrusted() and checkIdentity(). Then our fix is clear. Reset the ‘algorithm’ parameter if null before the JVM runs into the checkX() function. Even if SSLSocketFactory is called in the application code, we do not need to worry as the hostname is guaranteed anyway. For implementations details, please check csa code repo.

4. Test

For sure, we could write a simple Java app calling SSLSocketFactory and connecting to a website with a bad hostname. But my personal favorite would be hacking the OpenJDK, printing the ‘algorithm’ parameter before checkX() and printing it again within the checkX(). There is a simple OpenJDK hack in the following section.

5. Done?

We have talked about the certShim paper and the Java instrumentation mentioned in the paper. Are we done? Apparently, no. There are still a lot of things on going in the SENSEI lab. I am sure I will post sth cooler in the near future.

6. Code

certShim: https://bitbucket.org/uf_sensei/cert-shim
certShimAgent: https://github.com/daveti/csa

7. OpenJDK Hack Howto (by daveti)

1. Download the OpenJDK source (OpenJDK6)
https://java.net/projects/openjdk6/downloads/
1.1 OpenJDK7
http://openjdk.java.net/projects/jdk7u/
hg clone http://hg.openjdk.java.net/jdk7u/jdk7u60/
cd jdk7u60
chmod +x ./get_source.sh
./get_source.sh

2. Hack (Sun SSL Implementation)
jdk/src/share/classes/sun/security/ssl/X509TrustManagerImpl.java

3. Build (Ubuntu 14.04 LTS x86_64)
sudo aptitude build-dep openjdk-6
sudo aptitude install openjdk-6-jdk gcc-4.8 g++-4.8
sudo aptitude install libmotif-dev
export LANG=C ALT_BOOTDIR=/usr/lib/jvm/java-6-openjdk-amd64
make all CC=gcc-4.8 CPP=g++-4.8

4. Install
The whole SDK suite could be found at
build/linux-amd64/j2sdk-image
The whole JRE suite could found at
build/linux-amd64/j2re-image

5. Test
build/linux-amd64/j2sdk-image/bin/java -version
build/linux-amd64/j2sdk-image/bin/javac Hello.java
build/linux-amd64/j2sdk-image/bin/java Hello

root@daveti-virtual-machine:~/java# /root/openjdk6_hack/build/linux-amd64/j2sdk-image/bin/java -version
openjdk version “1.6.0-internal”
OpenJDK Runtime Environment (build 1.6.0-internal-root_07_may_2014_21_37-b00)
OpenJDK 64-Bit Server VM (build 23.25-b01, mixed mode)

6. SSL Test
Update CA certs to avoid – Unexpected error: java.security.InvalidAlgorithmParameterException: the trustAnchors parameter must be non-empty
Under build/linux-amd64/j2sdk-image/jre/lib/security/
mv cacerts cacerts.orig
ln -s /etc/ssl/certs/java/cacerts cacerts

root@daveti-virtual-machine:~/java# /root/openjdk6_hack/build/linux-amd64/j2sdk-image/bin/java SSLSocketClient
getCipherSuites(): [Ljava.lang.String;@eee2024
getNeedClientAuth(): false
getProtocols(): [Ljava.lang.String;@33802324
getWantClientAuth(): false

daveti: algorithm = null
HTTP/1.1 200 OK
7. Java Agent (Needs Javassit)
git clone https://github.com/jboss-javassist/javassist
cp javassist/javassist.jar build/linux-amd64/j2sdk-image/jre/lib/ext

root@daveti-virtual-machine:~/java/org/davejingtian/certshim/agent# /root/openjdk6_hack/build/linux-amd64/j2sdk-image/bin/javac -cp ../lib/javassist.jar Agent.java
warning: ../lib/javassist.jar(javassist/CannotCompileException.class): major version 51 is newer than 50, the highest major version supported by this compiler.
It is recommended that the compiler be upgraded.
warning: ../lib/javassist.jar(javassist/ClassPool.class): major version 51 is newer than 50, the highest major version supported by this compiler.
It is recommended that the compiler be upgraded.
warning: ../lib/javassist.jar(javassist/CtClass.class): major version 51 is newer than 50, the highest major version supported by this compiler.
It is recommended that the compiler be upgraded.
warning: ../lib/javassist.jar(javassist/CtMethod.class): major version 51 is newer than 50, the highest major version supported by this compiler.
It is recommended that the compiler be upgraded.
warning: ../lib/javassist.jar(javassist/LoaderClassPath.class): major version 51 is newer than 50, the highest major version supported by this compiler.
It is recommended that the compiler be upgraded.
warning: ../lib/javassist.jar(javassist/NotFoundException.class): major version 51 is newer than 50, the highest major version supported by this compiler.
It is recommended that the compiler be upgraded.
warning: ../lib/javassist.jar(javassist/ClassPath.class): major version 51 is newer than 50, the highest major version supported by this compiler.
It is recommended that the compiler be upgraded.
warning: ../lib/javassist.jar(javassist/CtBehavior.class): major version 51 is newer than 50, the highest major version supported by this compiler.
It is recommended that the compiler be upgraded.
warning: ../lib/javassist.jar(javassist/CtMember.class): major version 51 is newer than 50, the highest major version supported by this compiler.
It is recommended that the compiler be upgraded.
9 warnings

ROOT CAUSE: javassist lib is compiled using JDK 7, which has major version 51. We are using JDK 6 for the compiling which has major version 50.
FIX: Build javassist lib using JDK 6 again.
THEN: cp the new javassist.jar into j2sdk/jre/lib/ext

root@daveti-virtual-machine:~/git/javassist# echo $JAVA_HOME

root@daveti-virtual-machine:~/git/javassist# export JAVA_HOME=/root/openjdk6_hack/build/linux-amd64/j2sdk-image
root@daveti-virtual-machine:~/git/javassist# echo $JAVA_HOME
/root/openjdk6_hack/build/linux-amd64/j2sdk-image
root@daveti-virtual-machine:~/git/javassist# ll
total 808
drwxr-xr-x 8 root root 4096 May 8 13:15 ./
drwxr-xr-x 3 root root 4096 May 7 23:38 ../
-rw-r–r– 1 root root 10677 May 8 13:15 build.xml
drwxr-xr-x 8 root root 4096 May 7 23:39 .git/
-rw-r–r– 1 root root 103 May 7 23:39 .gitignore
-rw-r–r– 1 root root 707490 May 7 23:39 javassist.jar.orig
-rw-r–r– 1 root root 25890 May 7 23:39 License.html
-rw-r–r– 1 root root 9526 May 7 23:39 pom.xml
-rw-r–r– 1 root root 26631 May 7 23:39 Readme.html
drwxr-xr-x 2 root root 4096 May 8 13:12 runtest/
drwxr-xr-x 9 root root 4096 May 7 23:39 sample/
drwxr-xr-x 4 root root 4096 May 7 23:39 src/
drwxr-xr-x 5 root root 4096 May 8 13:12 target/
drwxr-xr-x 2 root root 4096 May 7 23:39 tutorial/
root@daveti-virtual-machine:~/git/javassist# ant
Buildfile: /root/git/javassist/build.xml

prepare:

compile14:
[javac] /root/git/javassist/build.xml:80: warning: ‘includeantruntime’ was not set, defaulting to build.sysclasspath=last; set to false for repeatable builds
[javac] Compiling 212 source files to /root/git/javassist/target/classes

jar:
[jar] Building jar: /root/git/javassist/javassist.jar
[jar] Building jar: /root/git/javassist/javassist-src.jar

BUILD SUCCESSFUL
Total time: 4 seconds
root@daveti-virtual-machine:~/git/javassist# ll
total 1976
drwxr-xr-x 8 root root 4096 May 8 13:16 ./
drwxr-xr-x 3 root root 4096 May 7 23:38 ../
-rw-r–r– 1 root root 10677 May 8 13:15 build.xml
drwxr-xr-x 8 root root 4096 May 7 23:39 .git/
-rw-r–r– 1 root root 103 May 7 23:39 .gitignore
-rw-r–r– 1 root root 677599 May 8 13:16 javassist.jar
-rw-r–r– 1 root root 707490 May 7 23:39 javassist.jar.orig
-rw-r–r– 1 root root 513625 May 8 13:16 javassist-src.jar
-rw-r–r– 1 root root 25890 May 7 23:39 License.html
-rw-r–r– 1 root root 9526 May 7 23:39 pom.xml
-rw-r–r– 1 root root 26631 May 7 23:39 Readme.html
drwxr-xr-x 2 root root 4096 May 8 13:12 runtest/
drwxr-xr-x 9 root root 4096 May 7 23:39 sample/
drwxr-xr-x 4 root root 4096 May 7 23:39 src/
drwxr-xr-x 5 root root 4096 May 8 13:12 target/
drwxr-xr-x 2 root root 4096 May 7 23:39 tutorial/
root@daveti-virtual-machine:~/git/javassist# cp ./javassist.jar /root/openjdk6_hack/build/linux-amd64/j2sdk-image/jre/lib/ext
root@daveti-virtual-machine:~/git/javassist# cp ./javassist.jar /root/java/org/davejingtian/certshim/lib
root@daveti-virtual-machine:~/git/javassist#

root@daveti-virtual-machine:~/java/org/davejingtian/certshim/agent# /root/openjdk6_hack/build/linux-amd64/j2sdk-image/bin/javac Agent.java

root@daveti-virtual-machine:~/java# /root/openjdk6_hack/build/linux-amd64/j2sdk-image/bin/jar cmf ./manifest.txt certShimAgent.jar ./org/*

root@daveti-virtual-machine:~/java# /root/openjdk6_hack/build/linux-amd64/j2sdk-image/bin/java -javaagent:certShimAgent.jar SSLSocketClient
daveti: Agent starts!

Posted in Security | Tagged certShim, csa, Java, Javassist, JDK, JVM, OpenJDK, SENSEI, SSL | 1 Comment

Hack Go – generate non-FIPS DSA key parameters using hacked Go compiler

Posted on October 15, 2014 by daveti

Common DSA key parameter generation follow FIPS 186-3, which requires (L, N) bits length for P and Q, where L is at least 1024 and N is at least 160. However, sometimes we may want non-FIPS (L, N), for example, to make the signature length less than 40 bytes. This post first investigate some possible solutions then introduces a smart way to generate non-FIPS DSA key parameters by hacking into the Go compiler source. Note: I am neither the expert of Go nor DSA! Happy hacking:)

0. Requirement

Say we want to generate DSA key parameters using (1024, 152) for P and Q (there is another long story why we want to do that:) Note that this does NOT follow FIPS 186-3, which requires Q at least 160 bits. However, this still follows the requirement of DSA key parameter generation, which requires only Q length <= data digest length. The security guarantee is reduced as well as the length of the final signature.

1. Python

The first thing I looked for was Python implementation (Yep, PyFan). There are 2 libs popped out from Google – PyCrypto and gdata-python-client. Interestingly, the 2 implementation of DSA are basically the same. Check out PyCrypto’s DSA implementation (https://github.com/dlitz/pycrypto/blob/master/lib/Crypto/PublicKey/_DSA.py). Function ‘generateQ‘, as it is named, assumes the default 160-bit seed to generate the 160-bit Q in a weird way (that I could not understand if this is the so called ‘random’). Even worse, function ‘generate_py‘, generating both P and Y, are tightly coupled with the number 160. And again, I doubt it would work if all ‘160’ are replaced with ‘152’ (in case I am wrong, please do let me know). Apparently, you have to understand it before you can change it.

2. Java

As Python was not that friendly and I am stupid, I referred to Java (this is the reason why I am stupid:) Look at Dr Yang’s DSA example (http://www.herongyang.com/Cryptography/DSA-Java-Generating-DSA-Key-Pair.html). The class we are looking for is ‘KeyPairGenerator‘, which is under ‘jdk/src/share/classes/java/security’ directory within the OpenJDK source code. Digging deeper, we will find the real DSA implementation is in ‘jdk/src/share/classes/sun/security/provider/DSAParameterGenerator.java‘. Moving on, check out method ‘generatePandQ‘ and you would not be surprised to see ‘160’ everywhere. Meanwhile, you can tell from the source code that the implementation follows FIPS step by step.

3. Go

Fortunately, I came across the Go DSA implementation within the compiler source before I had made up my mind to ‘understand’ PyCrypto. Check this out (http://golang.org/src/pkg/crypto/dsa/dsa.go). Oops, you do not know anything about Go? Neither do I! Look at function ‘GenerateParameters‘ – there is only one place where ‘160’ is used. So…

Daves-MacBook-Pro-2:dsa daveti$ diff dsa.go.orig dsa.go
43a44,45
> // daveti: hack
> L1024N152
62a65,68
> //daveti: hack
> case L1024N152:
> L = 1024
> N = 152
Daves-MacBook-Pro-2:dsa daveti$

The hack looks so elegant that we have to run some test to make sure it works indeed. Before this, let us build the Go compiler – src/all.bash. The hacked Go compiler locates under bin directory. Use the DSA application example here (https://www.socketloop.com/tutorials/golang-example-for-dsa-functions) and remember to replace the ‘L1024N160’ with ‘L1024N152’. It works like a charm and we do have a shorter DSA signature!

4. Summary

Finding the existing and popular implementation and hacking it is always a best solution given a task. It is time saving and it is fun. Even though I am not familiar with Go, it has become one of my hacking targets regarding its graceful design and implementation. When hacking Python or Java is not straightforward, try Go!

Posted in Programming, Security | Tagged crypto, DSA, FIPS, Go | 1 Comment

byte order – Linux kernel byte order APIs

Posted on August 23, 2014 by daveti

Most of the time, user-space applications do not care about the endian of the running machine, like we call send() and recv() directly expecting the right order of the data and they work! However, things could be changed if you are dealing with low-level data processing in the kernel space, such as file systems and device drivers. The good thing is the kernel has provided a standard set of APIs which are used to do byte order conversion. In this post, we implement a kernel module playing with these APIs – trying to help understand the usage and effect of these APIs. Have fun~

0. APIs

le16(/32/64)_to_cpu
cpu_to_le16(/32/64)
be16(/32/64)_to_cpu
cpu_to_be16(/32/64)

1. x86_64 arch

global_data:
01 23 45 67 89 ab cd ef
le16_to_cpu:
01 23 45 67 89 ab cd ef
le32_to_cpu:
01 23 45 67 89 ab cd ef
le64_to_cpu:
01 23 45 67 89 ab cd ef
cpu_to_le16:
01 23 45 67 89 ab cd ef
cpu_to_le32:
01 23 45 67 89 ab cd ef
cpu_to_le64:
01 23 45 67 89 ab cd ef
be16_to_cpu:
23 01 67 45 ab 89 ef cd
be32_to_cpu:
67 45 23 01 ef cd ab 89
be64_to_cpu:
ef cd ab 89 67 45 23 01
cpu_to_be16:
23 01 67 45 ab 89 ef cd
cpu_to_be32:
67 45 23 01 ef cd ab 89
cpu_to_be64:
ef cd ab 89 67 45 23 01

2. Note

Though there is a standard set of byte order APIs, they are architecture dependent. For example, ARM arch may share the same results as x86_64 while Power/SPARC may have different ones. The rule of thumb here is – Do NOT create your own byte order APIs; Stick with these standard kernel APIs!

3. R.t.D.C.

https://github.com/daveti/kbo

Posted in OS, Programming | Tagged endian, kernel, Linux | 2 Comments

ksig – sending signals from the kernel space to the user space

Posted on August 23, 2014 by daveti

While syscall kill() gives users the ability to send certain signal to certain process in the user space of Linux, the story is totally different if we want to send signals from the kernel space to the user space. After some googling, I have found an excellent page describing this topic with real code examples. All the code here is based on these examples with few changes to make them work on the latest kernel (3.13.0). Have fun~

0. Reference

http://people.ee.ethz.ch/~arkeller/linux/multi/kernel_user_space_howto-6.html

1. siginfo

Signal is eventually the struct siginfo in the kernel code. To send a signal, we have to construct this signal at first. To make sure the process in the user space is always able to receive this signal, we will use real-time signal rather than the normal signal, as it is explained in the reference. A new signal number (SIG_TEST, 44) is also defined to avoid potential conflict with existing signals, which are architecture dependent.

2. debugfs

As the user space routine, the kernel has to know the pid of target process before sending a signal. There are different ways for communications between the kernel space and the user space, such as netlink socket, relay or /proc. Here, debugfs is used to create a new entry under /sys/kernel/debug. The write operations call back is implemented allowing the kernel to receive the value (pid) written by the user space.

3. task, send_sig_info()

As process is actually the struct task in the kernel, we have to find the corresponding task once given the pid – pid_task(find_pid_ns(pid, &init_pid_ns), PIDTYPE_PID). (Note that older kernels may have different ways to do this). Once we have got the task, send_sig_info() is called to deliver this signal with the signal data (because of real-time signal) to the target task (process) – send_sig_info(SIG_TEST, &info, t).

4. R.t.D.C.

https://github.com/daveti/ksig

Posted in Dave's Tools, OS, Programming | Tagged debugfs, kernel, Linux, signal | Leave a comment

Book Recommendation – Building Probabilistic Graphical Models with Python

Posted on August 9, 2014 by daveti

“Building Probabilistic Graphical Models with Python” is the 3rd book I have reviewed from PACKT publishing. It has just been released on this June. Compared to the classic book of PGM – “PGM: Principles and Techniques” by Dr. Koller, this book is ‘tiny’. While the former provides the complete theory backgrounds of PGM, our ‘tiny’ book goes straightforward into the real-world PGM libraries and applications. This post is trying to help readers figure out what is in this book from a pure technical perspective. Cheers~

0. The book

http://bit.ly/1yQJ2Wl

1. The content

Chap 1 talks about the basics of Probability – random variables, Bayes rules and distributions. Once readers grasp the basic ideas, PyMC is introduced to help understand and enforce these concepts.

Chap 2 dives into the Bayesian Network (BN) – independence, conditional independence and D-separation. Libpgm and Scikit-learn are introduced here to create BN models, run queries and play with Naive Bayes Classifier.

Chap 3 concentrates on the bases of Markov Network (MN) – factorization, separation and Conditional Random Field (CRF).

Chap 4 hits the active research field of PGM, structure learning from the data – constraint based and score based learning. This chapter provides a lot of code snippet to understand how data and learning approaches could impact the final structure learned, using IPython, libpgm and numpy.

Chap 5 demonstrates methods of parameter learning, Maximum Likelihood Estimation (MLE) and Bayesian Estimation. Each method has a code example to show the key points of this method, using PyMC, scipy and matplotlib.

Chap 6 shows ways to do exact inference in graph models – Variable Eliminations (VE) and Junction Tree (JT), each of which has a complete code example using libpgm plus a step-by-step analysis.

Chap 7 discusses the active research field of PGM, approximate inference – message passing and sampling. (Loopy) Belief Propagation (BP) is demonstrated in an image processing example using OpenGM. Markov Chain Monte Carlo (MCMC) , Gibbs Sampling and Metropolis-Hasting are also discussed with real code examples.

2. The conclusion

Though PGM has been there for years, this area is not widely known unless you are a graduate student or researcher devoting yourself to the statistical branch of the AI field. If you are, then you must remember the days when you had to implement your own MN solver using VE (BTW, I still have my own in my github). The only problem is the ‘standard’ implementation used in the real world keeps unknown. If you are not, then you must be interested in PGM libraries, tools and applications used in the wild. Either case, this ‘tiny’ book works.

P.S. If you are looking for PGM in Java or other languages, this book does not help. As the book title, it uses Python, which is the most friendly and human-being oriented programming language in the world (in my opinion:)

Posted in AI/ML, Programming | Tagged libpgm, packtpub, pgm, PyMC, Python | Leave a comment

KT – use kernel timers in the Linux kernel

Posted on July 9, 2014 by daveti

Compared to start a timer directly in certain kernel thread which will make the thread sleep for a while, using a kernel timer may be much more desired without stopping the current working thread. This post demonstrates a live example using the kernel timer without talking about details. Some pitfalls are discussed in case that the kernel is panic().

0. R.t.D.C.

/*
 * kt.c
 * Kernel module using the kernel timer
 * Reference: http://lwn.net/images/pdf/LDD3/ch07.pdf
 * Jul 7, 2014
 * root@davejingtian.org
 * http://davejingtian.org
 */
#include <linux/module.h>
#include <linux/timer.h>
#include <linux/spinlock.h>

/* A global struct acting as critical section */
struct global_data {
	/* Spinlock */
	spinlock_t lock;
	/* Counter */
	int count;
};
struct global_data kt_global_data;

/* A timer list */
struct timer_list kt_timer;

/* Timer callback */
void timer_callback(unsigned long arg)
{
	printk(KERN_INFO "Entering: %s\n", __FUNCTION__);

	struct global_data *data = (struct global_data *)arg;
	spin_lock(&(data->lock));
	data->count++;
	spin_unlock(&(data->lock));
	mod_timer(&kt_timer, jiffies + 10*HZ); /* restarting timer */
}

/* Init the timer */
static void kt_init_timer(void)
{
	init_timer(&kt_timer);
	kt_timer.function = timer_callback;
	kt_timer.data = (unsigned long)(&kt_global_data);
	kt_timer.expires = jiffies + 10*HZ; /* 10 second */
	add_timer(&kt_timer); /* Starting the timer */

	printk(KERN_INFO "kt_timer is started\n");
}

/* The normal module worker */
static void kt_do_the_work(void)
{
	printk(KERN_INFO "Before %s, count = %d\n", __FUNCTION__,
		kt_global_data.count);

	spin_lock(&(kt_global_data.lock));
	kt_global_data.count++;
	spin_unlock(&(kt_global_data.lock));

	printk(KERN_INFO "After %s, count = %d\n", __FUNCTION__,
		kt_global_data.count);
}

static int __init kt_init(void)
{
	printk(KERN_INFO "Entering: %s\n", __FUNCTION__);
	
	/* Init the global data */
	memset(&kt_global_data, 0x0, sizeof(struct global_data));

	/* Init the spinlock */
	spin_lock_init(&(kt_global_data.lock));

	/* Init the timer */
	kt_init_timer();

	/* Do our job */
	kt_do_the_work();

	return 0;
}

static void __exit kt_exit(void)
{
	printk(KERN_INFO "exiting kt module\n");

	/* Delete the timer */
	del_timer_sync(&kt_timer);

	printk(KERN_INFO "kt_global_data.count = [%d]\n",
		kt_global_data.count);
}

module_init(kt_init);
module_exit(kt_exit);

MODULE_LICENSE("GPL");
MODULE_DESCRIPTION("KT Module");
MODULE_AUTHOR("daveti");

1. Pitfalls

LDD 3 says – “A kernel timer is a data structure that instructs the kernel to execute a user-defined function with a user-defined argument at a user-defined time.“. One can imagine a real kernel timer worker thread maintaining its timer queue for a certain CPU and doing the work accordingly. However, kernel timers are not free to use. Be sure to understand its restrictions (still from LDD 3):

No access to user space is allowed. Because there is no process context, there is no path to the user space associated with any particular process.
The current pointer is not meaningful in atomic mode and cannot be used since the relevant code has no connection with the process that has been interrupted.
No sleeping or scheduling may be performed. Atomic code may not call schedule or a form of wait_event, nor may it call any other function that could sleep. For example, calling kmalloc(…, GFP_KERNEL) is against the rules. Semaphores also must not be used since they can sleep.

Posted in OS | Tagged kernel, Linux, timer | Leave a comment

Python hacking – make ElementTree support line number

Posted on July 4, 2014 by daveti

An easy way to parse XML in Python is using Python xml.etree.ElementTree, which parses the XML document/data into a tree structure, where each node is an Element object. Only within few lines of code, one can extract all the XML information including tag and attributes. ElementTree looks perfect until the line number of certain tag block within the original XML file is wanted. In this post, we will hack into the Python source file, add line number support for the ElementTree, rebuild the Python and demonstrate our awesome hacking using simple testing script, just 4 FUN! K.R.K.C.

0. Question: can we get the line number using xml.etree.ElementTree?

http://stackoverflow.com/questions/6949395/is-there-a-way-to-get-a-line-number-from-an-elementtree-element

1. Analysis

https://docs.python.org/2/library/xml.etree.elementtree.html gives the detailed usage of ElementTree. Apparently, it does not support line number. On one hand, this is reasonable – why do we need line number if the XML is parsed into a tree structure. On the other hand, while tree structure saves most of the information from the XML file, it loses the line number information for certain tag block within the original file, which may be useful for debugging.

https://docs.python.org/2/library/pyexpat.html#module-xml.parsers.expat shows one way to get the line number information using xmlparser.CurrentLineNumber. However, we do not want to deal with the low-level parser and we want to have the line number from the ElementTree. Can we do that? Yes, hack Python!

2. R.t.D.C.

http://hg.python.org/cpython/file/f2e6c33ce3e9/Lib/xml/etree/ElementTree.py provides the latest ElementTree implementation within the Python 2.7.8 implementation. To parse a XML file, we only need to call ElementTree.parse(‘filename’) within the application. Let us find that parse() method. Because we do not determine the parser option, ElementTree.parse() method would make up a default parser for us:

parser = XMLParser(target=TreeBuilder())

Then let up jump into XMLParser class within the same source file. Looking at the constructor function, we will find the XML parser we are going to use to expat, a C-based XML parser library. (pyexpat is just a Python wrapper for the expat implementation).

parser = expat.ParserCreate(encoding, "}")

Once the parser is created, all the following work is setting different callbacks for the expat parser. These callbacks tell the expat how to process the parsed XML information. There is a one callback we are especially interested in.

parser.StartElementHandler = self._start

This callback is used to tell expat how to handle each new element/node parsed from the XML data. The start key word here stands for the starting point of this element/node. Let us go to the XMLParser._start() method, which will call the target.start() method.

return self.target.start(tag, attrib)

What is target.start() method? Remember the target argument we passed into the XMLParser constructor? That is it. The target is the TreeBuilder class within the same source file. Let us jump into TreeBuilder.start() method.

self._last = elem = self._factory(tag, attrs)

Check the constructor of TreeBuilder, we will find the _factory() method by default is the Element class. Now the story happens here is pretty clear: The XMLParser parses the XML data and saves the information as Element into a tree structure. The application can extract these information from Element object directly. Then a reasonable place to place the line number is the Element class too. Cool, we have done! Wait…but where does the line number come from? Remember the parser we have created in XMLParser constructor? It has a member called self._parser, which is exactly the xmlparser object, which means…(leave me a msg if it is not clear…)

3. Hack

3.1. Download the Python source file

https://www.python.org/download/
https://www.python.org/ftp/python/2.7.8/Python-2.7.8.tgz (I am a fan of Python 2.X)

3.2. Hack the ElementTree

Python-2.7.8/Lib/xml/etree/ElementTree.py, where we will:

add line number into the Element class and update the corresponding constructor and other methods which will create a new Element instance;
update the TreeBuilder.start() method to have an extra line number argument and call our new Element constructor with line number support;
update the XMLParser._start() method to inject the line number and call our updated TreeBuilder.start() method to create new Element with line number

4. Test

Configure and make your new Python. Then use our new Python to test a simple ElementTree application. Here it goes~

[root@daveti test]# cat pyxml.py
import xml.etree.ElementTree as ET

tree = ET.parse(‘country_data.xml’)
root = tree.getroot()

for e in root:
print e.lineNum, e.tag, e.attrib
[root@daveti test]# /root/python/Python-2.7.8/python ./pyxml.py
3 country {‘name’: ‘Liechtenstein’}
10 country {‘name’: ‘Singapore’}
16 country {‘name’: ‘Panama’}
[root@daveti test]#

5. Code – pyet

All the code is available at my github and follows GPLv2.
https://github.com/daveti/pyet

Posted in Programming | Tagged ElementTree, expat, pyet, Python, xml | 16 Comments