by Num3Ilia 30 Jan 2024

Python

Lexical analysis

Overview

Python based on the definition which you can find on web here, is an interpreted general purpose (or multi purpose) high level programming language with easy syntax and dynamic semantics.

As you can see, based on the above statement, Python is an interpreted language which differs in approach vs a compiled one. The web is so full of details that I will not dedicate time to explain what is one and what is the other and tackle all caveats from both worlds, just want to elude the mystery and answer that Python is actually a hybrid, slower vs a compiled language and overall is an interpreted language, but under the hood (to increase its speed) its using compiled code to work the magic (PEP339 shares some love on this aspect).

Another thing which you possibly heard of is that Python nativelly cant use more then 1 CPU due to Global Interpreter Lock (GIL) but on this magic later with a dedicated page as this is quite heavy.

Bottom line:

compiled languages are faster due to the fact well compilation (source code is transformed into machine code) happens before the program is run. A caveat for example a compiled machine code for Windows will not work on ARM-based MAC (less portable), etc.
interpreted languages are slower because are not pre-compiled into machine code, instead the source code is translated on the fly (line by line -- each line of code is read, analysed, and executed in real-time --) by an interpreter program. Since the source code is not specifically (generally) compiled is going to work (more portable) on any CPU architecture as long as you have an interpreter for that architecture available.
there are other caveats but rest assured Python has the mojo you need and enough magic "battery included" to sparkle your hidden and deepest dreams.

To find out more about this caveats and if you have the curiosity or desire, here are some good sources to begin with freecodecamp, reddit, stackoverflow, geeksforgeeks, linkedin, beside this references there are many many more and if you still want to elaborate on the topic,, … google is your friend "compiled vs interpreted language" hit enter and enjoy the pletora of examples!

Back to our pleasure, the above means that when you write code (you can visualize you code as writing a story in a writer application), instead you'll be using an IDE (Integrated Development Environment --- text editors (voodoo magic) who recognize Python code and highlight sections as you write, making it easy to understand your code’s structure (visually and syntactically) --- then the computer interprets your story --- python interpreter compiles the source code (your .py program) into bytecode and then the interpreter interprets directly the bytecode or sends it to a JIT compiler if available --- .

During the interpretation all your phrases are being tokenized and this process is called lexing or lexical analysis, these tokens are the basic units like keywords, identifiers, literals, and operators.

So, while lexical analysis is a part of the process that prepares code for interpretation or compilation by breaking it down into understandable parts, interpretation is the act of running the program represented by the code or tokens.

In Python, the lexical analyzer breaks the source code (your story .py program) into tokens that are used by the syntax analyzer to construct the structure of the program. This step is essential before parsing, compiling and interpreting the code, as it simplifies the understanding of the syntax by reducing the source code to a stream of tokens, which represents the syntactical form of your program, without irrelevant information like whitespace and comments.

What this means, you can find them all explained in the python documentation here -> lexical_analysis.

Bottom line, during your story creation, when you are trying to write your code, you are going to notice some SyntaxError on the way … well a big chunk of them are going to originate from not following the lexical guidance in Python, and below what I'm trying to highlight are a couple of examples in which I have stumbled in the beginning when writing my stories (for extensive details, please do take some time and read the documentation shared in the link above, we'll help you!).

Personal Experience

I will share a couple of examples which hopefully will make your life easier when debugging, btw your IDE will help you a lot in this endeavor - here or here you can find a good list/description of them to choose from.

As you know or you will find out, this discussion about IDEs also is a long and subjective one.

Of course your can search for more links/details about IDEs, up to you, my most favorite one is PyCharm and second in line is Spyder, Jupyter I hate it, don’t ask me why but simply I hate it and VSCode somehow its not on my taste at all, but do please do try all of them and see which one you prefer or which one fits your needs (all of them has free versions).

Now lets get back to pip-ing…

Line structure

Your stories, called Python programs, are created by lines of codes, which are delimited logically by a token call NEWLINE.

A logical line is constructed from one or more physical lines by following the explicit or implicit line joining rules, this means there are 2 different elements.

e.g.:

# physical line (is what you see as a line in your text editor, readability)
total = item_one + \
        item_two + \
        item_three

# logical line (is what Python sees as a single statement)
total = item_one + item_two + item_three

Why python allows you to sometimes create logical lines from multiple physical lines ( explicit or implicit) is for readability, we are humans after all.

Below I'm sharing a couple of examples (trying to keep them connected with the official documentation) which guides you to how you can use physical and logical lines in your code.

e.g.:

lst0 = ['a','b','c']

lst1 = ['a','b',
	   'c', 'd', 'e'] 

lst2 = ['a', #item1
	   'b']

lst3 = ['a' # item1,
       'b']

lst4 = ['a' #item1
       ,'b']

aSet0 = ('a' # test
		,'c','d' 
		,'e')
	
aSet1 = ('a'
		,'c','d'
		,'e'
)
		
aDictionary = {'keyA': 'a'
			   ,'keyB': 'b'
}

Comments

As per the documentation, a comment starts with a hash character (#) that is not part of a string literal, and ends at the end of the physical line.

A comment signifies the end of the logical line unless the implicit line joining rules are invoked (comments are ignored by the syntax).

e.g.:

# dropping a comment here
# above examples with comments and correctly aligned 

lst0 = ['a','b','c'] # but also I can write a comment here

lst1 = ['a','b',
        'c', 'd', 'e'] # dropping a comment here

lst2 = ['a', #item1
        'b']

lst3 = ['a' #item1,
        	'b'] #->you are going to have ['ab']

lst4 = ['a' # item1
        	,'b']

aSet0 = ('a' # some comment
         ,'c','d' # some other comment
         ,'e' # more comments) ->not going to work
         )
	
aSet1 = ('a' # some comment
         'c','d' #some other comment
         ,'e' # more comments
         )
		
aDictionary = {'keyA': 'a' # value for key 1
               ,'keyB': 'b' # value for key 2
               }

As you can see, if you are in a list, set, dictionary, function or even an argument, you are allowed to comment and this is called Implicit line joining.

When you provide the conditions for example in a conditional statement you are not allowed to comment instead you are forced to use the backslash to unify the entire condition under one line ( this is called explicit line joining).

You can drop your comment before or at the end of the condition as in the below example.

e.g.:

# commenting my function example
def myFunc (a, # this is the a argument
b, # this is the b argument
c): # this is the c argument 
    print( a, b, c)
    print ( f'({a} -- {b} -- {c})')

myFunc(2,3,4)

Out: 2 3 4
(2 -- 3 -- 4)

# aligned visually
# commenting my function example
def myFunc (a, # this is the a argument
            b, # this is the b argument
            c): # this is the c argument
    print( a, b, c)
    print ( f'({a} -- {b} -- {c})')

myFunc(2,3,4)

Out: 2 3 4
(2 -- 3 -- 4)

Note: You are not allowed to comment after a backslash

Note: Even if you use strings to write a comment, pay attention a string is handled completely different vs a comment.

Now because we are talking about strings, I would like to point out that you can use for a string single quote, double quote and a combination of them multiplied by 3, plus newly introduced from Python 3.6 f-string (I will dedicate a blog post for it, so later).

Note: Anything you type in the 3x single(''' text ''') or double quote(""" text """) is going to be kept in that string and if you format a string for display, make sure it aligns as you desire at the output/or different outputs (visually it can differ on different outputs displays).

In strings you can also use escaped characters which you can find detailed in the Python documentation found in this place. I haven't used this much, so I will not go into details about it.

Note: One point to make out is that multi-line strings are not comments, although they can be used as comments anywhere in your code. Later on we'll be learning about docstrings which has a special method __doc__, who keeps special comments which are made up with multi-line strings.

When Python complies your code, strings which are not assigned to a variables are going to be tossed out (comments) but docstrings are part of your byte code and will be kept.

e.g.:

def addNumbers(x,y):
    """Takes any 2 numbers, adds them and returns the result!"""
    return x+y

print(addNumbers.__doc__)
Takes any 2 numbers, adds them and returns the result!

Another things which I have had some errors was from a syntactic restriction regarding the whitespace which is not allowed between the string prefix or byteprefix and the rest of the literal

e.g.:

# working
variable = b"example" # working
variable = r"example" # working

# not working
variable = b "example" # working
variable = r "example" # working

In our case we discussed about B, b, rb etc I want to point out that I've been using quite a lot the abbreviation r or R which stands for raw strings and treat backslashes as literal characters (are good when you want to read from a folder for example).

e.g.:

# variable to be interpreted as a raw string
variable = r"/Users/dragon/Projects/TradingForex/Data/EURUSD OHLC 1H test.csv"

e.g.:

output0 = """ This is 
    an output
            with spaces and 
                            new lines"""

print(output0)
 This is 
    an output
            with spaces and 
                            new lines

Because we are discussion about readability, which is quite important, I would like to introduce you to PEP8 which connects us lexically to the indentation discussion.

PEP8 it’s a must read for all Python developers and I do recommend you to go thru it. Its about conventions and style guidance when you create Python code, you can follow it or not, its up to you, but don’t forget others will read your code "readability", and especially this fellows have created the PEP20 - Zen of Python -- Beautiful is better than ugly.--

In Python, you are always going to have 4 spaces indentation( from left to right and multiple of 4 every time you want to indent into a function/condition etc. If you are going to replace this with a TAB or a mixture of tabs and spaces, I would like to point out that sometimes (depends on the platform) you can run into issues with your program and get TabError ( I had it, so be careful).

e.g.:

# this is a combination between proper space but not visually aligned (readability)
def perm(l):
        # Compute the list of all permutations of l
    if len(l) <= 1:
                  return [l]
    r = []
    for i in range(len(l)):
             s = l[:i] + l[i+1:]
             p = perm(s)
             for x in p:
              r.append(l[i:i+1] + x)
    return r


# this is properly spaced and visually aligned (4spaces all over)
def perm(l):
    # Compute the list of all permutations of l
    if len(l) <= 1:
        return [l]
    
    r = []
    
    for i in range(len(l)):
        s = l[:i] + l[i+1:]
        p = perm(s)
        for x in p:
            r.append(l[i:i+1] + x)
    
    return r

Wrap-Up

There are also other important guidance in the lexical analysis page shared above, like keywords, string concatenation, numbers, reserved classes and identifiers, but we'll be talking about in other dedicated topics on the blog page and youtube channel.

Remember, when writing code, following Python's lexical rules is crucial to avoid syntax errors and keeps your code readable and maintainable ( you will hear a lot around you about Zen of Python PEP20).

References

PEP 339 – Design of the CPython Compiler
Lexical Analysis Python Documentation
SIMPLE LEXICAL ANALYSER cppsecrets.com
Lexical and Syntax Analysis Georgia State University
PEP 8 – Style Guide for Python Code
PEP20 Zen of Python

Happy PiPing folks!