5/10/2017

Counting letters on a web page

Now we can modify the previous program (letters_1.py) so that it takes its text from a web page instead.

#letters_2.py
# -*- coding: utf-8 -*-

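import urllib

url = "http://www.telegraph.co.uk/"

# Download the raw HTML source of the page
# (Python 2; in Python 3 the function is urllib.request.urlopen)
sock = urllib.urlopen(url)
htmlSource = sock.read()
sock.close()

s = htmlSource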

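# One counter per byte value (0-255), indexed directly by character code
chars = [0] * 256

for letter in s:
    chars[ord(letter)] += 1

X = []    # lowercase letters found on the page
Y = []    # number of occurrences of each letter

# Keep only the lowercase ASCII letters 'a'..'z' (codes 97-122)
for code in range(ord('a'), ord('z') + 1):
    if chars[code] > 0:
        X.append(chr(code))
        Y.append(chars[code])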

sum_y = sum(Y)
print 'All small letters on (the home page)', url, ' ', sum_y
print '\nThe frequency of letters in %:\n '

for i in range(len(X)):
    Y[i] = round(100.0*Y[i]/sum_y,1)
    print '%5s %10.1f' %(X[i], Y[i])

And the results are:

All small letters on (the home page) http://www.telegraph.co.uk/   355470

The frequency of letters in %:
 
    a        9.1
    b        1.2
    c        3.8
    d        3.8
    e        9.7
    f        2.1
    g        3.1
    h        2.7
    i        7.3
    j        1.4
    k        0.7
    l        4.3
    m        4.3
    n        6.1
    o        5.1
    p        3.6
    q        1.0
    r        5.8
    s        6.8
    t        9.0
    u        2.0
    v        2.6
    w        1.4
    x        0.8
    y        1.5
    z        0.9
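
The counting loop above can also be written more compactly with collections.Counter from the standard library. What follows is only a minimal sketch, assuming Python 3. The function name letter_frequencies and the sample string are made up for illustration, and nothing is downloaded.

```python
from collections import Counter
import string


def letter_frequencies(text):
    """Return {letter: percent} for the lowercase ASCII letters in text."""
    # Count only 'a'..'z', as the program above does
    counts = Counter(ch for ch in text if ch in string.ascii_lowercase)
    total = sum(counts.values())
    if total == 0:
        return {}
    # Convert the counts to percentages rounded to one decimal place
    return {ch: round(100.0 * n / total, 1) for ch, n in sorted(counts.items())}


# Hypothetical sample text standing in for the downloaded page
sample = "<p>Hello world</p>"
for letter, percent in letter_frequencies(sample).items():
    print('%5s %10.1f' % (letter, percent))
```

To use it on a real page in Python 3, remember that read() returns bytes, so the source has to be decoded first, for example with htmlSource.decode('utf-8', errors='replace').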

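We can compare these results with the ones from letters.py, keeping in mind that here the counts cover the whole HTML source of the page (tags and scripts included), not only the visible text. With a few more lines of code we can also draw a bar chart of the letter frequencies. We'll use Bokeh charts and a DataFrame from the pandas package, so we append the following code: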

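import pandas as pd
# bokeh.charts was part of Bokeh when this was written;
# later releases moved it to the separate bkcharts package
from bokeh.charts import Bar, output_file, show

# Put the letters and their percentages into a data frame
df = pd.DataFrame(
    {'letters': X,
     'freq': Y
    })

# One bar per letter, bar height = frequency in %
p = Bar(df, 'letters', values='freq',
        title="The frequency of letters in English texts",
        bar_width=0.4, ylabel="%",
        color="green", legend=False)

# Write the chart to letters.html and open it in the browser
output_file("letters.html")
show(p)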
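
If Bokeh is not installed, a rough text version of the same chart can be printed in the console. This is a minimal sketch, assuming Python 3. The helper text_bars and its scale parameter are hypothetical names, and the sample values are just the first three rows of the results listed earlier.

```python
def text_bars(letters, freqs, scale=2):
    """Build one line per letter: the letter, its percentage and a bar of '#'."""
    lines = []
    for letter, freq in zip(letters, freqs):
        # Bar length is proportional to the percentage
        bar = '#' * int(round(freq * scale))
        lines.append('%s %5.1f %s' % (letter, freq, bar))
    return lines


# First three rows of the results listed earlier
X = ['a', 'b', 'c']
Y = [9.1, 1.2, 3.8]
print('\n'.join(text_bars(X, Y)))
```

In the full program X and Y are already filled in, so only the function and the final print line would be needed.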
