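Now we can modify the previous program (letters_1.py) so that it takes its text from a web page. Note that it counts every lowercase letter in the raw HTML source, so markup and scripts are counted along with the visible text.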
#letters_2.py
# -*- coding: utf-8 -*-
import urllib

url = "http://www.telegraph.co.uk/"

# Download the raw HTML source of the page (Python 2 urllib)
sock = urllib.urlopen(url)
htmlSource = sock.read()
s = htmlSource

# chars[k] holds the count of the character with code k+1
chars = []
for i in range(255):
    chars.append(0)
for letter in s:
    indeks = ord(letter) - 1
    chars[indeks] += 1

# Keep only the lowercase letters a-z (character codes 97..122)
d = len(chars)
X = []
Y = []
for i in range(d):
    if chars[i] > 0 and (i+1) >= 97 and (i+1) <= 122:
        X.append(chr(i+1))
        Y.append(chars[i])

# Convert the counts to percentages and print them
sum_y = sum(Y)
print 'All small letters on (the home page)', url, ' ', sum_y
print '\nThe frequency of letters in %:\n '
for i in range(len(X)):
    Y[i] = round(100.0*Y[i]/sum_y, 1)
    print '%5s %10.1f' % (X[i], Y[i])
And the results are:
All small letters on (the home page) http://www.telegraph.co.uk/ 355470
The frequency of letters in %:
a 9.1
b 1.2
c 3.8
d 3.8
e 9.7
f 2.1
g 3.1
h 2.7
i 7.3
j 1.4
k 0.7
l 4.3
m 4.3
n 6.1
o 5.1
p 3.6
q 1.0
r 5.8
s 6.8
t 9.0
u 2.0
v 2.6
w 1.4
x 0.8
y 1.5
z 0.9
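The counting loop in letters_2.py can also be written more compactly. What follows is only a minimal sketch in Python 3, not the program used for the table: it assumes the page text is already available as a string, and the helper name `letter_frequencies` is made up for this illustration. It relies on `collections.Counter` from the standard library and rounds the percentages to one decimal place, as the program does.

```python
from collections import Counter
import string


def letter_frequencies(text):
    """Return {letter: percent} for the lowercase ASCII letters in text."""
    # Count only the characters a-z, as letters_2.py does
    counts = Counter(ch for ch in text if ch in string.ascii_lowercase)
    total = sum(counts.values())
    if total == 0:
        # No lowercase letters at all, so there is nothing to report
        return {}
    # Percentages rounded to one decimal place, in alphabetical order
    return {ch: round(100.0 * counts[ch] / total, 1) for ch in sorted(counts)}


if __name__ == "__main__":
    # Hypothetical sample input; tags are counted too, like in the original
    sample = "<html><body>Hello, web page!</body></html>"
    for letter, percent in letter_frequencies(sample).items():
        print('%5s %10.1f' % (letter, percent))
```

Uppercase letters, digits and punctuation are simply ignored, so the percentages always refer to the lowercase letters only.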
We can compare these results with the ones from letters.py. With a little more code we can also draw a bar chart of the letter frequencies. We'll use Bokeh charts and a pandas DataFrame, so we append the following code:
import pandas as pd
from bokeh.charts import Bar, output_file, show

df = pd.DataFrame(
    {'letters': X,
     'freq': Y
    })
p = Bar(df, 'letters', values='freq',
        title="The frequency of letters in English texts",
        bar_width=0.4, ylabel="%",
        color="green", legend=False)
output_file("letters.html")
show(p)
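If no plotting library is at hand, the same numbers can be previewed as a plain text chart. This is just a sketch in Python 3 using the standard library only; the function name `text_bar_chart` and its `width` parameter are assumptions made for this example, and the sample values are a few rows copied from the results table earlier in this post.

```python
def text_bar_chart(freqs, width=40):
    """Return one chart line per letter, highest frequency first."""
    if not freqs:
        return []
    largest = max(freqs.values())
    if largest <= 0:
        return []
    lines = []
    # Sort by descending frequency, then alphabetically for ties
    for letter, percent in sorted(freqs.items(), key=lambda kv: (-kv[1], kv[0])):
        # Scale each bar so that the most frequent letter fills `width` characters
        bar = "#" * int(round(width * percent / largest))
        lines.append("%s %5.1f %s" % (letter, percent, bar))
    return lines


if __name__ == "__main__":
    # A few values taken from the results table
    sample = {"e": 9.7, "a": 9.1, "t": 9.0, "q": 1.0}
    print("\n".join(text_bar_chart(sample)))
```

Sorting by frequency rather than alphabetically makes the most common letters stand out at the top of the chart.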
