Now we can modify the previous program (letters_1.py) so that it takes its text from a web page.
    #letters_2.py
    # -*- coding: utf-8 -*-
    # Python 2 script: counts the lowercase letters a-z on a web page.
    url="http://www.telegraph.co.uk/"
    import urllib
    sock = urllib.urlopen(url)
    htmlSource = sock.read()
    s = htmlSource

    # chars[i] holds the number of occurrences of the character with code i+1
    chars = []
    for i in range(255):
        chars.append(0)
    for letter in s:
        indeks=ord(letter)-1
        chars[indeks]+=1

    # keep only the codes 97..122, i.e. the letters 'a'..'z'
    d = len(chars)
    X = []
    Y = []
    for i in range(d):
        if chars[i]>0 and (i+1)>=97 and (i+1)<=122:
            X.append(chr(i+1))
            Y.append(chars[i])

    sum_y = sum(Y)
    print 'All small letters on (the home page)', url, ' ', sum_y
    print '\nThe frequency of letters in %:\n '
    # convert the counts to percentages, rounded to one decimal place
    for i in range(len(X)):
        Y[i] = round(100.0*Y[i]/sum_y,1)
        print '%5s %10.1f' %(X[i], Y[i])
And the results are:
    All small letters on (the home page) http://www.telegraph.co.uk/ 355470

    The frequency of letters in %:

        a        9.1
        b        1.2
        c        3.8
        d        3.8
        e        9.7
        f        2.1
        g        3.1
        h        2.7
        i        7.3
        j        1.4
        k        0.7
        l        4.3
        m        4.3
        n        6.1
        o        5.1
        p        3.6
        q        1.0
        r        5.8
        s        6.8
        t        9.0
        u        2.0
        v        2.6
        w        1.4
        x        0.8
        y        1.5
        z        0.9
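On Python 3, `urllib.urlopen` and the `print` statement used in letters_2.py are no longer available, so the same count has to be written a little differently. The following is only a rough sketch of an equivalent, not the original program: the function names `fetch_text` and `letter_frequencies` are made up for illustration, and decoding the page as UTF-8 is an assumption.

```python
from collections import Counter
from urllib.request import urlopen
import string


def fetch_text(url):
    """Download a page and return it as text (UTF-8 is assumed)."""
    with urlopen(url) as sock:
        return sock.read().decode("utf-8", errors="replace")


def letter_frequencies(text):
    """Return (total, {letter: percent}) for the lowercase letters a-z."""
    counts = Counter(ch for ch in text if ch in string.ascii_lowercase)
    total = sum(counts.values())
    if total == 0:
        return 0, {}
    # Percentages rounded to one decimal place, in alphabetical order
    freqs = {ch: round(100.0 * counts[ch] / total, 1) for ch in sorted(counts)}
    return total, freqs


# Example on a fixed string; for a live page use fetch_text(url) instead.
total, freqs = letter_frequencies("Hello, World!")
print("All small letters:", total)
for letter, percent in freqs.items():
    print("%5s %10.1f" % (letter, percent))
```

As in letters_2.py, only letters that actually occur are reported, and capital letters are ignored rather than folded to lowercase.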
We can compare these results from letters_2.py with the ones from letters.py. With a little more code we can also produce a bar chart that visualizes the letter frequencies. We'll use Bokeh charts and a data frame from the pandas package, so we append the following code:
    import pandas as pd
    df = pd.DataFrame({'letters': X, 'freq': Y})

    from bokeh.charts import Bar, output_file, show
    p = Bar(df, 'letters', values='freq',
            title="The frequency of letters in English texts",
            bar_width=0.4, ylabel="%", color="green", legend=False)
    output_file("letters.html")
    show(p)
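The `bokeh.charts` module was later dropped from the main Bokeh package, so the chart code may not run on a current installation. For a quick look at the numbers without any plotting library, a text-only chart is one possible fallback. The sketch below is written for Python 3 and is not a replacement for the Bokeh chart; `text_bar_chart` and its `scale` parameter are hypothetical names, and the three sample values are copied from the results table.

```python
def text_bar_chart(letters, freqs, scale=2.0):
    """Return one line per letter, with a bar of '#' proportional to the percentage."""
    lines = []
    for letter, percent in zip(letters, freqs):
        bar = "#" * int(round(percent * scale))
        lines.append("%s %5.1f %s" % (letter, percent, bar))
    return lines


# Three values copied from the results table; the lists X and Y built by
# letters_2.py could be passed in the same way.
for line in text_bar_chart(["a", "e", "t"], [9.1, 9.7, 9.0]):
    print(line)
```

With `scale=2.0` each `#` stands for roughly half a percentage point, which keeps the longest bar to about twenty characters for data like this.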