5/10/2017

Telegraph vs Guardian


Now we change the letters_2.py program to make it more flexible. First, we define a function get_text that takes some text (a string) as an argument and returns three lists:
  • a list of the lowercase letters found in the text (for the x axis),
  • a list with the number of occurrences of each of these letters,
  • a list of their rates (percentages of all counted letters).
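
To make the three lists concrete, here is a minimal sketch of the same computation on a short string. It uses collections.Counter instead of manual counting, and the name letter_stats is a hypothetical stand-in for get_text, so treat it as an illustration rather than the program itself.

```python
from collections import Counter
import string


def letter_stats(text):
    """Hypothetical stand-in for get_text: returns (letters, amounts, rates)."""
    # Count only the lowercase ASCII letters a-z; everything else is ignored.
    counts = Counter(ch for ch in text if ch in string.ascii_lowercase)
    letters = sorted(counts)                  # letters found, alphabetical order
    amounts = [counts[ch] for ch in letters]  # occurrences of each letter
    total = sum(amounts)
    # Rate of each letter as a percentage of all counted letters (one decimal).
    rates = [round(100.0 * n / total, 1) for n in amounts]
    return letters, amounts, rates


letters, amounts, rates = letter_stats("Banana bread")
print(letters)  # ['a', 'b', 'd', 'e', 'n', 'r']
print(amounts)  # [4, 1, 1, 1, 2, 1]
print(rates)    # [40.0, 10.0, 10.0, 10.0, 20.0, 10.0]
```

The capital B is skipped because only lowercase letters are counted, so the rates are taken out of 10 letters, not 11.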

Then we read the home pages of the chosen websites into strings and pass each of them to the function get_text. We pass the returned lists to Bokeh's figure function and line method to generate an HTML page with the plot. From this page we can save the plot as a PNG image.
#letters_3.py
# -*- coding: utf-8 -*-

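# Written for Python 3: urllib.request replaces urllib.urlopen, print is a function.
from urllib.request import urlopen

from bokeh.plotting import figure, output_file, show


def get_text(s):
    # chars[i] counts the occurrences of the i-th lowercase letter (a-z)
    chars = [0] * 26
    for letter in s:
        if 'a' <= letter <= 'z':
            chars[ord(letter) - ord('a')] += 1

    A = []  # letters found in the text (x axis)
    B = []  # number of occurrences of each letter
    for i in range(26):
        if chars[i] > 0:
            A.append(chr(ord('a') + i))
            B.append(chars[i])

    # rate of each letter as a percentage of all counted letters
    sum_b = sum(B)
    C = [round(100.0 * b / sum_b, 1) for b in B]
    return A, B, C


def read_page(url):
    # download the page and decode the bytes to a string
    with urlopen(url) as sock:
        return sock.read().decode('utf-8', errors='replace')


url1 = "http://www.telegraph.co.uk/"
url2 = "http://www.guardian.co.uk/"

htmlSource1 = read_page(url1)
htmlSource2 = read_page(url2)

X1, Y1, Y = get_text(htmlSource1)
X2, Z1, Z = get_text(htmlSource2)

print('number of letters')
print('Telegraph:', sum(Y1))
print('Guardian:', sum(Z1))

# use the letters found on either page as the categories of the x axis
X = sorted(set(X1) | set(X2))

output_file("letters3.html")
p = figure(x_range=X, title='Telegraph letters vs Guardian letters')
# legend_label replaces the older legend argument in recent Bokeh releases
p.line(X1, Y, legend_label="telegraph", line_width=3)
p.line(X2, Z, legend_label="guardian", color='red', line_width=3)

show(p)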
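
Each call to get_text returns only the letters that actually occur in its text, so the two pages could in principle produce letter lists of different lengths. A minimal sketch of one way to align them before plotting, assuming a hypothetical helper pad_to_alphabet (not part of letters_3.py) and made-up sample values:

```python
import string


def pad_to_alphabet(letters, rates):
    """Hypothetical helper: spread rates over all 26 letters, 0.0 where absent."""
    found = dict(zip(letters, rates))
    alphabet = list(string.ascii_lowercase)
    return alphabet, [found.get(ch, 0.0) for ch in alphabet]


# Two made-up results in the shape returned by get_text (letters and rates).
x1, y1 = pad_to_alphabet(['a', 'b', 'z'], [50.0, 25.0, 25.0])
x2, y2 = pad_to_alphabet(['a', 'c'], [60.0, 40.0])

print(x1 == x2)          # True: both series now share the same x values
print(len(y1), len(y2))  # 26 26
```

With both rate lists padded this way, either letter list can serve as the x_range of the figure.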
The chart says it all.
