#!/usr/bin/env python3
# _author = vkremez
# This is an assignment for the University of Michigan course "Using Python to Access Web Data."
# This Python program scrapes a web page for URLs and repeatedly follows one of them.
# Here is the algorithm:
'''
The program will use urllib to
(1) read the HTML from the website,
(2) extract the href= values from the anchor tags,
(3) scan for the tag that is in a particular position relative to the first name in the list,
(4) follow that link, repeat the process a number of times, and report the results.
'''
import argparse
from datetime import datetime
from urllib.request import urlopen

from bs4 import BeautifulSoup

# The URL may be passed on the command line; otherwise the program prompts for it.
parser = argparse.ArgumentParser(description='Web Scraper 1.0 by VK.')
parser.add_argument('url', metavar='www', nargs='?', help='http://website.com format')
args = parser.parse_args()

print('WEB SCRAPER 1.0')
print(datetime.now())

url = args.url or input('Enter URL: ')
count = int(input('Enter count: '))
position = int(input('Enter position: '))

print('Retrieving: ' + url)
for x in range(count):
    # Read the page and collect its anchor tags.
    html = urlopen(url).read()
    soup = BeautifulSoup(html, 'html.parser')
    tags = soup('a')
    # Follow the link at the requested (1-based) position.
    url = tags[position - 1].get('href', None)
    print('Retrieving: ' + url)
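The link-following loop described in the algorithm can also be tried without touching the network. What follows is only a minimal sketch under stated assumptions: the pages live in a hypothetical in-memory dictionary named PAGES, the helper names (AnchorCollector, extract_hrefs, follow_links) are invented for illustration, and the standard library's html.parser stands in for BeautifulSoup so that nothing needs to be installed.

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Record one entry per <a> tag, in document order (None if it has no href)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.hrefs.append(dict(attrs).get("href"))


def extract_hrefs(html):
    """Return the href of every anchor tag found in the HTML string."""
    collector = AnchorCollector()
    collector.feed(html)
    return collector.hrefs


# Hypothetical stand-in for the web: URL -> HTML body (assumption, not real pages).
PAGES = {
    "http://example.com/a": '<a href="http://example.com/b">B</a>'
                            '<a href="http://example.com/c">C</a>',
    "http://example.com/b": '<a href="http://example.com/c">C</a>'
                            '<a href="http://example.com/a">A</a>',
    "http://example.com/c": '<a href="http://example.com/a">A</a>'
                            '<a href="http://example.com/b">B</a>',
}


def follow_links(start_url, count, position, fetch=PAGES.__getitem__):
    """Follow the link at 1-based `position` `count` times; return all URLs seen."""
    visited = [start_url]
    url = start_url
    for _ in range(count):
        hrefs = extract_hrefs(fetch(url))
        url = hrefs[position - 1]
        visited.append(url)
    return visited


if __name__ == "__main__":
    for url in follow_links("http://example.com/a", count=3, position=2):
        print("Retrieving: " + url)
```

Passing a different fetch function (for example one that wraps urlopen) would point the same loop at real pages; keeping the fetch step as a parameter is just a convenience for testing.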
SOURCE: http://code.tutsplus.com/tutorials/8-regular-expressions-you-should-know--net-6149

1. Matching a Username
   Pattern: /^[a-z0-9_-]{3,16}$/
   A. String that matches: my-us3r_n4m3
   B. String that doesn't match: th1s1s-wayt00_l0ngt0beausername (too long)

2. Matching a Password
   Pattern: /^[a-z0-9_-]{6,18}$/
   A. String that matches: myp4ssw0rd
   B. String that doesn't match: mypa$$w0rd (contains a dollar sign)

3. Matching a Hex Value
   Pattern: /^#?([a-f0-9]{6}|[a-f0-9]{3})$/
   A. String that matches: #a3c113
   B. String that doesn't match: #4d82h4 (contains the letter h)

4. Matching a Slug
   Pattern: /^[a-z0-9-]+$/
   A. String that matches: my-title-here
   B. String that doesn't match: my_title_here (contains underscores)

5. Matching an Email
   Pattern: /^([a-z0-9_\.-]+)@([\da-z\.-]+)\.([a-z\.]{2,6})$/
   A. String that matches: john@doe.com
   B. String that doesn't match: john@doe.something (TLD is too long)

6. Matching a URL
   Pattern: /^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/
   A. String that matches: http://net.tutsplus.com/about
   B. String that doesn't match: http://google.com/some/file!.html (contains an exclamation point)

7. Matching an IP Address
   Pattern: /^(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)$/

8. Matching an HTML Tag
   Pattern: /^<([a-z]+)([^<]+)*(?:>(.*)<\/\1>|\s+\/>)$/
   A. String that matches: <a href="http://net.tutsplus.com/">Nettuts+</a>
   B. String intended not to match: <img src="img.jpg" alt="My image>" /> (attributes shouldn't contain greater-than signs). As written, though, [^<]+ also accepts ">", so the pattern still matches this string.
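The patterns above are written as /.../ regex literals. To try them from Python, one option is to drop the delimiters and compile them with the re module. The sketch below does that and checks each listed string against the result the list says to expect. The dictionary keys and helper names are made up for illustration, and the two IP address strings are assumptions added here, since the list gives no examples for pattern 7.

```python
import re

# Patterns from the list, with the surrounding /.../ delimiters removed.
PATTERNS = {
    "username": r"^[a-z0-9_-]{3,16}$",
    "password": r"^[a-z0-9_-]{6,18}$",
    "hex": r"^#?([a-f0-9]{6}|[a-f0-9]{3})$",
    "slug": r"^[a-z0-9-]+$",
    "email": r"^([a-z0-9_\.-]+)@([\da-z\.-]+)\.([a-z\.]{2,6})$",
    "url": r"^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$",
    "ip": r"^(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}"
          r"(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)$",
    "html_tag": r"^<([a-z]+)([^<]+)*(?:>(.*)<\/\1>|\s+\/>)$",
}

# (pattern name, test string, result the list says to expect)
CASES = [
    ("username", "my-us3r_n4m3", True),
    ("username", "th1s1s-wayt00_l0ngt0beausername", False),
    ("password", "myp4ssw0rd", True),
    ("password", "mypa$$w0rd", False),
    ("hex", "#a3c113", True),
    ("hex", "#4d82h4", False),
    ("slug", "my-title-here", True),
    ("slug", "my_title_here", False),
    ("email", "john@doe.com", True),
    ("email", "john@doe.something", False),
    ("url", "http://net.tutsplus.com/about", True),
    ("url", "http://google.com/some/file!.html", False),
    # Assumed examples: the list gives no test strings for the IP pattern.
    ("ip", "192.168.0.1", True),
    ("ip", "256.60.124.136", False),
    ("html_tag", '<a href="http://net.tutsplus.com/">Nettuts+</a>', True),
    # Intended as a non-match, but [^<]+ accepts ">" so re reports a match.
    ("html_tag", '<img src="img.jpg" alt="My image>" />', False),
]


def check(name, text):
    """Return True if the named pattern matches the whole string."""
    return re.match(PATTERNS[name], text) is not None


def run_cases():
    """Return (name, text, expected, actual) for every case."""
    return [(n, t, e, check(n, t)) for n, t, e in CASES]


if __name__ == "__main__":
    for name, text, expected, actual in run_cases():
        note = "ok" if expected == actual else "differs from the list"
        print(f"{name:9} {text!r}: {actual} ({note})")
```

Every case except the last agrees with the list. The <img> string is accepted because [^<]+ excludes only "<", which is a reminder to test a borrowed pattern against real input before relying on it.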
Author: Vitali Kremez