Blog Archives

PYTHON: WEB SCRAPER USING BEAUTIFULSOUP and URLLIB

11/14/2015

#!/usr/bin/env python
#_author = vkremez

# This is an assignment for University of Michigan course on "Using Python to Access Web Data."

# This Python program will allow us to scrape the content of a website for any URLs.

# Here is the algorithm:
'''
The program will use urllib to (1) read the HTML from the given URL, (2) extract the href= values from the anchor tags, (3) find the tag at a particular position, counted from the first name in the list, (4) follow that link, repeat the process a number of times, and report the results.
'''
import argparse
import urllib
from datetime import datetime
from bs4 import BeautifulSoup

# Optionally take the starting URL from the command line.
parser = argparse.ArgumentParser(description='Web Scraper 1.0 by VK.')
parser.add_argument('url', metavar='www', nargs='?', help='http://website.com format')
args = parser.parse_args()

print 'WEB SCRAPER 1.0'
print datetime.now()

# Fall back to prompting when no URL was given as an argument.
url = args.url or raw_input('Enter URL: ')
count = int(raw_input('Enter count: '))
position = int(raw_input('Enter position: '))

print "Retrieving: " + url

# Each pass fetches the current page and follows the link at the
# chosen (1-based) position, so `count` links are followed in total.
for x in range(count):
    html = urllib.urlopen(url).read()
    soup = BeautifulSoup(html, 'html.parser')
    tags = soup('a')
    url = tags[position-1].get('href', None)
    print "Retrieving: " + url
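
The script above targets Python 2. For anyone on Python 3, here is a rough sketch of the same "follow the link at position N" loop using only the standard library. It is not the course's reference solution: html.parser stands in for BeautifulSoup, anchors without an href are skipped, relative links are resolved with urljoin, and the names AnchorCollector, extract_hrefs, fetch_page and follow_links are made up for this example. The fetch parameter exists only so the loop can be tried against canned pages without touching the network.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class AnchorCollector(HTMLParser):
    """Collect href values from <a> tags in document order."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # Assumption: anchors without an href are ignored.
        if tag == 'a':
            href = dict(attrs).get('href')
            if href is not None:
                self.hrefs.append(href)


def extract_hrefs(html):
    """Return the href of every anchor found in an HTML string."""
    collector = AnchorCollector()
    collector.feed(html)
    return collector.hrefs


def fetch_page(url):
    """Download a page and decode it leniently as UTF-8."""
    with urlopen(url) as response:
        return response.read().decode('utf-8', errors='replace')


def follow_links(start_url, count, position, fetch=fetch_page):
    """Follow the link at a 1-based position, up to `count` times.

    Returns the list of URLs visited, starting with start_url.
    Stops early if a page has fewer than `position` links.
    """
    visited = [start_url]
    url = start_url
    for _ in range(count):
        hrefs = extract_hrefs(fetch(url))
        if position > len(hrefs):
            break
        # Resolve relative links against the page they came from.
        url = urljoin(url, hrefs[position - 1])
        visited.append(url)
    return visited
```

Calling follow_links(url, count, position) with the three values the script prompts for returns the chain of URLs that the script prints one by one.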


LET'S CODE: IMPORTANT REGULAR EXPRESSIONS

11/13/2015


SOURCE: http://code.tutsplus.com/tutorials/8-regular-expressions-you-should-know--net-6149

1. Matching a Username
Pattern: /^[a-z0-9_-]{3,16}$/

A. String that matches: my-us3r_n4m3
B. String that doesn't match: th1s1s-wayt00_l0ngt0beausername (too long)

2. Matching a Password
Pattern: /^[a-z0-9_-]{6,18}$/

A. String that matches: myp4ssw0rd
B. String that doesn't match: mypa$$w0rd (contains a dollar sign)

3. Matching a Hex Value
Pattern: /^#?([a-f0-9]{6}|[a-f0-9]{3})$/

A. String that matches: #a3c113
B. String that doesn't match: #4d82h4 (contains the letter h)

4. Matching a Slug
Pattern: /^[a-z0-9-]+$/

A. String that matches: my-title-here
B. String that doesn't match: my_title_here (contains underscores)

5. Matching an Email
Pattern: /^([a-z0-9_\.-]+)@([\da-z\.-]+)\.([a-z\.]{2,6})$/

A. String that matches: john@doe.com
B. String that doesn't match: john@doe.something (TLD is too long)

6. Matching a URL
Pattern: /^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/

A. String that matches: http://net.tutsplus.com/about
B. String that doesn't match: http://google.com/some/file!.html (contains an exclamation point)

7. Matching an IP Address
Pattern: /^(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)$/

8. Matching an HTML Tag
Pattern: /^<([a-z]+)([^<]+)*(?:>(.*)<\/\1>|\s+\/>)$/

A. String that matches: <a href="http://net.tutsplus.com/">Nettuts+</a>
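B. String that the source says doesn't match: <img src="img.jpg" alt="My image>" /> (the stated reason is that attributes can't contain greater-than signs; note, however, that [^<] only excludes less-than signs, so the pattern as written accepts this string)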
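
To try the eight patterns from Python, one option is to drop the surrounding slashes and feed them to the re module as raw strings. The sketch below does that in a small table-driven check. The names CASES, matches and run_cases are invented for this example, the two IP addresses and the long-TLD email address are illustrative inputs chosen here, and the expected values in the table are what Python's re engine returns for the patterns as written, not a claim about other regex engines.

```python
import re

# Each entry: (label, pattern, sample text, expected result of re.search).
CASES = [
    ('username', r'^[a-z0-9_-]{3,16}$', 'my-us3r_n4m3', True),
    ('username', r'^[a-z0-9_-]{3,16}$', 'th1s1s-wayt00_l0ngt0beausername', False),
    ('password', r'^[a-z0-9_-]{6,18}$', 'myp4ssw0rd', True),
    ('password', r'^[a-z0-9_-]{6,18}$', 'mypa$$w0rd', False),
    ('hex', r'^#?([a-f0-9]{6}|[a-f0-9]{3})$', '#a3c113', True),
    ('hex', r'^#?([a-f0-9]{6}|[a-f0-9]{3})$', '#4d82h4', False),
    ('slug', r'^[a-z0-9-]+$', 'my-title-here', True),
    ('slug', r'^[a-z0-9-]+$', 'my_title_here', False),
    ('email', r'^([a-z0-9_\.-]+)@([\da-z\.-]+)\.([a-z\.]{2,6})$',
     'john@doe.com', True),
    # Illustrative address with a nine-letter TLD (an assumption of this sketch).
    ('email', r'^([a-z0-9_\.-]+)@([\da-z\.-]+)\.([a-z\.]{2,6})$',
     'john@doe.something', False),
    ('url', r'^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$',
     'http://net.tutsplus.com/about', True),
    ('url', r'^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$',
     'http://google.com/some/file!.html', False),
    # The list gives no examples for the IP pattern; these two are made up here.
    ('ip', r'^(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}'
           r'(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)$', '192.168.0.1', True),
    ('ip', r'^(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}'
           r'(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)$', '256.1.1.1', False),
    ('html tag', r'^<([a-z]+)([^<]+)*(?:>(.*)<\/\1>|\s+\/>)$',
     '<a href="http://net.tutsplus.com/">Nettuts+</a>', True),
    # [^<] only excludes '<', so the '>' inside the alt attribute is accepted.
    ('html tag', r'^<([a-z]+)([^<]+)*(?:>(.*)<\/\1>|\s+\/>)$',
     '<img src="img.jpg" alt="My image>" />', True),
]


def matches(pattern, text):
    """Return True when the pattern matches somewhere in the text."""
    return re.search(pattern, text) is not None


def run_cases(cases=CASES):
    """Return (label, text) for every case whose result differs from expected."""
    return [(label, text) for label, pattern, text, expected in cases
            if matches(pattern, text) != expected]


if __name__ == '__main__':
    for label, pattern, text, expected in CASES:
        print('%-9s %-50s %s' % (label, text, matches(pattern, text)))
```

One caution before pointing the URL and HTML tag patterns at arbitrary input: both contain a quantified group that is itself repeated, which can make a backtracking engine such as Python's re very slow on long strings that almost match.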

