Vitali Kremez

PYTHON: WEB SCRAPER USING BEAUTIFULSOUP AND URLLIB

11/14/2015

1 Comment

 
#!/usr/bin/env python
#_author = vkremez

# This is an assignment for University of Michigan course on "Using Python to Access Web Data."


# This Python program will allow us to scrape the content of a website for any URLs. 

# Here is the algorithm:
'''

The program uses urllib to (1) read the HTML of the page at the given URL, (2) extract the href= values from the anchor tags, (3) pick the tag at a particular position, counted from the first link in the list, (4) follow that link, repeat the process a given number of times, and report the results.
'''
import argparse
import urllib.request
from datetime import datetime
from bs4 import BeautifulSoup

print('WEB SCRAPER 1.0')
print(datetime.now())

# The starting URL may be passed on the command line; otherwise prompt for it.
parser = argparse.ArgumentParser(description='Web Scraper 1.0 by VK.')
parser.add_argument('url', metavar='www', nargs='?', help='http://website.com format')
args = parser.parse_args()

url = args.url or input('Enter URL: ')
count = int(input('Enter count: '))
position = int(input('Enter position: '))

print('Retrieving: ' + url)

# Follow the link at the given (1-based) position, count times in a row.
for x in range(count):
    html = urllib.request.urlopen(url).read()
    soup = BeautifulSoup(html, 'html.parser')
    tags = soup('a')  # all anchor tags, in document order
    url = tags[position - 1].get('href', None)
    print('Retrieving: ' + url)
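
The link-following steps from the algorithm description can also be sketched as a small function that is testable without network access. This is only an illustration under stated assumptions: it swaps BeautifulSoup for the standard library's html.parser so it has no third-party dependency, it skips anchors that have no href, and the names AnchorCollector, extract_hrefs, follow_links, the fetch callable, and the example.test pages are all hypothetical, not part of the original assignment.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class AnchorCollector(HTMLParser):
    """Collect the href value of every <a> tag, in document order."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            href = dict(attrs).get('href')
            if href is not None:  # anchors without href are skipped
                self.hrefs.append(href)


def extract_hrefs(html):
    collector = AnchorCollector()
    collector.feed(html)
    return collector.hrefs


def follow_links(fetch, url, count, position):
    """Follow the link at 1-based `position` on each page, `count` times.

    `fetch` is any callable mapping a URL to HTML text (a hypothetical hook
    standing in for urllib). Returns the URLs visited, starting with `url`.
    """
    visited = [url]
    for _ in range(count):
        hrefs = extract_hrefs(fetch(url))
        if position > len(hrefs):
            break  # fewer links than requested: stop early
        url = urljoin(url, hrefs[position - 1])  # resolve relative links
        visited.append(url)
    return visited


# Tiny in-memory "site" standing in for real pages (made-up URLs).
PAGES = {
    'http://example.test/a': '<a href="/x">x</a><a href="b">b</a>',
    'http://example.test/b': '<a href="/y">y</a><a href="http://example.test/c">c</a>',
    'http://example.test/c': '<p>no links</p>',
}

print(follow_links(PAGES.__getitem__, 'http://example.test/a', 3, 2))
```

Passing the fetch step in as a parameter keeps the parsing and link-walking logic separate from the download, so a real run could pass a function built on urllib.request.urlopen instead of the dictionary lookup.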

LET'S CODE: IMPORTANT REGULAR EXPRESSIONS

11/13/2015

0 Comments

 

SOURCE: http://code.tutsplus.com/tutorials/8-regular-expressions-you-should-know--net-6149


1. Matching a Username
Pattern: /^[a-z0-9_-]{3,16}$/

A. String that matches: my-us3r_n4m3
B. String that doesn't match: th1s1s-wayt00_l0ngt0beausername (too long)
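
For anyone trying this pattern from Python rather than JavaScript, here is a minimal sketch. The /.../ delimiters are dropped, and re.fullmatch is used because Python's $ also matches just before a trailing newline. The helper name is_valid_username is an illustrative assumption, not something from the source article.

```python
import re

# Same pattern as above, without the JavaScript-style /.../ delimiters.
USERNAME = re.compile(r'^[a-z0-9_-]{3,16}$')


def is_valid_username(name):
    """Return True if `name` is 3-16 chars of a-z, 0-9, underscore or hyphen."""
    # fullmatch must consume the whole string, so a trailing "\n" is rejected;
    # with plain match(), "$" would let it through.
    return USERNAME.fullmatch(name) is not None


print(is_valid_username('my-us3r_n4m3'))
print(is_valid_username('th1s1s-wayt00_l0ngt0beausername'))
print(is_valid_username('my-us3r\n'))
```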

2. Matching a Password
Pattern: /^[a-z0-9_-]{6,18}$/

A. String that matches: myp4ssw0rd
B. String that doesn't match: mypa$$w0rd (contains a dollar sign)

3. Matching a Hex Value
Pattern: /^#?([a-f0-9]{6}|[a-f0-9]{3})$/

A. String that matches: #a3c113
B. String that doesn't match: #4d82h4 (contains the letter h)
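
One thing worth noticing is that the pattern only lists lowercase a-f, so an uppercase value such as #FFF is rejected unless the match is made case-insensitive. A short Python sketch of the difference follows; the names HEX_COLOR and HEX_COLOR_ANY_CASE and the #FFF sample are assumptions added for illustration.

```python
import re

HEX_COLOR = re.compile(r'^#?([a-f0-9]{6}|[a-f0-9]{3})$')
# Same pattern, but ignoring case so that A-F is accepted as well.
HEX_COLOR_ANY_CASE = re.compile(HEX_COLOR.pattern, re.IGNORECASE)

for value in ['#a3c113', '#4d82h4', '#FFF']:
    strict = HEX_COLOR.fullmatch(value) is not None
    relaxed = HEX_COLOR_ANY_CASE.fullmatch(value) is not None
    print(value, strict, relaxed)
```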

4. Matching a Slug
Pattern: /^[a-z0-9-]+$/

A. String that matches: my-title-here
B. String that doesn't match: my_title_here (contains underscores)

5. Matching an Email
Pattern: /^([a-z0-9_\.-]+)@([\da-z\.-]+)\.([a-z\.]{2,6})$/

A. String that matches: john@doe.com
B. String that doesn't match: john@doe.something (TLD is too long)
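
The three parenthesised groups in this pattern capture the local part, the domain, and the TLD. A minimal Python sketch of reading them back is below; the function name split_email is a hypothetical helper, and the pattern itself is deliberately simple (for example, it rejects uppercase letters and "+" in the local part).

```python
import re

EMAIL = re.compile(r'^([a-z0-9_\.-]+)@([\da-z\.-]+)\.([a-z\.]{2,6})$')


def split_email(address):
    """Return (local part, domain, TLD), or None if the pattern rejects it."""
    m = EMAIL.fullmatch(address)
    return m.groups() if m else None


print(split_email('john@doe.com'))
print(split_email('john@doe.something'))
print(split_email('john+news@doe.com'))
```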

6. Matching a URL
Pattern: /^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/

A. String that matches: http://net.tutsplus.com/about
B. String that doesn't match: http://google.com/some/file!.html (contains an exclamation point)

7. Matching an IP Address
Pattern: /^(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)$/
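
No sample strings are listed for this pattern, so here is a small Python sketch that tries it on a few addresses. The sample addresses are my own picks for illustration, not taken from the source. Each octet alternative covers 250-255, 200-249, and 0-199 in turn, which is why 256 is rejected.

```python
import re

IPV4 = re.compile(
    r'^(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}'
    r'(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)$'
)

# Two well-formed addresses, one with an octet above 255, one with too few octets.
for candidate in ['192.168.1.1', '255.255.255.255', '256.60.124.136', '1.2.3']:
    print(candidate, IPV4.fullmatch(candidate) is not None)
```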


8. Matching an HTML Tag
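Pattern: /^<([a-z]+)([^<]+)*(?:>(.*)<\/\1>|\s+\/>)$/

A. String that matches: <a href="http://net.tutsplus.com/">Nettuts+</a>
B. String the source gives as a non-match: <img src="img.jpg" alt="My image>" /> (on the grounds that attributes can't contain greater-than signs). As written, though, [^<]+ also accepts ">", so the pattern does match this string.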
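
A quick Python check of this pattern is sketched below. It shows the \1 backreference forcing the closing tag to repeat the opening tag's name, and it also shows that the img string above is accepted by the pattern as written, because [^<]+ happily consumes ">". The <b>bold</i> sample is an assumption added for illustration, and inputs are kept short because the nested quantifier ([^<]+)* can backtrack heavily on long strings that fail to match.

```python
import re

HTML_TAG = re.compile(r'^<([a-z]+)([^<]+)*(?:>(.*)<\/\1>|\s+\/>)$')

link = HTML_TAG.match('<a href="http://net.tutsplus.com/">Nettuts+</a>')
print(link.group(1))  # tag name captured by the first group
print(link.group(3))  # text between the opening and closing tags

# The self-closing branch (\s+\/>) accepts this, ">" inside the attribute or not.
img = HTML_TAG.match('<img src="img.jpg" alt="My image>" />')
print(img is not None)

# Closing tag name differs from the opening one, so \1 cannot match.
mismatch = HTML_TAG.match('<b>bold</i>')
print(mismatch is not None)
```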


    Author

    Vitali Kremez
    The Coder
