Jul-15-2019, 01:50 AM
# Justia Court Opinion Scraper
# Works - Scrapes opinion with HTML tags
# Works - Scrapes opinion with HTML tags stripped
# Works - Write to CSV with HTML tags
# Works - Write to CSV without HTML tags
# July 14, 2019
# localhost and law.justia.com are interchangeable!
from urllib.request import urlopen
from bs4 import BeautifulSoup
#html = urlopen("http://localhost/cases/federal/appellate-courts/F2/1/18/1506993/")
html = urlopen("http://localhost/cases/federal/appellate-courts/F2/999/663/308588/")
#html = urlopen("http://localhost/cases/federal/appellate-courts/F3/491/1/510017/")
#html = urlopen("http://localhost/cases/federal/us/385/206/case.html") <--- DOES NOT WORK with id="opinion"
# Name the parser explicitly so bs4 does not guess one (and warn about it)
bsObj = BeautifulSoup(html.read(), "html.parser")
#bsObj.findAll(id="opinion")
allOpinion = bsObj.findAll(id="opinion")
# Want the TITLE of the Page in a Variable
import pymysql
url = "http://localhost/cases/federal/appellate-courts/F2/999/663/308588/"
allTitle = bsObj.findAll("title")
allURL = url
#print(allOpinion[0].get_text())
# ^ Will Strip HTML tags and only store plain-text
# Column 1 [ ]
# / / of URL (third to last) (i.e /1/)
# Column 2 [ ]
# / / of URL (second to last) (i.e /18)
# Column 3 [ ]
# / / of URL (last) (i.e /1506993/)
# Column 4 [ allOpinion w/ HTML Tags ]
# Column 5 [ allOpinion w/ Stripped HTML Tags - Plaintext lump ]
# Store allOpinion to CSV File w/ Tags
db = pymysql.connect(host="localhost",
                     user="brandon",
                     password="_yLKVPSiTfEQowz_v745H5xKSUkFDUyvtyW_",
                     db="JustiaPython",
                     charset='utf8')
print(allOpinion)
print(allTitle)
print(allURL)
import csv
csvRow = [allOpinion,allTitle,allURL]
csvfile = "current_F2_opinion_with_tags_current.csv"
# newline="" is what the csv module expects; it avoids blank rows on Windows
with open(csvfile, "a", newline="") as fp:
    wr = csv.writer(fp, dialect='excel')
    wr.writerow(csvRow)
    # wr.writerow(['1'])
# ^ Works with retaining all the HTML tags; NEXT - Store allOpinion to a CSV, then MySQL.
# Loop w/ Stripping HTML Tags for allOpinion and its CSV output
print(allOpinion[0].get_text(),url)
import csv
csvRow = [allOpinion[0].get_text(),allTitle[0].get_text(),allURL]
csvfile = "current_F2_opinion_without_tags_current.csv"
with open(csvfile, "a", newline="") as fp:
    wr = csv.writer(fp, dialect='excel')
    wr.writerow(csvRow)
    # wr.writerow(['1'])
I am trying to figure out a few things to make this a functional script. I would like to learn how to make pymysql work correctly so that I can create a row with allTitle, allURL, and allOpinion in MariaDB and append new results as they are scraped.
I am also trying to figure out how to store certain parts of the URL as variables, such as "999", "663", and "308588".
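To make the two questions above concrete, here is a minimal sketch of what I have in mind, not a tested solution. It assumes a table named opinions with columns volume, page, case_id, url, title, opinion_html and opinion_text; the table, the column names, and both function names are hypothetical. Calling the three URL pieces volume, page and case ID is also my own guess at what they mean. The insert uses %s placeholders, which pymysql fills in safely, and commits afterwards because pymysql does not autocommit by default.

```python
from urllib.parse import urlparse


def split_case_url(url):
    """Return the last three path segments of a case URL.

    The names volume, page and case_id are assumptions about what
    the segments mean; the URL itself only provides the values.
    """
    segments = [part for part in urlparse(url).path.split("/") if part]
    if len(segments) < 3:
        raise ValueError(f"expected at least three path segments in {url!r}")
    volume, page, case_id = segments[-3:]
    return volume, page, case_id


def save_opinion(conn, url, title, opinion_html, opinion_text):
    """Insert one opinion row through a DB-API connection.

    conn is expected to be the object returned by pymysql.connect().
    The table and column names are hypothetical.
    """
    volume, page, case_id = split_case_url(url)
    sql = (
        "INSERT INTO opinions "
        "(volume, page, case_id, url, title, opinion_html, opinion_text) "
        "VALUES (%s, %s, %s, %s, %s, %s, %s)"
    )
    # The driver escapes the parameters; never build the SQL with string formatting.
    with conn.cursor() as cursor:
        cursor.execute(
            sql, (volume, page, case_id, url, title, opinion_html, opinion_text)
        )
    # Without commit() the row is not persisted.
    conn.commit()
```

With the db connection from the script this would be called as save_opinion(db, allURL, allTitle[0].get_text(), str(allOpinion[0]), allOpinion[0].get_text()), so each run appends one row.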
My long-term goal: I have a couple of folders of these opinions that I would like to scrape and store properly with these variables. How can I call html = urlopen() on each entry of a link list rather than on a single URL? I am guessing that at the end of this script I will want a loop that moves on to the next court opinion.
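For the link list, this is roughly the shape I imagine, again only a sketch. It assumes the links live in a plain text file with one URL per line; the file layout and the function names are hypothetical. A page that fails to download is reported and skipped, so one bad link does not stop the whole run.

```python
from urllib.request import urlopen


def read_url_list(path):
    """Return the URLs from a text file, one per line.

    Blank lines and lines starting with # are ignored.
    """
    with open(path, encoding="utf-8") as fp:
        return [
            line.strip()
            for line in fp
            if line.strip() and not line.lstrip().startswith("#")
        ]


def fetch_html(url):
    """Download one page and return its raw bytes."""
    with urlopen(url) as response:
        return response.read()


def scrape_all(urls, fetch=fetch_html):
    """Yield (url, html) for every URL that downloads successfully.

    fetch can be swapped out, for example to read local files instead.
    """
    for url in urls:
        try:
            yield url, fetch(url)
        except (OSError, ValueError) as exc:
            # urlopen raises URLError/HTTPError (both OSError subclasses)
            # on network problems and ValueError on a malformed URL.
            print(f"skipping {url}: {exc}")
```

Each (url, html) pair can then go through the same steps as the single-URL script: build the BeautifulSoup object from html, pull out the element with id="opinion" and the title, and write the row.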
Thanks for any help!
