Jun-16-2020, 08:01 PM
I am using the code below to scrape a website and check whether it runs WordPress by looking for "/wp-". It partially works, but not entirely.
The problem is that when it counts occurrences of /wp-, it reports far too many results on every site I check. If I manually inspect https://arstechnica.com/ and search for /wp- with Ctrl+F, I get around 46 results.
If I use the code, it reports 922 results.
Is there a way to stop it from returning so many results?
Also, is there a way to return only the first occurrence of /wp-?
I would like to use both approaches in future code.
Thank you very much for your help and for any advice you might have on how to fix this!
#!/usr/bin/python3
import requests
from bs4 import BeautifulSoup


def count_words(url, the_word):
    # Fetch the page without following redirects
    r = requests.get(url, allow_redirects=False)
    soup = BeautifulSoup(r.content, 'lxml')
    # find() returns only the first text node that contains the_word
    words = soup.find(string=lambda text: text and the_word in text)
    print(words)
    # len() of that single text node is its length in characters
    return len(words)


def main():
    url = 'https://arstechnica.com/'
    word = '/wp-'
    count = count_words(url, word)
    print('\nUrl: {}\ncontains {} occurrences of word: {}'.format(url, count, word))


if __name__ == '__main__':
    main()
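
One possible direction, as a rough sketch rather than a definitive fix: since `soup.find(...)` hands back a single text node, `len(words)` is most likely the number of characters in that node and not a count of matches, which could explain a figure like 922. Counting the substring directly in the raw HTML avoids that. The helper names `fetch_html`, `count_occurrences`, and `first_occurrence` below are made up for illustration, and the sketch uses only the standard library instead of `requests` and `bs4`. The number it reports may still differ from a browser Ctrl+F, because the browser searches the page after scripts have run.

```python
import urllib.request


def fetch_html(url):
    """Download a page and return its HTML as text (hypothetical helper)."""
    # Some sites reject requests without a browser-like User-Agent (assumption).
    req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(req, timeout=10) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")


def count_occurrences(html, needle):
    """Count non-overlapping occurrences of needle in the raw HTML."""
    return html.count(needle)


def first_occurrence(html, needle, context=20):
    """Return the first occurrence with some surrounding text, or None."""
    pos = html.find(needle)
    if pos == -1:
        return None
    start = max(0, pos - context)
    return html[start:pos + len(needle) + context]


# Small offline sample so the sketch runs without network access.
sample = '<link href="/wp-content/a.css"><script src="/wp-includes/b.js"></script>'
print(count_occurrences(sample, "/wp-"))
print(first_occurrence(sample, "/wp-", context=8))
```

To try it on a live page, something like `html = fetch_html('https://arstechnica.com/')` followed by `count_occurrences(html, '/wp-')` should work, assuming the site allows the request.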
