Oct-22-2023, 07:07 AM
Hello all,
I'm trying to use the data in a spreadsheet as two variables to iterate through a test web scraper script using pandas, but I'm a little stumped as to how to read two columns into two variables on each iteration. For example, the first loop should use A1 and B1, the next iteration A2 and B2, then A3 and B3, and so on.
Here is my code:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import openpyxl

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'}
heading_type = []
heading = []
keyword1 = []
url1 = []
data = {'keyword': keyword1, 'Url': url1}
wb = openpyxl.load_workbook('D:/Share/Documents/importurl.xlsx')
ws = wb['Sheet1']
for cell in ws['A']:
    print(cell.value)
    url = cell.value
    url1.append(url)
    # r = requests.get(url, headers=headers)
    # soup = BeautifulSoup(r.text, features="html.parser")
    for cell in ws['B']:
        keyword = cell.value
        print(keyword)
        keyword1.append(keyword)
df = pd.DataFrame(data=data)
df.index += 1
df.to_excel(f"D:/Share/Documents/summary.xlsx")

I get the error:
Error:
Traceback (most recent call last):
  File "D:\Share\Documents\PycharmProjects\websitelearning\main.py", line 103, in <module>
    df = pd.DataFrame(data=data)
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\me\AppData\Local\Programs\Python\Python311\Lib\site-packages\pandas\core\frame.py", line 709, in __init__
    mgr = dict_to_mgr(data, index, columns, dtype=dtype, copy=copy, typ=manager)
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\me\AppData\Local\Programs\Python\Python311\Lib\site-packages\pandas\core\internals\construction.py", line 481, in dict_to_mgr
    return arrays_to_mgr(arrays, columns, index, dtype=dtype, typ=typ, consolidate=copy)
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\me\AppData\Local\Programs\Python\Python311\Lib\site-packages\pandas\core\internals\construction.py", line 115, in arrays_to_mgr
    index = _extract_index(arrays)
    ^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\me\AppData\Local\Programs\Python\Python311\Lib\site-packages\pandas\core\internals\construction.py", line 655, in _extract_index
    raise ValueError("All arrays must be of the same length")
ValueError: All arrays must be of the same length

I've attached the file here, so hopefully it is clear. Happy to clarify further if what I'm trying to achieve is still not clear. I guess I need to run one loop that queries the contents of both column A and column B at the same time and then moves on to the next row, but I'm not sure how to do this. Thank you for your time.
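The single-loop idea might look something like the minimal sketch below. It is only an illustration under some assumptions: `collect_pairs` is a hypothetical helper name, and `sample_rows` is made-up data standing in for the worksheet so the snippet runs without the Excel file. With openpyxl, the same row tuples should be available from `ws.iter_rows(min_col=1, max_col=2, values_only=True)`, which yields one (column A, column B) tuple per row.

```python
def collect_pairs(rows):
    """Build equal-length keyword/Url lists from (url, keyword) row tuples.

    Hypothetical helper: each item in `rows` is one spreadsheet row,
    with column A (the URL) first and column B (the keyword) second.
    """
    urls, keywords = [], []
    for url, keyword in rows:
        if url is None and keyword is None:
            continue  # skip rows where both cells are empty
        # Both lists grow by exactly one item per row, so they stay aligned.
        urls.append(url)
        keywords.append(keyword)
    return {'keyword': keywords, 'Url': urls}


# Made-up stand-in for:
#   ws.iter_rows(min_col=1, max_col=2, values_only=True)
sample_rows = [
    ('https://example.com/a', 'alpha'),
    ('https://example.com/b', 'beta'),
    (None, None),
]

data = collect_pairs(sample_rows)
print(data)
```

Because each pass through the loop handles one row and appends to both lists together, the two lists always end up the same length, which is what `pd.DataFrame(data=data)` requires.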
