How to run bulk queries on huge data in Python by using chunks
Jan. 27, 2023, noon
How to run bulk queries in Python by using chunks
import time
from itertools import islice

qry = """insert into demo (id,score,marks)
values(?,?,?);"""

def time_took_decorator(func):
    def wrapper(*args, **kwargs):
        start = time.time()
        result = func(*args, **kwargs)
        end = time.time()
        time_taken = time.strftime('%H:%M:%S', time.gmtime(end - start))
        print('{} time taken '.format(func.__name__) + time_taken)
        return result
    return wrapper

@time_took_decorator
def sql_con(x):
    # stand-in for a real database call
    print('sql connection here')
    print(x)
    print('sql con end')
    return 'success'

def bulk_query_length_checker(bulk_query):
    # split an oversized batch into chunks of at most 500 statements
    try:
        statements = bulk_query.strip().split(';')
        iterator = iter(statements)
        while chunk := list(islice(iterator, 500)):
            temp_str = ';'.join(chunk)
            if not temp_str.endswith(';'):
                temp_str += ';'
            sql_con(temp_str)
    except Exception as e:
        print('something went wrong', e)

def new_execute_many(info, qry):
    # inline each tuple of values into the query template
    bulk_qry = ''
    qry = qry.replace('?', '{}')
    for tuple_data in info:
        bulk_qry += qry.format(*tuple_data)
    if len(bulk_qry) > 5000:
        bulk_query_length_checker(bulk_qry)
    else:
        sql_con(bulk_qry)

def new_chunker(tuple_data, chunk_size=100):
    # feed the data to new_execute_many() one chunk at a time
    iterator = iter(tuple_data)
    while chunk := list(islice(iterator, chunk_size)):
        new_execute_many(chunk, qry)

user_marks = [('a',12,1.1),('b',456,1.167),('c',1290,6.90),('d',666,1.1),
              ('e',912,13.1),('f',178,89.1),('g',12,-1.1),('h',15,-891.1),
              ('i',12,1.1),('j',1222,1.1)] * 1000
new_chunker(user_marks, 10)
output:
sql connection here
insert into demo (id,score,marks)
values(a,12,1.1);insert into demo (id,score,marks)
values(b,456,1.167);insert into demo (id,score,marks)
values(c,1290,6.9);insert into demo (id,score,marks)
values(d,666,1.1);insert into demo (id,score,marks)
values(e,912,13.1);insert into demo (id,score,marks)
values(f,178,89.1);insert into demo (id,score,marks)
values(g,12,-1.1);insert into demo (id,score,marks)
values(h,15,-891.1);insert into demo (id,score,marks)
values(i,12,1.1);insert into demo (id,score,marks)
values(j,1222,1.1);
sql con end
sql_con time taken 00:00:00
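A note on safety: formatting values straight into SQL text, as above, is fine for a demo, but it breaks on quoted strings and invites SQL injection. Real drivers can do the same chunked bulk insert with parameterized queries via executemany(). A minimal sketch using the standard-library sqlite3 module (the in-memory database and demo table are assumptions for illustration):
import sqlite3
from itertools import islice

conn = sqlite3.connect(':memory:')
conn.execute('create table demo (id text, score integer, marks real)')

safe_qry = 'insert into demo (id, score, marks) values (?, ?, ?);'
iterator = iter(user_marks)
while chunk := list(islice(iterator, 500)):
    conn.executemany(safe_qry, chunk)  # the driver quotes and escapes values
conn.commit()
print(conn.execute('select count(*) from demo').fetchone())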
Colored logs for python
Jan. 27, 2023, 11:10 a.m.
Colored Logs for Python
Printing colored messages to the console:
Logging is an essential part of developing applications: it leads to quicker debugging and a broader understanding of what’s going on in your app. In this short article we’re going to make our console logs a bit clearer by adding some color. Let’s code!
Preparations
We first need to install a package
pip install coloredlogs
Also make sure you understand the basics of logging and how to add a handler to your logger.
Step 1: create a logger
Nothing new here: we just create a logger
import sys
import logging
import coloredlogs

logging.basicConfig()
logger = logging.getLogger(name='mylogger')
We use the getLogger method so that we don’t use the root logger.
Step 2: coloredlogs
Here we install coloredlogs on our logger.
coloredlogs.install(logger=logger)
logger.propagate = False
The last line makes sure that coloredlogs doesn’t pass our log events to the root logger. This prevents every event from being logged twice.
Step 3: creating a colored formatter
We want to add some style to our console outputs. We’ll define that here
coloredFormatter = coloredlogs.ColoredFormatter(
    fmt='[%(name)s] %(asctime)s %(funcName)s %(lineno)-3d %(message)s',
    level_styles=dict(
        debug=dict(color='white'),
        info=dict(color='blue'),
        warning=dict(color='yellow', bright=True),
        error=dict(color='red', bold=True, bright=True),
        critical=dict(color='black', bold=True, background='red'),
    ),
    field_styles=dict(
        name=dict(color='white'),
        asctime=dict(color='white'),
        funcName=dict(color='white'),
        lineno=dict(color='white'),
    ),
)
Here you can add the style of the levels and all fields in the format.
Step 4: Create a colored stream handler
We’ll use a stream handler to print to our console. We’ll add the colored formatter to the handler so that it gets styled accordingly.
ch = logging.StreamHandler(stream=sys.stdout)
ch.setFormatter(fmt=coloredFormatter)
logger.addHandler(hdlr=ch)
logger.setLevel(level=logging.DEBUG)
Step 5: Log and result!
Let’s put our logger to the test!
logger.debug(msg="this is a debug message")
logger.info(msg="this is an info message")
logger.warning(msg="this is a warning message")
logger.error(msg="this is an error message")
logger.critical(msg="this is a critical message")
Python logging — saving logs to a file & sending logs to an api
Jan. 27, 2023, 10:56 a.m.
Including logging into your Python app is essential for knowing what your program does and quick debugging. It allows you to solve an error as quickly and easily as possible. Your program can be logging out useful information but how can it notify you when something goes wrong? We cannot read the logs off the terminal in a crashed app!
This article shows you two ways of saving your logs using handlers. The first is the simplest; just writing all logs to a file using the FileHandler. The second uses a HttpHandler to send your logs to an HTTP endpoint (like an API).
Let’s first understand the concept of using handlers when logging. As you might know, you can log a message by creating a logger and then calling one of the logging methods on that logger like below:
import logging
logging.basicConfig(level=logging.DEBUG, format=f"%(levelname)-8s: \t %(filename)s %(funcName)s %(lineno)s - %(message)s")
logger = logging.getLogger("mylogger")
logger.debug("debugging something")
logger.info("some message")
logger.error("something went wrong")
When we call the debug, info, or error methods in the example above, our logger has to handle those logging calls. By default, it just prints the messages (with some metadata as specified by the format in basicConfig) to the console.
In the parts below we add more handlers to the logger that do other things with the logs. Per handler, we can specify the level, fields, and formats as we’ll see below.
Code examples — implementing the handlers
In this article we’ll make it our goal to add three handlers to our logger, each with its own format:
- stream handler
For printing to the console. We want to print all logs (debug and up)
- file handler
saves logs in a file. We want to save all logs except debug logs
- HTTP handler
sends logs over HTTP (to an API for example). We want to send only error and critical logs
All of these loggers will be configured separately; they will have their own format and level.
1: Setting up our logger
First, we’ll create our logger, nice and simple:
logger = logging.getLogger("test")
logger.setLevel(level=logging.DEBUG)
Notice that we don’t do anything with basicConfig anymore. We set the default level with the setLevel method; next we’re going to specify the formatting and level for each handler separately.
2. Adding a stream handler
We’ll configure our stream handler to send the log to our console, printing it out:
import sys

logStreamFormatter = logging.Formatter(
    fmt=f"%(levelname)-8s %(asctime)s \t %(filename)s @function %(funcName)s line %(lineno)s - %(message)s",
    datefmt="%H:%M:%S"
)
consoleHandler = logging.StreamHandler(stream=sys.stdout)
consoleHandler.setFormatter(logStreamFormatter)
consoleHandler.setLevel(level=logging.DEBUG)
logger.addHandler(consoleHandler)
Notice that we first create a Formatter specifically for stream handler. Then we’ll define the actual StreamHandler, specifying that we want to output to sys.stdout (console) and then we set the formatter to the handler. Then we add the handler to the logging object.
The result: We’ve successfully printed the log to our console!
3. Adding a file handler
The steps are exactly the same as with the stream handler. The differences are that we specify another format and a different level.
logFileFormatter = logging.Formatter(
    fmt=f"%(levelname)s %(asctime)s (%(relativeCreated)d) \t %(pathname)s F%(funcName)s L%(lineno)s - %(message)s",
    datefmt="%Y-%m-%d %H:%M:%S",
)
fileHandler = logging.FileHandler(filename='test.log')
fileHandler.setFormatter(logFileFormatter)
fileHandler.setLevel(level=logging.INFO)
logger.addHandler(fileHandler)
Opening the test.log file shows us that we haven’t just written a file, we can also clearly see that the log has a different format from the streamHandler. Also notice that we do not save the debug logs to the file. Exactly like we wanted.
Once you’ve used the filehandler for a while your logging file gets quite large.
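If that becomes a problem, the standard library ships RotatingFileHandler, a drop-in replacement for FileHandler that caps the file size. A minimal sketch, reusing the logFileFormatter from above:
from logging import handlers

# keep at most 5 files of ~1 MB each; the oldest logs are discarded
rotatingHandler = handlers.RotatingFileHandler(
    filename='test.log', maxBytes=1_000_000, backupCount=5
)
rotatingHandler.setFormatter(logFileFormatter)
rotatingHandler.setLevel(level=logging.INFO)
logger.addHandler(rotatingHandler)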
4. Adding an HTTP Handler
The steps are not much different from the previous one:
from logging import handlers
logHttpFormatter = logging.Formatter(
    fmt=f"%(levelname)-8s %(asctime)s \t %(filename)s @function %(funcName)s line %(lineno)s - %(message)s",
    datefmt="%Y-%m-%d %H:%M:%S"
)
httpHandler = logging.handlers.HTTPHandler(host='somehost', url='/logs', method='POST')
httpHandler.setFormatter(logHttpFormatter)
httpHandler.setLevel(level=logging.ERROR)  # send only error and critical logs
logger.addHandler(httpHandler)
How to detect table names from text in Python
Jan. 27, 2023, 10:25 a.m.
How to detect table names from text in Python
import re

sql_txt = """create view `kwikl3arn`.`topics` as select names,date_created
from `kwikl3arn`.`tutorials`
inner join `kwikl3arn`.`category` on `kwikl3arn`.`category.id`=`kwikl3arn`.`tutorials.id`
order by date asc;
"""

sub1 = "from"
sub2 = "` "
idx1 = sql_txt.index(sub1)
print(idx1)
try:
    idx2 = sql_txt.index(sub2)
except ValueError:
    idx2 = sql_txt.index(';')
print(idx2)

result = []
search = ['FROM', 'JOIN', 'from', 'join']
for s in search:
    for i in re.finditer(s, sql_txt):
        print(i.start())
        # grab a slice starting at the keyword and clean it up
        result.append(sql_txt[i.start():i.start() + idx2]
                      .replace('`', '').replace('\nON', '')
                      .replace('(', '').replace(';', ''))

final_result = []
for t in result:
    # the table name is the first word after the keyword
    words = re.findall(r'(?<=\s)\S*', str(t))
    final_result.append(words[0])

print('Total tables found', len(set(final_result)))
print(set(final_result))
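As a rough alternative, a single regular expression can pull the identifier that follows each FROM or JOIN keyword directly. A sketch (it assumes plain or backtick-quoted table names and won't handle subqueries):
import re

tables = {
    m.replace('`', '')
    for m in re.findall(r'(?:from|join)\s+([`\w.]+)', sql_txt, flags=re.IGNORECASE)
}
print('Total tables found', len(tables))
print(tables)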
How to use natural language processing with spacy in python
Jan. 27, 2023, 9:35 a.m.
How to use Natural Language Processing with spaCy in Python
How to install in windows:
PS> python -m venv venv
PS> ./venv/Scripts/activate
(venv) PS> python -m pip install spacy
How to install in linux
$ python -m venv venv
$ source ./venv/bin/activate
(venv) $ python -m pip install spacy
With spaCy installed in your virtual environment, you’re almost ready to get started with NLP. But there’s one more thing you’ll have to install:
(venv) $ python -m spacy download en_core_web_sm
There are various spaCy models for different languages. The default model for the English language is designated as en_core_web_sm. Since the models are quite large, it’s best to install them separately—including all languages in one package would make the download too massive.
Once the en_core_web_sm model has finished downloading, open up a Python REPL and verify that the installation has been successful:
>>> import spacy
>>> nlp = spacy.load("en_core_web_sm")
If these lines run without any errors, then it means that spaCy was installed and that the models and data were successfully downloaded. You’re now ready to dive into NLP with spaCy!
The Doc Object for Processed Text
In this section, you’ll use spaCy to deconstruct a given input string, and you’ll also read the same text from a file.
First, you need to load the language model instance in spaCy:
>>> import spacy
>>> nlp = spacy.load("en_core_web_sm")
>>> nlp
<spacy.lang.en.English at 0x291003a6bf0>
The load() function returns a Language callable object, which is commonly assigned to a variable called nlp.
To start processing your input, you construct a Doc object. A Doc object is a sequence of Token objects representing a lexical token. Each Token object has information about a particular piece—typically one word—of text. You can instantiate a Doc object by calling the Language object with the input string as an argument:
>>> introduction_doc = nlp(
... "This tutorial is about Natural Language Processing in spaCy."
... )
>>> type(introduction_doc)
spacy.tokens.doc.Doc
>>> [token.text for token in introduction_doc]
['This', 'tutorial', 'is', 'about', 'Natural', 'Language',
'Processing', 'in', 'spaCy', '.']
In the above example, the text is used to instantiate a Doc object. From there, you can access a whole bunch of information about the processed text.
For instance, you iterated over the Doc object with a list comprehension that produces a series of Token objects. On each Token object, you called the .text attribute to get the text contained within that token.
You won’t typically be copying and pasting text directly into the constructor, though. Instead, you’ll likely be reading it from a file:
>>> import pathlib
>>> file_name = "introduction.txt"
>>> introduction_doc = nlp(pathlib.Path(file_name).read_text(encoding="utf-8"))
>>> print ([token.text for token in introduction_doc])
['This', 'tutorial', 'is', 'about', 'Natural', 'Language',
'Processing', 'in', 'spaCy', '.', '\n']
In this example, you read the contents of the introduction.txt file with the .read_text() method of the pathlib.Path object. Since the file contains the same information as the previous example, you’ll get the same result.
Sentence Detection
Sentence detection is the process of locating where sentences start and end in a given text. This allows you to divide a text into linguistically meaningful units. You’ll use these units when you’re processing your text to perform tasks such as part-of-speech (POS) tagging and named-entity recognition, which you’ll come to later in the tutorial.
In spaCy, the .sents property is used to extract sentences from the Doc object. Here’s how you would extract the total number of sentences and the sentences themselves for a given input:
>>> about_text = (
... "Gus Proto is a Python developer currently"
... " working for a London-based Fintech"
... " company. He is interested in learning"
... " Natural Language Processing."
... )
>>> about_doc = nlp(about_text)
>>> sentences = list(about_doc.sents)
>>> len(sentences)
2
>>> for sentence in sentences:
...     print(f"{sentence[:5]}...")
...
Gus Proto is a Python...
He is interested in learning...
In the above example, spaCy is correctly able to identify the input’s sentences. With .sents, you get a list of Span objects representing individual sentences. You can also slice the Span objects to produce sections of a sentence.
You can also customize sentence detection behavior by using custom delimiters. Here’s an example where an ellipsis (...) is used as a delimiter, in addition to the full stop, or period (.):
>>> ellipsis_text = (
... "Gus, can you, ... never mind, I forgot"
... " what I was saying. So, do you think"
... " we should ..."
... )
>>> from spacy.language import Language
>>> @Language.component("set_custom_boundaries")
... def set_custom_boundaries(doc):
...     """Add support to use `...` as a delimiter for sentence detection"""
...     for token in doc[:-1]:
...         if token.text == "...":
...             doc[token.i + 1].is_sent_start = True
...     return doc
...
>>> custom_nlp = spacy.load("en_core_web_sm")
>>> custom_nlp.add_pipe("set_custom_boundaries", before="parser")
>>> custom_ellipsis_doc = custom_nlp(ellipsis_text)
>>> custom_ellipsis_sentences = list(custom_ellipsis_doc.sents)
>>> for sentence in custom_ellipsis_sentences:
...     print(sentence)
...
Gus, can you, ...
never mind, I forgot what I was saying.
So, do you think we should ...
For this example, you used the @Language.component("set_custom_boundaries") decorator to define a new function that takes a Doc object as an argument. The job of this function is to identify tokens in Doc that are the beginning of sentences and mark their .is_sent_start attribute as True. Once done, the function must return the Doc object again.
Then, you can add the custom boundary function to the Language object by using the .add_pipe() method. Parsing text with this modified Language object will now treat the word after an ellipsis as the start of a new sentence.
Tokens in spaCy
Building the Doc container involves tokenizing the text. The process of tokenization breaks a text down into its basic units—or tokens—which are represented in spaCy as Token objects.
As you’ve already seen, with spaCy, you can print the tokens by iterating over the Doc object. But Token objects also have other attributes available for exploration. For instance, the token’s original index position in the string is still available as an attribute on Token:
>>> import spacy
>>> nlp = spacy.load("en_core_web_sm")
>>> about_text = (
... "Gus Proto is a Python developer currently"
... " working for a London-based Fintech"
... " company. He is interested in learning"
... " Natural Language Processing."
... )
>>> about_doc = nlp(about_text)
>>> for token in about_doc:
...     print(token, token.idx)
...
Gus 0
Proto 4
is 10
a 13
Python 15
developer 22
currently 32
working 42
for 50
a 54
London 56
- 62
based 63
Fintech 69
company 77
. 84
He 86
is 89
interested 92
in 103
learning 106
Natural 115
Language 123
Processing 132
. 142
In this example, you iterate over Doc, printing both Token and the .idx attribute, which represents the starting position of the token in the original text. Keeping this information could be useful for in-place word replacement down the line, for example.
spaCy provides various other attributes for the Token class:
>>> print(
...     f"{'Text with Whitespace':22}"
...     f"{'Is Alphanum?':15}"
...     f"{'Is Punctuation?':18}"
...     f"{'Is Stop Word?'}"
... )
>>> for token in about_doc:
...     print(
...         f"{str(token.text_with_ws):22}"
...         f"{str(token.is_alpha):15}"
...         f"{str(token.is_punct):18}"
...         f"{str(token.is_stop)}"
...     )
...
Text with Whitespace Is Alphanum? Is Punctuation? Is Stop Word?
Gus True False False
Proto True False False
is True False True
a True False True
Python True False False
developer True False False
currently True False False
working True False False
for True False True
a True False True
London True False False
- False True False
based True False False
Fintech True False False
company True False False
. False True False
He True False True
is True False True
interested True False False
in True False True
learning True False False
Natural True False False
Language True False False
Processing True False False
. False True False
In this example, you use f-string formatting to output a table accessing some common attributes from each Token in Doc:
- .text_with_ws prints the token text along with any trailing space, if present.
- .is_alpha indicates whether the token consists of alphabetic characters or not.
- .is_punct indicates whether the token is a punctuation symbol or not.
- .is_stop indicates whether the token is a stop word or not. You’ll be covering stop words a bit later in this tutorial.
As with many aspects of spaCy, you can also customize the tokenization process to detect tokens on custom characters. This is often used for hyphenated words such as London-based.
To customize tokenization, you need to update the tokenizer property on the callable Language object with a new Tokenizer object.
To see what’s involved, imagine you had some text that used the @ symbol instead of the usual hyphen (-) as an infix to link words together. So, instead of London-based, you had London@based:
>>> custom_about_text = (
... "Gus Proto is a Python developer currently"
... " working for a [email protected] Fintech"
... " company. He is interested in learning"
... " Natural Language Processing."
... )
>>> print([token.text for token in nlp(custom_about_text)[8:15]])
['for', 'a', 'London@based', 'Fintech', 'company', '.', 'He']
In this example, the default parsing read the London@based text as a single token, but if you used a hyphen instead of the @ symbol, then you’d get three tokens.
To include the @ symbol as a custom infix, you need to build your own Tokenizer object:
>>> import re
>>> from spacy.tokenizer import Tokenizer
>>> custom_nlp = spacy.load("en_core_web_sm")
>>> prefix_re = spacy.util.compile_prefix_regex(
... custom_nlp.Defaults.prefixes
... )
>>> suffix_re = spacy.util.compile_suffix_regex(
... custom_nlp.Defaults.suffixes
... )
>>> custom_infixes = [r"@"]
>>> infix_re = spacy.util.compile_infix_regex(
... list(custom_nlp.Defaults.infixes) + custom_infixes
... )
>>> custom_nlp.tokenizer = Tokenizer(
... nlp.vocab,
... prefix_search=prefix_re.search,
... suffix_search=suffix_re.search,
... infix_finditer=infix_re.finditer,
... token_match=None,
... )
>>> custom_tokenizer_about_doc = custom_nlp(custom_about_text)
>>> print([token.text for token in custom_tokenizer_about_doc[8:15]])
['for', 'a', 'London', '@', 'based', 'Fintech', 'company']
In this example, you first instantiate a new Language object. To build a new Tokenizer, you generally provide it with:
- Vocab: A storage container for special cases, which is used to handle cases like contractions and emoticons.
- prefix_search: A function that handles preceding punctuation, such as opening parentheses.
- suffix_search: A function that handles succeeding punctuation, such as closing parentheses.
- infix_finditer: A function that handles non-whitespace separators, such as hyphens.
- token_match: An optional Boolean function that matches strings that should never be split. It overrides the previous rules and is useful for entities like URLs or numbers.
The functions involved are typically regex functions that you can access from compiled regex objects. To build the regex objects for the prefixes and suffixes—which you don’t want to customize—you can generate them with the defaults, as the prefix_re and suffix_re assignments above do.
To make a custom infix function, first you define a new list (custom_infixes) with any regex patterns that you want to include. Then, you join your custom list with the Language object’s .Defaults.infixes attribute, which needs to be cast to a list before joining. You want to do this to include all the existing infixes. Then you pass the extended list as an argument to spacy.util.compile_infix_regex() to obtain your new regex object for infixes.
When you call the Tokenizer constructor, you pass the .search() method on the prefix and suffix regex objects, and the .finditer() function on the infix regex object. Now you can replace the tokenizer on the custom_nlp object.
After that’s done, you’ll see that the @ symbol is now tokenized separately.
Stop Words
Stop words are typically defined as the most common words in a language. In the English language, some examples of stop words are the, are, but, and they. Most sentences need to contain stop words in order to be full sentences that make grammatical sense.
With NLP, stop words are generally removed because they aren’t significant, and they heavily distort any word frequency analysis. spaCy stores a list of stop words for the English language:
>>> import spacy
>>> spacy_stopwords = spacy.lang.en.stop_words.STOP_WORDS
>>> len(spacy_stopwords)
326
>>> for stop_word in list(spacy_stopwords)[:10]:
...     print(stop_word)
...
using
becomes
had
itself
once
often
is
herein
who
too
In this example, you’ve examined the STOP_WORDS list from spacy.lang.en.stop_words. You don’t need to access this list directly, though. You can remove stop words from the input text by making use of the .is_stop attribute of each token:
>>> custom_about_text = (
... "Gus Proto is a Python developer currently"
... " working for a London-based Fintech"
... " company. He is interested in learning"
... " Natural Language Processing."
... )
>>> nlp = spacy.load("en_core_web_sm")
>>> about_doc = nlp(custom_about_text)
>>> print([token for token in about_doc if not token.is_stop])
[Gus, Proto, Python, developer, currently, working, London, -, based, Fintech,
company, ., interested, learning, Natural, Language, Processing, .]
Here you use a list comprehension with a conditional expression to produce a list of all the words that are not stop words in the text.
While you can’t be sure exactly what the sentence is trying to say without stop words, you still have a lot of information about what it’s generally about.
Lemmatization
Lemmatization is the process of reducing inflected forms of a word while still ensuring that the reduced form belongs to the language. This reduced form, or root word, is called a lemma.
For example, organizes, organized and organizing are all forms of organize. Here, organize is the lemma. The inflection of a word allows you to express different grammatical categories, like tense (organized vs organize), number (trains vs train), and so on. Lemmatization is necessary because it helps you reduce the inflected forms of a word so that they can be analyzed as a single item. It can also help you normalize the text.
spaCy puts a lemma_ attribute on the Token class. This attribute has the lemmatized form of the token:
>>> import spacy
>>> nlp = spacy.load("en_core_web_sm")
>>> conference_help_text = (
... "Gus is helping organize a developer"
... " conference on Applications of Natural Language"
... " Processing. He keeps organizing local Python meetups"
... " and several internal talks at his workplace."
... )
>>> conference_help_doc = nlp(conference_help_text)
>>> for token in conference_help_doc:
...     if str(token) != str(token.lemma_):
...         print(f"{str(token):>20} : {str(token.lemma_)}")
...
is : be
He : he
keeps : keep
organizing : organize
meetups : meetup
talks : talk
In this example, you check to see if the original word is different from the lemma, and if it is, you print both the original word and its lemma.
You’ll note, for instance, that organizing reduces to its lemma form, organize. If you don’t lemmatize the text, then organize and organizing will be counted as different tokens, even though they both refer to the same concept. Lemmatization helps you avoid duplicate words that may overlap conceptually.
Word Frequency
You can now convert a given text into tokens and perform statistical analysis on it. This analysis can give you various insights, such as common words or unique words in the text:
>>> import spacy
>>> from collections import Counter
>>> nlp = spacy.load("en_core_web_sm")
>>> complete_text = (
... "Gus Proto is a Python developer currently"
... " working for a London-based Fintech company. He is"
... " interested in learning Natural Language Processing."
... " There is a developer conference happening on 21 July"
... ' 2019 in London. It is titled "Applications of Natural'
... ' Language Processing". There is a helpline number'
... " available at +44-1234567891. Gus is helping organize it."
... " He keeps organizing local Python meetups and several"
... " internal talks at his workplace. Gus is also presenting"
... ' a talk. The talk will introduce the reader about "Use'
... ' cases of Natural Language Processing in Fintech".'
... " Apart from his work, he is very passionate about music."
... " Gus is learning to play the Piano. He has enrolled"
... " himself in the weekend batch of Great Piano Academy."
... " Great Piano Academy is situated in Mayfair or the City"
... " of London and has world-class piano instructors."
... )
>>> complete_doc = nlp(complete_text)
>>> words = [
... token.text
... for token in complete_doc
... if not token.is_stop and not token.is_punct
... ]
>>> print(Counter(words).most_common(5))
[('Gus', 4), ('London', 3), ('Natural', 3), ('Language', 3), ('Processing', 3)]
By looking just at the common words, you can probably assume that the text is about Gus, London, and Natural Language Processing. That’s a significant finding! If you can just look at the most common words, that may save you a lot of reading, because you can immediately tell if the text is about something that interests you or not.
That’s not to say this process is guaranteed to give you good results. You are losing some information along the way, after all.
That said, to illustrate why removing stop words can be useful, here’s another example of the same text including stop words:
>>> Counter(
... [token.text for token in complete_doc if not token.is_punct]
... ).most_common(5)
[('is', 10), ('a', 5), ('in', 5), ('Gus', 4), ('of', 4)]
Four out of five of the most common words are stop words that don’t really tell you much about the summarized text. This is why stop words are often considered noise for many applications.
Part-of-Speech Tagging
Part of speech or POS is a grammatical role that explains how a particular word is used in a sentence. There are typically eight parts of speech:
- Noun
- Pronoun
- Adjective
- Verb
- Adverb
- Preposition
- Conjunction
- Interjection
Part-of-speech tagging is the process of assigning a POS tag to each token depending on its usage in the sentence. POS tags are useful for assigning a syntactic category like noun or verb to each word.
In spaCy, POS tags are available as an attribute on the Token object:
>>> import spacy
>>> nlp = spacy.load("en_core_web_sm")
>>> about_text = (
... "Gus Proto is a Python developer currently"
... " working for a London-based Fintech"
... " company. He is interested in learning"
... " Natural Language Processing."
... )
>>> about_doc = nlp(about_text)
>>> for token in about_doc:
...     print(
...         f"""
... TOKEN: {str(token)}
... =====
... TAG: {str(token.tag_):10} POS: {token.pos_}
... EXPLANATION: {spacy.explain(token.tag_)}"""
...     )
...
TOKEN: Gus
=====
TAG: NNP POS: PROPN
EXPLANATION: noun, proper singular
TOKEN: Proto
=====
TAG: NNP POS: PROPN
EXPLANATION: noun, proper singular
TOKEN: is
=====
TAG: VBZ POS: AUX
EXPLANATION: verb, 3rd person singular present
TOKEN: a
=====
TAG: DT POS: DET
EXPLANATION: determiner
TOKEN: Python
=====
TAG: NNP POS: PROPN
EXPLANATION: noun, proper singular
...
Here, two attributes of the Token class are accessed and printed using f-strings:
- .tag_ displays a fine-grained tag.
- .pos_ displays a coarse-grained tag, which is a reduced version of the fine-grained tags.
You also use spacy.explain() to give descriptive details about a particular POS tag, which can be a valuable reference tool.
By using POS tags, you can extract a particular category of words:
>>> nouns = []
>>> adjectives = []
>>> for token in about_doc:
...     if token.pos_ == "NOUN":
...         nouns.append(token)
...     if token.pos_ == "ADJ":
...         adjectives.append(token)
...
>>> nouns
[developer, company]
>>> adjectives
[interested]
You can use this type of word classification to derive insights. For instance, you could gauge sentiment by analyzing which adjectives are most commonly used alongside nouns.
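As a rough sketch of that idea, you can pair each adjective with the noun it modifies through the token’s .head attribute (the sentence, the amod dependency filter, and the output shown here are illustrative assumptions, not part of the tutorial text):
>>> review_doc = nlp("The friendly staff served delicious food in a noisy room.")
>>> print([
...     (token.text, token.head.text)
...     for token in review_doc
...     if token.pos_ == "ADJ" and token.dep_ == "amod"
... ])
[('friendly', 'staff'), ('delicious', 'food'), ('noisy', 'room')]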
Visualization: Using displaCy
spaCy comes with a built-in visualizer called displaCy. You can use it to visualize a dependency parse or named entities in a browser or a Jupyter notebook.
You can use displaCy to find POS tags for tokens:
>>> import spacy
>>> from spacy import displacy
>>> nlp = spacy.load("en_core_web_sm")
>>> about_interest_text = (
... "He is interested in learning Natural Language Processing."
... )
>>> about_interest_doc = nlp(about_interest_text)
>>> displacy.serve(about_interest_doc, style="dep")
The above code will spin up a simple web server. You can then see the visualization by going to http://127.0.0.1:5000 in your browser:
In the image above, each token is assigned a POS tag written just below the token.
You can also use displaCy in a Jupyter notebook:
In [1]: displacy.render(about_interest_doc, style="dep", jupyter=True)
Have a go at playing around with different texts to see how spaCy deconstructs sentences. Also, take a look at some of the displaCy options available for customizing the visualization.
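For instance, the dependency visualizer accepts an options dictionary; compact, bg, color, and font are documented displaCy settings (the values below are just examples):
>>> options = {"compact": True, "bg": "#09a3d5", "color": "white", "font": "Source Sans Pro"}
>>> displacy.serve(about_interest_doc, style="dep", options=options)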
Preprocessing Functions
To bring your text into a format ideal for analysis, you can write preprocessing functions to encapsulate your cleaning process. For example, in this section, you’ll create a preprocessor that applies the following operations:
- Lowercases the text
- Lemmatizes each token
- Removes punctuation symbols
- Removes stop words
A preprocessing function converts text to an analyzable format. It’s typical for most NLP tasks. Here’s an example:
>>> import spacy
>>> nlp = spacy.load("en_core_web_sm")
>>> complete_text = (
... "Gus Proto is a Python developer currently"
... " working for a London-based Fintech company. He is"
... " interested in learning Natural Language Processing."
... " There is a developer conference happening on 21 July"
... ' 2019 in London. It is titled "Applications of Natural'
... ' Language Processing". There is a helpline number'
... " available at +44-1234567891. Gus is helping organize it."
... " He keeps organizing local Python meetups and several"
... " internal talks at his workplace. Gus is also presenting"
... ' a talk. The talk will introduce the reader about "Use'
... ' cases of Natural Language Processing in Fintech".'
... " Apart from his work, he is very passionate about music."
... " Gus is learning to play the Piano. He has enrolled"
... " himself in the weekend batch of Great Piano Academy."
... " Great Piano Academy is situated in Mayfair or the City"
... " of London and has world-class piano instructors."
... )
>>> complete_doc = nlp(complete_text)
>>> def is_token_allowed(token):
...     return bool(
...         token
...         and str(token).strip()
...         and not token.is_stop
...         and not token.is_punct
...     )
...
>>> def preprocess_token(token):
...     return token.lemma_.strip().lower()
...
>>> complete_filtered_tokens = [
... preprocess_token(token)
... for token in complete_doc
... if is_token_allowed(token)
... ]
>>> complete_filtered_tokens
['gus', 'proto', 'python', 'developer', 'currently', 'work',
'london', 'base', 'fintech', 'company', 'interested', 'learn',
'natural', 'language', 'processing', 'developer', 'conference',
'happen', '21', 'july', '2019', 'london', 'title',
'applications', 'natural', 'language', 'processing', 'helpline',
'number', 'available', '+44', '1234567891', 'gus', 'help',
'organize', 'keep', 'organize', 'local', 'python', 'meetup',
'internal', 'talk', 'workplace', 'gus', 'present', 'talk', 'talk',
'introduce', 'reader', 'use', 'case', 'natural', 'language',
'processing', 'fintech', 'apart', 'work', 'passionate', 'music',
'gus', 'learn', 'play', 'piano', 'enrol', 'weekend', 'batch',
'great', 'piano', 'academy', 'great', 'piano', 'academy',
'situate', 'mayfair', 'city', 'london', 'world', 'class',
'piano', 'instructor']
Note that complete_filtered_tokens doesn’t contain any stop words or punctuation symbols, and it consists purely of lemmatized lowercase tokens.
Rule-Based Matching Using spaCy
Rule-based matching is one of the steps in extracting information from unstructured text. It’s used to identify and extract tokens and phrases according to patterns (such as lowercase) and grammatical features (such as part of speech).
While you can use regular expressions to extract entities (such as phone numbers), rule-based matching in spaCy is more powerful than regex alone, because you can include semantic or grammatical filters.
For example, with rule-based matching, you can extract a first name and a last name, which are always proper nouns:
>>> import spacy
>>> nlp = spacy.load("en_core_web_sm")
>>> about_text = (
... "Gus Proto is a Python developer currently"
... " working for a London-based Fintech"
... " company. He is interested in learning"
... " Natural Language Processing."
... )
>>> about_doc = nlp(about_text)
>>> from spacy.matcher import Matcher
>>> matcher = Matcher(nlp.vocab)
>>> def extract_full_name(nlp_doc):
...     pattern = [{"POS": "PROPN"}, {"POS": "PROPN"}]
...     matcher.add("FULL_NAME", [pattern])
...     matches = matcher(nlp_doc)
...     for _, start, end in matches:
...         span = nlp_doc[start:end]
...         yield span.text
...
>>> next(extract_full_name(about_doc))
'Gus Proto'
In this example, pattern is a list of objects that defines the combination of tokens to be matched. Both POS tags in it are PROPN (proper noun). So, the pattern consists of two objects in which the POS tags for both tokens should be PROPN. This pattern is then added to Matcher with the .add() method, which takes a key identifier and a list of patterns. Finally, matches are obtained with their starting and end indexes.
You can also use rule-based matching to extract phone numbers:
>>> conference_org_text = ("There is a developer conference"
... " happening on 21 July 2019 in London. It is titled"
... ' "Applications of Natural Language Processing".'
... " There is a helpline number available"
... " at (123) 456-7891")
...
>>> def extract_phone_number(nlp_doc):
...     pattern = [
...         {"ORTH": "("},
...         {"SHAPE": "ddd"},
...         {"ORTH": ")"},
...         {"SHAPE": "ddd"},
...         {"ORTH": "-", "OP": "?"},
...         {"SHAPE": "dddd"},
...     ]
...     matcher.add("PHONE_NUMBER", [pattern])
...     matches = matcher(nlp_doc)
...     for match_id, start, end in matches:
...         span = nlp_doc[start:end]
...         return span.text
...
>>> conference_org_doc = nlp(conference_org_text)
>>> extract_phone_number(conference_org_doc)
'(123) 456-7891'
In this example, the pattern is updated in order to match phone numbers. Here, some attributes of the token are also used:
- ORTH matches the exact text of the token.
- SHAPE transforms the token string to show orthographic features, with d standing for digit.
- OP defines operators. Using ? as a value means that the pattern is optional, meaning it can match 0 or 1 times.
Chaining together these dictionaries gives you a lot of flexibility to choose your matching criteria.
Note: For simplicity, in the example, phone numbers are assumed to be of a particular format: (123) 456-7891. You can change this depending on your use case.
Again, rule-based matching helps you identify and extract tokens and phrases by matching according to lexical patterns and grammatical features. This can be useful when you’re looking for a particular entity.
Dependency Parsing Using spaCy
Dependency parsing is the process of extracting the dependency graph of a sentence to represent its grammatical structure. It defines the dependency relationship between headwords and their dependents. The head of a sentence has no dependency and is called the root of the sentence. The verb is usually the root of the sentence. All other words are linked to the headword.
The dependencies can be mapped in a directed graph representation where:
- Words are the nodes.
- Grammatical relationships are the edges.
Dependency parsing helps you know what role a word plays in the text and how different words relate to each other.
Here’s how you can use dependency parsing to find the relationships between words:
>>> import spacy
>>> nlp = spacy.load("en_core_web_sm")
>>> piano_text = "Gus is learning piano"
>>> piano_doc = nlp(piano_text)
>>> for token in piano_doc:
...     print(
...         f"""
... TOKEN: {token.text}
... =====
... {token.tag_ = }
... {token.head.text = }
... {token.dep_ = }"""
...     )
...
TOKEN: Gus
=====
token.tag_ = 'NNP'
token.head.text = 'learning'
token.dep_ = 'nsubj'
TOKEN: is
=====
token.tag_ = 'VBZ'
token.head.text = 'learning'
token.dep_ = 'aux'
TOKEN: learning
=====
token.tag_ = 'VBG'
token.head.text = 'learning'
token.dep_ = 'ROOT'
TOKEN: piano
=====
token.tag_ = 'NN'
token.head.text = 'learning'
token.dep_ = 'dobj'
In this example, the sentence contains three relationships:
- nsubj is the subject of the word, and its headword is a verb.
- aux is an auxiliary word, and its headword is a verb.
- dobj is the direct object of the verb, and its headword is also a verb.
The list of relationships isn’t particular to spaCy. Rather, it’s an evolving field of linguistics research.
You can also use displaCy to visualize the dependency tree of the sentence:
>>> displacy.serve(piano_doc, style="dep")
This code will produce a visualization that you can access by opening http://127.0.0.1:5000 in your browser:
This image shows you visually that the subject of the sentence is the proper noun Gus and that it has a learn relationship with piano.
Tree and Subtree Navigation
The dependency graph has all the properties of a tree. This tree contains information about sentence structure and grammar and can be traversed in different ways to extract relationships.
spaCy provides attributes like .children, .lefts, .rights, and .subtree to make navigating the parse tree easier. Here are a few examples of using those attributes:
>>> import spacy
>>> nlp = spacy.load("en_core_web_sm")
>>> one_line_about_text = (
... "Gus Proto is a Python developer"
... " currently working for a London-based Fintech company"
... )
>>> one_line_about_doc = nlp(one_line_about_text)
>>> # Extract children of `developer`
>>> print([token.text for token in one_line_about_doc[5].children])
['a', 'Python', 'working']
>>> # Extract previous neighboring node of `developer`
>>> print (one_line_about_doc[5].nbor(-1))
Python
>>> # Extract next neighboring node of `developer`
>>> print (one_line_about_doc[5].nbor())
currently
>>> # Extract all tokens on the left of `developer`
>>> print([token.text for token in one_line_about_doc[5].lefts])
['a', 'Python']
>>> # Extract tokens on the right of `developer`
>>> print([token.text for token in one_line_about_doc[5].rights])
['working']
>>> # Print subtree of `developer`
>>> print (list(one_line_about_doc[5].subtree))
[a, Python, developer, currently, working, for, a, London, -, based, Fintech,
company]
In these examples, you’ve gotten to know various ways to navigate the dependency tree of a sentence.
Shallow Parsing
Shallow parsing, or chunking, is the process of extracting phrases from unstructured text. This involves chunking groups of adjacent tokens into phrases on the basis of their POS tags. There are some standard well-known chunks such as noun phrases, verb phrases, and prepositional phrases.
Noun Phrase Detection
A noun phrase is a phrase that has a noun as its head. It could also include other kinds of words, such as adjectives, ordinals, and determiners. Noun phrases are useful for explaining the context of the sentence. They help you understand what the sentence is about.
spaCy has the property .noun_chunks on the Doc object. You can use this property to extract noun phrases:
>>> import spacy
>>> nlp = spacy.load("en_core_web_sm")
>>> conference_text = (
... "There is a developer conference happening on 21 July 2019 in London."
... )
>>> conference_doc = nlp(conference_text)
>>> # Extract Noun Phrases
>>> for chunk in conference_doc.noun_chunks:
...     print(chunk)
...
a developer conference
21 July
London
By looking at noun phrases, you can get information about your text. For example, a developer conference indicates that the text mentions a conference, while the date 21 July lets you know that the conference is scheduled for 21 July.
This is yet another method to summarize a text and obtain the most important information without having to actually read it all.
Verb Phrase Detection
A verb phrase is a syntactic unit composed of at least one verb. This verb can be joined by other chunks, such as noun phrases. Verb phrases are useful for understanding the actions that nouns are involved in.
spaCy has no built-in functionality to extract verb phrases, so you’ll need a library called textacy. You can use pip to install textacy:
(venv) $ python -m pip install textacy
Now that you have textacy installed, you can use it to extract verb phrases based on grammatical rules:
>>> import textacy
>>> about_talk_text = (
... "The talk will introduce reader about use"
... " cases of Natural Language Processing in"
... " Fintech, making use of"
... " interesting examples along the way."
... )
>>> patterns = [{"POS": "AUX"}, {"POS": "VERB"}]
>>> about_talk_doc = textacy.make_spacy_doc(
... about_talk_text, lang="en_core_web_sm"
... )
>>> verb_phrases = textacy.extract.token_matches(
... about_talk_doc, patterns=patterns
... )
>>> # Print all verb phrases
>>> for chunk in verb_phrases:
...     print(chunk.text)
...
will introduce
>>> # Extract noun phrase to explain what nouns are involved
>>> for chunk in about_talk_doc.noun_chunks:
...     print(chunk)
...
The talk
reader
use cases
Natural Language Processing
Fintech
use
interesting examples
the way
In this example, the verb phrase will introduce indicates that something will be introduced. By looking at the noun phrases, you can piece together what will be introduced—again, without having to read the whole text.
Named-Entity Recognition
Named-entity recognition (NER) is the process of locating named entities in unstructured text and then classifying them into predefined categories, such as person names, organizations, locations, monetary values, percentages, and time expressions.
You can use NER to learn more about the meaning of your text. For example, you could use it to populate tags for a set of documents in order to improve the keyword search. You could also use it to categorize customer support tickets into relevant categories.
spaCy has the property .ents on Doc objects. You can use it to extract named entities:
>>> import spacy
>>> nlp = spacy.load("en_core_web_sm")
>>> piano_class_text = (
... "Great Piano Academy is situated"
... " in Mayfair or the City of London and has"
... " world-class piano instructors."
... )
>>> piano_class_doc = nlp(piano_class_text)
>>> for ent in piano_class_doc.ents:
...     print(
...         f"""
... {ent.text = }
... {ent.start_char = }
... {ent.end_char = }
... {ent.label_ = }
... spacy.explain('{ent.label_}') = {spacy.explain(ent.label_)}"""
...     )
...
ent.text = 'Great Piano Academy'
ent.start_char = 0
ent.end_char = 19
ent.label_ = 'ORG'
spacy.explain('ORG') = Companies, agencies, institutions, etc.
ent.text = 'Mayfair'
ent.start_char = 35
ent.end_char = 42
ent.label_ = 'LOC'
spacy.explain('LOC') = Non-GPE locations, mountain ranges, bodies of water
ent.text = 'the City of London'
ent.start_char = 46
ent.end_char = 64
ent.label_ = 'GPE'
spacy.explain('GPE') = Countries, cities, states
In the above example, ent is a Span object with various attributes:
- .text gives the Unicode text representation of the entity.
- .start_char denotes the character offset for the start of the entity.
- .end_char denotes the character offset for the end of the entity.
- .label_ gives the label of the entity.
spacy.explain gives descriptive details about each entity label. You can also use displaCy to visualize these entities:
>>> displacy.serve(piano_class_doc, style="ent")
If you open http://127.0.0.1:5000 in your browser, then you’ll be able to see the visualization:
One use case for NER is to redact people’s names from a text. For example, you might want to do this in order to hide personal information collected in a survey. Take a look at the following example:
>>> survey_text = (
... "Out of 5 people surveyed, James Robert,"
... " Julie Fuller and Benjamin Brooks like"
... " apples. Kelly Cox and Matthew Evans"
... " like oranges."
... )
>>> def replace_person_names(token):
...     if token.ent_iob != 0 and token.ent_type_ == "PERSON":
...         return "[REDACTED] "
...     return token.text_with_ws
...
>>> def redact_names(nlp_doc):
...     with nlp_doc.retokenize() as retokenizer:
...         for ent in nlp_doc.ents:
...             retokenizer.merge(ent)
...     tokens = map(replace_person_names, nlp_doc)
...     return "".join(tokens)
...
>>> survey_doc = nlp(survey_text)
>>> print(redact_names(survey_doc))
Out of 5 people surveyed, [REDACTED] , [REDACTED] and [REDACTED] like apples.
[REDACTED] and [REDACTED] like oranges.
In this example, replace_person_names() uses .ent_iob, which gives the IOB code of the named entity tag using inside-outside-beginning (IOB) tagging.
The redact_names() function uses a retokenizer to adjust the tokenizing model. It gets all the tokens and passes the text through map() to replace any target tokens with [REDACTED].
So just like that, you would be able to redact a huge amount of text in seconds, while doing it manually could take many hours. That said, you always need to be careful with redaction, because the models aren’t perfect!
What is a decorator in Python
Oct. 11, 2022, 8:05 p.m.
What is a decorator in Python
A decorator is a design pattern in Python that allows a user to add new functionality to an existing object without modifying its structure.
def my_decorator(func):
    def wrapper():
        print('im in wrapper')
        func()
        print('im going to exit from wrapper')
    return wrapper

@my_decorator
def greet():
    print('welcome')

greet()
Output:
im in wrapper
welcome
im going to exit from wrapper
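One detail worth adding: wrapping a function hides its original name and docstring, because the decorated function is literally replaced by wrapper. The standard library's functools.wraps fixes that. A minimal sketch:
import functools

def my_decorator(func):
    @functools.wraps(func)  # copies func.__name__, __doc__, etc. onto wrapper
    def wrapper(*args, **kwargs):
        print('im in wrapper')
        result = func(*args, **kwargs)
        print('im going to exit from wrapper')
        return result
    return wrapper

@my_decorator
def greet():
    print('welcome')

print(greet.__name__)  # prints 'greet', not 'wrapper'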
How to detect csv, excel, xls files from a URL and save each file separately
Aug. 25, 2022, 8:58 a.m.
How to detect csv, excel, xls files from a URL and save each file separately
import os
import requests
from bs4 import BeautifulSoup, SoupStrainer
from urllib.parse import urljoin

url = 'https://www.dm.usda.gov/smallbus/forecast.htm'
# url = 'https://www2.ed.gov/fund/contract/find/forecast.html'
# url = 'https://www.commerce.gov/oam/vendors/procurement-forecasts'
file_types = ('.xls', '.xlsx', '.csv')
dup = []
file_link = []

response = requests.get(url)
for link in BeautifulSoup(response.content, 'html.parser', parse_only=SoupStrainer('a')):
    if link.has_attr('href') and link['href'].lower().endswith(file_types):
        fil_name = (link.text.replace(' ', '_').replace('[', '_')
                    .replace(']', '_').replace('(', '_').replace(')', '_'))
        # resolve relative links against the page they were found on
        full_path = urljoin(url, link['href'])
        dup.append(full_path)
        file_link.append(fil_name + ',' + full_path)

# remove duplicate links while preserving order
mylist = list(dict.fromkeys(dup))
print('---------------------------------------------------')
print('excel files found in following links')
for me in mylist:
    print(me)
print('---------------------------------------------------')
print('url names with link')
for fname in file_link:
    x = fname.split(",")
    # x[1] is the url of the file, x[0] the file name for saving
    resp = requests.get(x[1])
    print(resp)
    print('saving file for', x[1])
    ext = os.path.splitext(x[1])[1]  # keep the original extension
    with open(x[0] + ext, 'wb') as output:
        output.write(resp.content)
Reading excel file from url using pandas
Aug. 25, 2022, 8:47 a.m.
Reading excel file from url using pandas
install:
pip install pandas
pip install requests
pip install openpyxl
import pandas as pd  # openpyxl must also be installed for .xlsx support

url = 'https://www.dm.usda.gov/smallbus/docs/Agricultural+Marketing+Service+FY+2021+Procurement+Forecast.xlsx'

# optional: download a local copy first
# import requests
# resp = requests.get(url)
# with open('forecast.xlsx', 'wb') as output:
#     output.write(resp.content)

df = pd.read_excel(url, engine='openpyxl')
# show the dataframe
print(df)
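If the workbook has more than one sheet, the sheet_name parameter picks which to load (the sheet name below is a hypothetical example):
# read one named sheet into a DataFrame
df_fy21 = pd.read_excel(url, engine='openpyxl', sheet_name='FY2021')

# read every sheet into a dict of DataFrames keyed by sheet name
all_sheets = pd.read_excel(url, engine='openpyxl', sheet_name=None)
print(list(all_sheets))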
Count number of letters and numbers in a string in python
Dec. 13, 2021, 10:10 a.m.
Count number of letters and numbers in a string in Python
x="hello world! 12345"
from collections import Counter
mycounter = Counter(x)
# removing duplicated letters by using set()
# arranging letters by using sort()
for y in sorted(set(x)):
print(y+'='+str(mycounter[y]))
output:
=2
!=1
1=1
2=1
3=1
4=1
5=1
d=1
e=1
h=1
l=3
o=2
r=1
w=1
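If you only need totals rather than per-character counts, str.isalpha() and str.isdigit() combine neatly with sum():
letters = sum(1 for ch in x if ch.isalpha())
digits = sum(1 for ch in x if ch.isdigit())
print('letters =', letters)  # letters = 10
print('digits =', digits)    # digits = 5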
Switch case in python
Dec. 10, 2021, 9:37 a.m.
Switch case in Python
# switch case in python
def add():
    print('added')

def mul():
    print('multiple')

def nodata():
    print('funct not found')

func_dict = {
    'cond_a': add,
    'cond_b': mul
}

cond = 'cond_a'
func_dict.get(cond, nodata)()
cond = 'cond_c'
func_dict.get(cond, nodata)()
print('------------method 2---------------')
# method two
def calc(operator, x, y):
    return {
        'add': lambda: x + y,
        'sub': lambda: x - y
    }.get(operator, lambda: "operator not found")()

print(calc('add', 1, 2))
print(calc('mul', 1, 2))
Output:
added
funct not found
------------method 2---------------
3
operator not found
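Since Python 3.10 there is also a native construct for this, structural pattern matching. Here is the second method rewritten with match/case, as a sketch:
def calc(operator, x, y):
    match operator:
        case 'add':
            return x + y
        case 'sub':
            return x - y
        case _:  # default branch, like get()'s fallback above
            return 'operator not found'

print(calc('add', 1, 2))  # 3
print(calc('mul', 1, 2))  # operator not found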
Convert given number of days in terms of weeks and days
Nov. 2, 2021, 2:10 a.m.
Convert given number of days in terms of weeks and days
# 32 = 4 weeks + 4 days
# 7  = 1 week
# 8  = 1 week + 1 day
# 6  = 6 days
def find(number_of_days):
    weeks = number_of_days // 7
    days = number_of_days % 7
    print(str(weeks) + ' weeks + ' + str(days) + ' days')

find(32)
output:
4 weeks + 4 days
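The same split can be done in one step with the built-in divmod(), which returns the quotient and remainder together:
weeks, days = divmod(32, 7)
print(str(weeks) + ' weeks + ' + str(days) + ' days')  # 4 weeks + 4 days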
How to remove more than one space in a string in python
July 14, 2021, 10:37 p.m.
How to remove more than one space in a string in Python
import re
re.sub(' +', ' ', 'The  quick   brown fox')
output:
'The quick brown fox'
import re

def spacing_issue(str_title):
    temp = str_title
    try:
        str_title = str_title.strip()
        str_title = re.sub(r' +,', ',', str_title)       # spaces before comma
        str_title = re.sub(r' +/', '/', str_title)       # spaces before forward slash /
        str_title = re.sub(r'/ +', '/', str_title)       # spaces after forward slash /
        str_title = re.sub(r' +:', ':', str_title)       # spaces before colon :
        str_title = re.sub(r':(?=\S)', ': ', str_title)  # no space after colon :
        str_title = re.sub(r' +', ' ', str_title)        # double spacing
        str_title = re.sub(r'\( +', '(', str_title)      # spaces after open brace (
        str_title = re.sub(r' +\)', ')', str_title)      # spaces before closed brace )
        return str_title
    except Exception as e:
        print("error", e)
        return temp

print(spacing_issue('hello ,   world'))
output:
hello, world
Random password generator in python
June 27, 2021, 9:56 a.m.
Random Password Generator in Python
import string as s
from random import choice, randint

ch = s.ascii_letters + s.digits + s.punctuation
# print(ch)
password = "".join(choice(ch) for x in range(randint(8, 16)))
print(password)
output:
a random password of 8 to 16 characters, different on every run
How to create and manipulate sql databases with python
April 18, 2021, 8:27 a.m.
How to Create and Manipulate SQL Databases with Python
install:
pip install mysql-connector-python
pip install pandas
Importing Libraries
As with every project in Python, the very first thing we want to do is import our libraries.
It is best practice to import all the libraries we are going to use at the beginning of the project, so people reading or reviewing our code know roughly what is coming up so there are no surprises.
import mysql.connector
from mysql.connector import Error
import pandas as pd
Connecting to MySQL Server
def create_server_connection(host_name, user_name, user_password):
    connection = None
    try:
        connection = mysql.connector.connect(
            host=host_name,
            user=user_name,
            passwd=user_password
        )
        print("MySQL Database connection successful")
    except Error as err:
        print(f"Error: '{err}'")
    return connection
Creating a re-usable function for code like this is best practice, so that we can use this again and again with minimum effort. Once this is written once you can re-use it in all of your projects in the future too, so future-you will be grateful!
Let's go through this line by line so we understand what's happening here:
The first line is us naming the function (create_server_connection) and naming the arguments that that function will take (host_name, user_name and user_password).
The next line initialises the connection variable to None, so the function still has something to return if the connection attempt fails.
Next we use a Python try-except block to handle any potential errors. The first part tries to create a connection to the server using the mysql.connector.connect() method using the details specified by the user in the arguments. If this works, the function prints a happy little success message.
The except part of the block prints the error which MySQL Server returns, in the unfortunate circumstance that there is an error.
Finally, if the connection is successful, the function returns a connection object.
connection = create_server_connection("localhost", "root", 'mypassword')
Creating a New Database
Now that we have established a connection, our next step is to create a new database on our server.
In this tutorial we will do this only once, but again we will write this as a re-usable function so we have a nice useful function we can re-use for future projects.
def create_database(connection, query):
    cursor = connection.cursor()
    try:
        cursor.execute(query)
        print("Database created successfully")
    except Error as err:
        print(f"Error: '{err}'")
This function takes two arguments, connection (our connection object) and query (a SQL query which we will write in the next step). It executes the query in the server via the connection.
We use the cursor method on our connection object to create a cursor object (MySQL Connector uses an object-oriented programming paradigm, so there are lots of objects inheriting properties from parent objects).
create_database_query="create database school_db"
create_database(connection,create_database_query)
Connecting to the Database
Now that we have created a database in MySQL Server, we can modify our create_server_connection function to connect directly to this database.
Note that it's possible - common, in fact - to have multiple databases on one MySQL Server, so we want to always and automatically connect to the database we're interested in.
def create_db_connection(host_name, user_name, user_password, db_name):
    connection = None
    try:
        connection = mysql.connector.connect(
            host=host_name,
            user=user_name,
            passwd=user_password,
            database=db_name
        )
        print("MySQL Database connection successful")
    except Error as err:
        print(f"Error: '{err}'")
    return connection
This is the exact same function, but now we take one more argument - the database name - and pass that as an argument to the connect() method.
Creating a Query Execution Function
The final function we're going to create (for now) is an extremely vital one - a query execution function. This is going to take our SQL queries, stored in Python as strings, and pass them to the cursor.execute() method to execute them on the server.
def execute_query(connection, query):
cursor = connection.cursor()
try:
cursor.execute(query)
connection.commit()
print("Query successful")
except Error as err:
print(f"Error: '{err}'")
This function is exactly the same as our create_database function from earlier, except that it uses the connection.commit() method to make sure that the commands detailed in our SQL queries are implemented.
This is going to be our workhorse function, which we will use (alongside create_db_connection) to create tables, establish relationships between those tables, populate the tables with data, and update and delete records in our database.
Creating Tables
Now we're all set to start running SQL commands on our server and to start building our database. The first thing we want to do is create the necessary tables.
create_teacher_table = """
CREATE TABLE teacher (
teacher_id INT PRIMARY KEY,
first_name VARCHAR(40) NOT NULL,
last_name VARCHAR(40) NOT NULL,
language_1 VARCHAR(3) NOT NULL,
language_2 VARCHAR(3),
dob DATE,
tax_id INT UNIQUE,
phone_no VARCHAR(20)
);
"""
connection = create_db_connection("localhost", "root", 'mypassword', 'school_db') # Connect to the Database
execute_query(connection, create_teacher_table) # Execute our defined query
Now let's create the remaining tables.
create_client_table = """
CREATE TABLE client (
client_id INT PRIMARY KEY,
client_name VARCHAR(40) NOT NULL,
address VARCHAR(60) NOT NULL,
industry VARCHAR(20)
);
"""
create_participant_table = """
CREATE TABLE participant (
participant_id INT PRIMARY KEY,
first_name VARCHAR(40) NOT NULL,
last_name VARCHAR(40) NOT NULL,
phone_no VARCHAR(20),
client INT
);
"""
create_course_table = """
CREATE TABLE course (
course_id INT PRIMARY KEY,
course_name VARCHAR(40) NOT NULL,
language VARCHAR(3) NOT NULL,
level VARCHAR(2),
course_length_weeks INT,
start_date DATE,
in_school BOOLEAN,
teacher INT,
client INT
);
"""
connection = create_db_connection("localhost", "root", 'mypassword', 'school_db')
execute_query(connection, create_client_table)
execute_query(connection, create_participant_table)
execute_query(connection, create_course_table)
Together with the teacher table, this gives us the four tables needed for our four entities.
Now we want to define the relationships between them and create one more table to handle the many-to-many relationship between the participant and course tables.
We do this in exactly the same way:
alter_participant = """
ALTER TABLE participant
ADD FOREIGN KEY(client)
REFERENCES client(client_id)
ON DELETE SET NULL;
"""
alter_course = """
ALTER TABLE course
ADD FOREIGN KEY(teacher)
REFERENCES teacher(teacher_id)
ON DELETE SET NULL;
"""
alter_course_again = """
ALTER TABLE course
ADD FOREIGN KEY(client)
REFERENCES client(client_id)
ON DELETE SET NULL;
"""
create_takescourse_table = """
CREATE TABLE takes_course (
participant_id INT,
course_id INT,
PRIMARY KEY(participant_id, course_id),
FOREIGN KEY(participant_id) REFERENCES participant(participant_id) ON DELETE CASCADE,
FOREIGN KEY(course_id) REFERENCES course(course_id) ON DELETE CASCADE
);
"""
connection = create_db_connection("localhost", "root", 'mypassword', 'school_db')
execute_query(connection, alter_participant)
execute_query(connection, alter_course)
execute_query(connection, alter_course_again)
execute_query(connection, create_takescourse_table)
Now our tables are created, along with the appropriate constraints, primary key, and foreign key relations.
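As an optional sanity check (not strictly part of the walkthrough), we can ask MySQL to list the tables we just created, using the same cursor pattern as our helper functions:
# Optional check: list the tables in school_db (assumes 'connection' from above)
cursor = connection.cursor()
cursor.execute("SHOW TABLES;")
for table_name in cursor.fetchall():
    print(table_name)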
Inserting Into Tables
The next step is to add some records to the tables. We again use execute_query to feed our existing SQL commands into the server, starting with the teacher table.
insert_teacher = """
INSERT INTO teacher VALUES
(1, 'James', 'Smith', 'ENG', NULL, '1985-04-20', 12345, '+491774553676'),
(2, 'Stefanie', 'Martin', 'FRA', NULL, '1970-02-17', 23456, '+491234567890'),
(3, 'Steve', 'Wang', 'MAN', 'ENG', '1990-11-12', 34567, '+447840921333'),
(4, 'Friederike', 'Müller-Rossi', 'DEU', 'ITA', '1987-07-07', 45678, '+492345678901'),
(5, 'Isobel', 'Ivanova', 'RUS', 'ENG', '1963-05-30', 56789, '+491772635467'),
(6, 'Niamh', 'Murphy', 'ENG', 'IRI', '1995-09-08', 67890, '+491231231232');
"""
connection = create_db_connection("localhost", "root", 'mypassword', 'school_db')
execute_query(connection, insert_teacher)
Let's insert the remaining data into the other tables:
insert_client = """
INSERT INTO client VALUES
(101, 'Big Business Federation', '123 Falschungstraße, 10999 Berlin', 'NGO'),
(102, 'eCommerce GmbH', '27 Ersatz Allee, 10317 Berlin', 'Retail'),
(103, 'AutoMaker AG', '20 Künstlichstraße, 10023 Berlin', 'Auto'),
(104, 'Banko Bank', '12 Betrugstraße, 12345 Berlin', 'Banking'),
(105, 'WeMoveIt GmbH', '138 Arglistweg, 10065 Berlin', 'Logistics');
"""
insert_participant = """
INSERT INTO participant VALUES
(101, 'Marina', 'Berg','491635558182', 101),
(102, 'Andrea', 'Duerr', '49159555740', 101),
(103, 'Philipp', 'Probst', '49155555692', 102),
(104, 'René', 'Brandt', '4916355546', 102),
(105, 'Susanne', 'Shuster', '49155555779', 102),
(106, 'Christian', 'Schreiner', '49162555375', 101),
(107, 'Harry', 'Kim', '49177555633', 101),
(108, 'Jan', 'Nowak', '49151555824', 101),
(109, 'Pablo', 'Garcia', '49162555176', 101),
(110, 'Melanie', 'Dreschler', '49151555527', 103),
(111, 'Dieter', 'Durr', '49178555311', 103),
(112, 'Max', 'Mustermann', '49152555195', 104),
(113, 'Maxine', 'Mustermann', '49177555355', 104),
(114, 'Heiko', 'Fleischer', '49155555581', 105);
"""
insert_course = """
INSERT INTO course VALUES
(12, 'English for Logistics', 'ENG', 'A1', 10, '2020-02-01', TRUE, 1, 105),
(13, 'Beginner English', 'ENG', 'A2', 40, '2019-11-12', FALSE, 6, 101),
(14, 'Intermediate English', 'ENG', 'B2', 40, '2019-11-12', FALSE, 6, 101),
(15, 'Advanced English', 'ENG', 'C1', 40, '2019-11-12', FALSE, 6, 101),
(16, 'Mandarin für Autoindustrie', 'MAN', 'B1', 15, '2020-01-15', TRUE, 3, 103),
(17, 'Français intermédiaire', 'FRA', 'B1', 18, '2020-04-03', FALSE, 2, 101),
(18, 'Deutsch für Anfänger', 'DEU', 'A2', 8, '2020-02-14', TRUE, 4, 102),
(19, 'Intermediate English', 'ENG', 'B2', 10, '2020-03-29', FALSE, 1, 104),
(20, 'Fortgeschrittenes Russisch', 'RUS', 'C1', 4, '2020-04-08', FALSE, 5, 103);
"""
insert_takescourse = """
INSERT INTO takes_course VALUES
(101, 15),
(101, 17),
(102, 17),
(103, 18),
(104, 18),
(105, 18),
(106, 13),
(107, 13),
(108, 13),
(109, 14),
(109, 15),
(110, 16),
(110, 20),
(111, 16),
(114, 12),
(112, 19),
(113, 19);
"""
connection = create_db_connection("localhost", "root", 'mypassword', 'school_db')
execute_query(connection, insert_client)
execute_query(connection, insert_participant)
execute_query(connection, insert_course)
execute_query(connection, insert_takescourse)
Reading Data
Now we have a functional database to work with. As a Data Analyst, you are likely to come into contact with existing databases in the organisations where you work. It will be very useful to know how to pull data out of those databases so it can be fed into your Python data pipeline. This is what we are going to work on next.
For this, we will need one more function, this time using cursor.fetchall() after executing the query, rather than connection.commit(). With this function, we are reading data from the database and not making any changes.
def read_query(connection, query):
cursor = connection.cursor()
result = None
try:
cursor.execute(query)
result = cursor.fetchall()
return result
except Error as err:
print(f"Error: '{err}'")
Again, we are going to implement this in a very similar way to execute_query. Let's try it out with a simple query to see how it works.
q1 = """
SELECT *
FROM teacher;
"""
connection = create_db_connection("localhost", "root", pw, db)
results = read_query(connection, q1)
for result in results:
print(result)
Let's select some data using joins:
q2 = """
SELECT course.course_id, course.course_name, course.language, client.client_name, client.address
FROM course
JOIN client
ON course.client = client.client_id
WHERE course.in_school = FALSE;
"""
connection = create_db_connection("localhost", "root", 'mypassword', 'school_db')
results = read_query(connection, q2)
for result in results:
print(result)
Formatting Output into a List
#Initialise empty list
from_db = []
# Loop over the results and append them into our list
# Returns a list of tuples
for result in results:
    from_db.append(result)
print(from_db)
Formatting Output into a List of Lists
# Returns a list of lists
from_db = []
for result in results:
result = list(result)
from_db.append(result)
print(from_db)
Formatting Output into a pandas DataFrame
For Data Analysts using Python, pandas is our beautiful and trusted old friend. It's very simple to convert the output from our database into a DataFrame, and from there the possibilities are endless!
# Returns a list of lists and then creates a pandas DataFrame
from_db = []
for result in results:
result = list(result)
from_db.append(result)
columns = ["course_id", "course_name", "language", "client_name", "address"]
df = pd.DataFrame(from_db, columns=columns)
print(df)  # in a Jupyter notebook, display(df) gives a nicer rendering
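As an aside, pandas can run the query and build the DataFrame in one step. A minimal sketch using pandas.read_sql with the connection and q2 defined above (pandas may warn that it prefers SQLAlchemy connectables, but a plain DBAPI connection like ours works):
# One-step alternative: pandas executes the query and builds the DataFrame
df = pd.read_sql(q2, connection)
print(df)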
Updating Records
When we are maintaining a database, we will sometimes need to make changes to existing records. In this section we are going to look at how to do that.
Let's say the ILS is notified that one of its existing clients, the Big Business Federation, is moving offices to 23 Fingiertweg, 14534 Berlin. In this case, the database administrator (that's us!) will need to make some changes.
Thankfully, we can do this with our execute_query function alongside the SQL UPDATE statement.
update = """
UPDATE client
SET address = '23 Fingiertweg, 14534 Berlin'
WHERE client_id = 101;
"""
connection = create_db_connection("localhost", "root", 'mypassword', 'school_db')
execute_query(connection, update)
Deleting Records
It is also possible to use our execute_query function to delete records, by using DELETE.
When using SQL with relational databases, we need to be careful with the DELETE operator. This isn't Windows: there is no 'Are you sure you want to delete this?' pop-up, and there is no recycle bin. Once we delete something, it's really gone.
delete_course = """
DELETE FROM course
WHERE course_id = 20;
"""
connection = create_db_connection("localhost", "root", 'mypassword', 'school_db')
execute_query(connection, delete_course)
Creating Records from Lists
We saw when populating our tables that we can use the SQL INSERT command in our execute_query function to insert records into our database.
Given that we're using Python to manipulate our SQL database, it would be useful to be able to take a Python data structure (such as a list) and insert that directly into our database.
This could be useful when we want to store logs of user activity on a social media app we have written in Python, or input from users into a Wiki we have built, for example. There are as many possible uses for this as you can think of.
This method is also more secure if our database is open to our users at any point, as it helps to protect against SQL injection attacks, which can damage or even destroy our whole database.
To do this, we will write a function using the executemany() method, instead of the simpler execute() method we have been using thus far.
def execute_list_query(connection, sql, val):
cursor = connection.cursor()
try:
cursor.executemany(sql, val)
connection.commit()
print("Query successful")
except Error as err:
print(f"Error: '{err}'")
Now that we have the function, we need to define an SQL command ('sql') and a list containing the values we wish to enter into the database ('val'). The values must be stored as a list of tuples, which is a fairly common way to store data in Python.
To add two new teachers to the database, we can write some code like this:
sql = '''
INSERT INTO teacher (teacher_id, first_name, last_name, language_1, language_2, dob, tax_id, phone_no)
VALUES (%s, %s, %s, %s, %s, %s, %s, %s)
'''
val = [
(7, 'Hank', 'Dodson', 'ENG', None, '1991-12-23', 11111, '+491772345678'),
(8, 'Sue', 'Perkins', 'MAN', 'ENG', '1976-02-02', 22222, '+491443456432')
]
connection = create_db_connection("localhost", "root", 'mypassword', 'school_db')
execute_list_query(connection, sql, val)
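As a quick optional check, we can reuse our read_query function from earlier to confirm that the two new teachers were added:
new_teachers = read_query(connection, "SELECT * FROM teacher WHERE teacher_id IN (7, 8);")
for teacher in new_teachers:
    print(teacher)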
How to generate s3 presigned urls using python
Feb. 19, 2021, 4:25 a.m.
335How To Generate S3 PreSigned Urls Using Python
import os
import logging
import boto3
from botocore.client import Config
from botocore.exceptions import ClientError
# Python 3+ should be installed
# pip install boto3
# s3v4
# (Default) Signature Version 4
# v4 algorithm starts with X-Amz-Algorithm
#
# s3
# (Deprecated) Signature Version 2; this only works in some regions and newer regions are not supported
# if you need a signed URL with an expiry longer than 7 days, use Version 2, provided your region supports it
s3_signature ={
'v4':'s3v4',
'v2':'s3'
}
# The variables below are optional: boto3 can read these values from environment
# variables on its own. They are set explicitly here for illustration purposes.
AWS_ACCESS_KEY_ID = os.getenv('AWS_ACCESS_KEY_ID')
AWS_SECRET_ACCESS_KEY = os.getenv('AWS_SECRET_ACCESS_KEY')
AWS_DEFAULT_REGION = os.getenv('AWS_DEFAULT_REGION')
def create_presigned_url(bucket_name, bucket_key, expiration=3600, signature_version=s3_signature['v4']):
"""Generate a presigned URL for the S3 object
:param bucket_name: string
:param bucket_key: string
:param expiration: Time in seconds for the presigned URL to remain valid
:param signature_version: string
:return: Presigned URL as string. If error, returns None.
"""
s3_client = boto3.client('s3',
aws_access_key_id=AWS_ACCESS_KEY_ID,
aws_secret_access_key=AWS_SECRET_ACCESS_KEY,
config=Config(signature_version=signature_version),
region_name=AWS_DEFAULT_REGION
)
try:
response = s3_client.generate_presigned_url('get_object',
Params={'Bucket': bucket_name,
'Key': bucket_key},
ExpiresIn=expiration)
# Optional debug output: show the bucket owner and the keys matching the prefix
print(s3_client.list_buckets()['Owner'])
for key in s3_client.list_objects(Bucket=bucket_name, Prefix=bucket_key)['Contents']:
print(key['Key'])
except ClientError as e:
logging.error(e)
return None
# The response contains the presigned URL
return response
seven_days_as_seconds = 604800
generated_signed_url = create_presigned_url('djangosimplified', 'downloads/whitepaper.pdf', seven_days_as_seconds, s3_signature['v4'])
print(generated_signed_url)
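The same approach works for uploads. Below is a minimal sketch (not part of the original snippet; the bucket name reuses the example above and the key is hypothetical) that presigns a PUT URL by passing 'put_object' instead of 'get_object', with the same client setup as the function above:
def create_presigned_upload_url(bucket_name, bucket_key, expiration=3600):
    """Sketch: presign an upload (PUT) URL using the same v4 signing as above."""
    s3_client = boto3.client('s3',
        aws_access_key_id=AWS_ACCESS_KEY_ID,
        aws_secret_access_key=AWS_SECRET_ACCESS_KEY,
        config=Config(signature_version=s3_signature['v4']),
        region_name=AWS_DEFAULT_REGION)
    try:
        return s3_client.generate_presigned_url('put_object',
            Params={'Bucket': bucket_name, 'Key': bucket_key},
            ExpiresIn=expiration)
    except ClientError as e:
        logging.error(e)
        return None

# hypothetical key, for illustration only
upload_url = create_presigned_upload_url('djangosimplified', 'uploads/new-file.pdf')
print(upload_url)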
Creating a simple currency converter in python by using api
Jan. 25, 2021, 1:20 a.m.
231creating a simple currency converter in python by using api
Rates API is a free service for current and historical foreign exchange rates, built on top of data published by the European Central Bank. It is compatible with any application and any programming language.
install:
pip install requests-html
getting base url
base_url="https://api.exchangeratesapi.io/latest"
import requests
response=requests.get(base_url)
investigating the response
response.ok
response.status_code
response.text
response.content
handling json
response.json()
type(response.json())
import json
json.dumps(response.json(),indent=4)
print(json.dumps(response.json(),indent=4))
response.json().keys()
parameters in the get request
param_url=base_url+"?symbols=USD,GBP"
param_url
response=requests.get(param_url)
response
data=response.json()
data
data['base']
data['date']
data['rates']
param_url=base_url+"?symbols=GBP"+"&"+"base=USD"
param_url
data=requests.get(param_url).json()
data
usd_to_gbp=data['rates']['GBP']
usd_to_gbp
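As a side note, requests can build the query string for us via its params argument; this is a minimal equivalent of the concatenated URL above:
# requests builds ?symbols=GBP&base=USD for us
data = requests.get(base_url, params={'symbols': 'GBP', 'base': 'USD'}).json()
usd_to_gbp = data['rates']['GBP']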
obtaining historical exchange rates
base_url="https://api.exchangeratesapi.io"
base_url
historical_url=base_url+"/2016-01-26"
historical_url
response=requests.get(historical_url)
response.status_code
data=response.json()
print(json.dumps(data,indent=4))
extracting data for a time period
time_period=base_url+'/history'+'?start_at=2017-04-26&end_at=2018-04-26'+'&symbols=GBP'
time_period
data=requests.get(time_period).json()
data
print(json.dumps(data,indent=4,sort_keys=True))
testing the api response to incorrect input
invalid_url=base_url+'/2019-13-05'
invalid_url
response=requests.get(invalid_url)
response
response=requests.get(invalid_url)
response.status_code
response.json()
creating a simple currency converter
date = input('please enter a date (format YYYY-MM-DD) or type latest: ')
base = input('currency to convert from: ')
curr = input('currency to convert to: ')
quantity = float(input('how much {} do you want to convert: '.format(base)))
url="https://api.exchangeratesapi.io/"+date+'?base='+base+'&symbols='+curr
response=requests.get(url)
response
if not response.ok:
    print('error {}'.format(response.status_code))
    print(response.json()['error'])
else:
    data = response.json()
    rate = data['rates'][curr]
    result = quantity * rate
    print('\n{} {} is equal to {:.4} {}, based on the exchange rate on {}'.format(quantity, base, result, curr, data['date']))
output:
please enter a date (format YYYY-MM-DD) or type latest: latest
currency to convert from: USD
currency to convert to: INR
how much USD do you want to convert: 1
1.0 USD is equal to 73.0 INR, based on the exchange rate on 2021-01-22
Internet speed with plotly and matplotlib in python
Jan. 16, 2021, 10:31 a.m.
201Internet speed with plotly and matplotlib in python
install:
pip install matplotlib
pip install speedtest-cli
pip install plotly
By creating a new Speedtest instance as s and testing the upload and download speed, we get both speeds in bits per second. To convert them to megabits per second (Mb/s), and to include the time of the test, we can do the following:
import speedtest
import datetime
import time
s = speedtest.Speedtest()
while True:
time_now = datetime.datetime.now().strftime("%H:%M:%S")
downspeed = round((round(s.download()) / 1048576), 2)
upspeed = round((round(s.upload()) / 1048576), 2)
print(f"time: {time_now}, downspeed: {downspeed} Mb/s, upspeed: {upspeed} Mb/s")
# 60 seconds sleep
time.sleep(60)
output:
time: 12:44:15, downspeed: 95.04 Mb/s, upspeed: 32.85 Mb/s
time: 12:44:35, downspeed: 99.46 Mb/s, upspeed: 38.76 Mb/s
time: 12:44:56, downspeed: 100.59 Mb/s, upspeed: 38.94 Mb/s
Now we will move on to recording this in a CSV file. CSV files are plain text files with values separated by commas.
To write to a CSV file in Python, we import the csv package and open a CSV file (if one doesn't exist, it will be created).
mynet_speed.py
import speedtest
import datetime
import csv
import time
s = speedtest.Speedtest()
with open('test.csv', mode='w') as speedcsv:
csv_writer = csv.DictWriter(speedcsv, fieldnames=['time', 'downspeed', 'upspeed'])
csv_writer.writeheader()
while True:
time_now = datetime.datetime.now().strftime("%H:%M:%S")
downspeed = round((round(s.download()) / 1048576), 2)
upspeed = round((round(s.upload()) / 1048576), 2)
csv_writer.writerow({
'time': time_now,
'downspeed': downspeed,
"upspeed": upspeed
})
# 60 seconds sleep
time.sleep(60)
So while you let this code run for 4-5 minutes, we can discuss what is going on. The with open statement creates a CSV file named test.csv, and the DictWriter writes the headers time, downspeed and upspeed into it. Then the loop begins, and every time speedtest performs a test it writes a new row into the CSV with the time, download speed and upload speed we specified before. So let's go and look at that now.
time,downspeed,upspeed
12:51:16,99.29,38.66
12:51:37,100.67,38.79
12:51:57,99.7,38.79
12:52:17,92.89,31.99
12:52:38,99.4,38.96
Let’s make another python file to generate the graph of our internet connection. This is where we will use matplotlib.
my_net_graph.py
import matplotlib.pyplot as plt
import csv
import matplotlib.ticker as ticker
times = []
download = []
upload = []
with open('test.csv', 'r') as csvfile:
plots = csv.reader(csvfile, delimiter=',')
next(csvfile)
res = [ele for ele in plots if ele != []]
for row in res:
times.append(str(row[0]))
download.append(float(row[1]))
upload.append(float(row[2]))
print(times, "\n", download, "\n", upload)
output:
['12:51:16', '12:51:37', '12:51:57', '12:52:17', '12:52:38']
[99.29, 100.67, 99.7, 92.89, 99.4]
[38.66, 38.79, 38.79, 31.99, 38.96]
So now we are parsing our data! The next(csvfile) call essentially skips the row of headers (that was for our benefit only, not Python's). Now we come to using matplotlib, which I am by no means an expert on; their documentation is extensive.
plt.figure(30)
plt.plot(times, download, label='download', color='r')
plt.plot(times, upload, label='upload', color='b')
plt.xlabel('time')
plt.ylabel('speed in Mb/s')
plt.title("internet speed")
plt.legend()
plt.savefig('test_graph.jpg', bbox_inches='tight')
for matplotlib:
import matplotlib.pyplot as plt
import csv
import matplotlib.ticker as ticker
times = []
download = []
upload = []
with open('test.csv', 'r') as csvfile:
plots = csv.reader(csvfile, delimiter=',')
next(csvfile)
res = [ele for ele in plots if ele != []]
for row in res:
times.append(str(row[0]))
download.append(float(row[1]))
upload.append(float(row[2]))
print(times, "\n", download, "\n", upload)
plt.figure(30)
plt.plot(times, download, label='download', color='r')
plt.plot(times, upload, label='upload', color='b')
plt.xlabel('time')
plt.ylabel('speed in Mb/s')
plt.title("internet speed")
plt.legend()
plt.savefig('test_graph.jpg', bbox_inches='tight')
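A side note on the CSV parsing above: csv.DictReader consumes the header row itself (and skips blank lines), so the next(csvfile) trick is not needed. A minimal equivalent sketch:
import csv

times, download, upload = [], [], []
with open('test.csv', 'r') as csvfile:
    for row in csv.DictReader(csvfile):  # header row is consumed automatically
        times.append(row['time'])
        download.append(float(row['downspeed']))
        upload.append(float(row['upspeed']))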
For plotly:
import plotly
import plotly.graph_objs as go
import csv
times = []
download = []
upload = []
with open('test.csv', 'r') as csvfile:
plots = csv.reader(csvfile, delimiter=',')
next(csvfile)
res = [ele for ele in plots if ele != []]
for row in res:
times.append(str(row[0]))
download.append(float(row[1]))
upload.append(float(row[2]))
print(times, "\n", download, "\n", upload)
# Create traces
trace0 = go.Scatter(
x = times,
y = download,
mode = 'lines+markers',
name = 'Download'
)
trace1 = go.Scatter(
x = times,
y = upload,
mode = 'lines+markers',
name = 'Upload'
)
data = [trace0, trace1]
plotly.offline.plot(data, filename='scatter-mode')
How to extract tables from image in python
Jan. 14, 2021, 9:27 p.m.
557How to extract tables from image in python
install:
pip install opencv-python
pip install pytesseract
pip install openpyxl
Pytesseract : “TesseractNotFound Error: tesseract is not installed or it's not in your path”, how do I fix this?
for Windows:
1. Install tesseract using windows installer available at: https://github.com/UB-Mannheim/tesseract/wiki
2. Note the tesseract path from the installation. The default installation path at the time of writing was C:\Users\USER\AppData\Local\Tesseract-OCR. It may change, so please check the installation path.
3. pip install pytesseract
4. Set the tesseract path in the script before calling image_to_string:
pytesseract.pytesseract.tesseract_cmd = r'C:\Users\USER\AppData\Local\Tesseract-OCR\tesseract.exe'
If you are using Ubuntu install tesseract using following command:
sudo apt-get install tesseract-ocr
For mac:
brew install tesseract
On Linux
sudo apt-get update
sudo apt-get install libleptonica-dev
sudo apt-get install tesseract-ocr tesseract-ocr-dev
sudo apt-get install libtesseract-dev
Then install the Python wrapper using pip (note that it is pytesseract, not the unrelated tesseract packages on PyPI):
pip install pytesseract
import cv2
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import csv
try:
from PIL import Image
except ImportError:
import Image
import pytesseract
The first step is to read in your file from the proper path, using thresholding to convert the input image to a binary image and inverting it to get a black background and white lines and fonts.
#read your file
file=r'/Users/YOURPATH/testcv.png'
img = cv2.imread(file,0)
img.shape
#thresholding the image to a binary image
thresh,img_bin = cv2.threshold(img,128,255,cv2.THRESH_BINARY |cv2.THRESH_OTSU)
#inverting the image
img_bin = 255-img_bin
cv2.imwrite('/Users/YOURPATH/cv_inverted.png',img_bin)
#Plotting the image to see the output
plotting = plt.imshow(img_bin,cmap='gray')
plt.show()
The next step is to define a kernel to detect rectangular boxes, and subsequently the tabular structure. First we define the length of the kernel, and then vertical and horizontal kernels that will later detect all vertical and all horizontal lines.
# Length(width) of kernel as 100th of total width
kernel_len = np.array(img).shape[1]//100
# Defining a vertical kernel to detect all vertical lines of image
ver_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (1, kernel_len))
# Defining a horizontal kernel to detect all horizontal lines of image
hor_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (kernel_len, 1))
# A kernel of 2x2
kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (2, 2))
The next step is the detection of the vertical lines.
#Use vertical kernel to detect and save the vertical lines in a jpg
image_1 = cv2.erode(img_bin, ver_kernel, iterations=3)
vertical_lines = cv2.dilate(image_1, ver_kernel, iterations=3)
cv2.imwrite("/Users/YOURPATH/vertical.jpg",vertical_lines)
#Plot the generated image
plotting = plt.imshow(image_1,cmap='gray')
plt.show()
And now the same for all horizontal lines.
#Use horizontal kernel to detect and save the horizontal lines in a jpg
image_2 = cv2.erode(img_bin, hor_kernel, iterations=3)
horizontal_lines = cv2.dilate(image_2, hor_kernel, iterations=3)
cv2.imwrite("/Users/YOURPATH/horizontal.jpg",horizontal_lines)
#Plot the generated image
plotting = plt.imshow(image_2,cmap='gray')
plt.show()
We combine the horizontal and vertical lines into a third image, weighting both with 0.5. The aim is to get a clear tabular structure so we can detect each cell.
# Combine horizontal and vertical lines in a new third image, with both having same weight.
img_vh = cv2.addWeighted(vertical_lines, 0.5, horizontal_lines, 0.5, 0.0)
#Eroding and thesholding the image
img_vh = cv2.erode(~img_vh, kernel, iterations=2)
thresh, img_vh = cv2.threshold(img_vh,128,255, cv2.THRESH_BINARY | cv2.THRESH_OTSU)
cv2.imwrite("/Users/YOURPATH/img_vh.jpg", img_vh)
bitxor = cv2.bitwise_xor(img,img_vh)
bitnot = cv2.bitwise_not(bitxor)
#Plotting the generated image
plotting = plt.imshow(bitnot,cmap='gray')
plt.show()
After having the tabular structure we use the findContours function to detect the contours. This helps us to retrieve the exact coordinates of each box.
# Detect contours for following box detection
contours, hierarchy = cv2.findContours(img_vh, cv2.RETR_TREE, cv2.CHAIN_APPROX_SIMPLE)
The following function is necessary to get a sequence of the contours and to sort them from top-to-bottom (https://www.pyimagesearch.com/2015/04/20/sorting-contours-using-python-and-opencv/).
def sort_contours(cnts, method="left-to-right"):
# initialize the reverse flag and sort index
reverse = False
i = 0
# handle if we need to sort in reverse
if method == "right-to-left" or method == "bottom-to-top":
reverse = True
# handle if we are sorting against the y-coordinate rather than
# the x-coordinate of the bounding box
if method == "top-to-bottom" or method == "bottom-to-top":
i = 1
# construct the list of bounding boxes and sort them from top to
# bottom
boundingBoxes = [cv2.boundingRect(c) for c in cnts]
(cnts, boundingBoxes) = zip(*sorted(zip(cnts, boundingBoxes),
key=lambda b:b[1][i], reverse=reverse))
# return the list of sorted contours and bounding boxes
return (cnts, boundingBoxes)
# Sort all the contours by top to bottom.
contours, boundingBoxes = sort_contours(contours, method="top-to-bottom")
How to retrieve the cell positions
The following steps are necessary to determine the right location, meaning the proper column and row, of each cell. First, we retrieve the height of each cell and store it in the list heights. Then we take the mean of the heights.
#Creating a list of heights for all detected boxes
heights = [boundingBoxes[i][3] for i in range(len(boundingBoxes))]
#Get mean of heights
mean = np.mean(heights)
Next we retrieve the position, width and height of each contour and store them in the box list. Then we draw rectangles around all our boxes and plot the image. In my case I only did it for boxes smaller than 1000 px wide and 500 px high, to ignore rectangles that are probably not cells, e.g. the table as a whole. These two values depend on your image size, so if your image is a lot smaller or bigger you need to adjust both.
#Create list box to store all boxes in
box = []
# Get position (x,y), width and height for every contour and show the contour on image
for c in contours:
x, y, w, h = cv2.boundingRect(c)
if (w<1000 and h<500):
image = cv2.rectangle(img,(x,y),(x+w,y+h),(0,255,0),2)
box.append([x,y,w,h])
plotting = plt.imshow(image,cmap='gray')
plt.show()
Now that we have every cell, its location, height and width, we need to determine its position within the table, meaning the row and column it belongs to. As long as a box's y-coordinate is within (the previous box's y-coordinate + mean/2), the box is in the same row; as soon as the difference is larger, we know that a new row starts. Columns are logically arranged from left to right.
#Creating two lists to define row and column in which cell is located
row=[]
column=[]
j=0
#Sorting the boxes to their respective row and column
for i in range(len(box)):
if(i==0):
column.append(box[i])
previous=box[i]
else:
if(box[i][1]<=previous[1]+mean/2):
column.append(box[i])
previous=box[i]
if(i==len(box)-1):
row.append(column)
else:
row.append(column)
column=[]
previous = box[i]
column.append(box[i])
print(column)
print(row)
Next we calculate the maximum number of columns (meaning cells) to understand how many columns our final dataframe/table will have.
#calculating the maximum number of cells in any row (and remembering that row)
countcol = 0
index = 0
for i in range(len(row)):
    if len(row[i]) > countcol:
        countcol = len(row[i])
        index = i
After having the maximum number of cells we store the midpoint of each column in a list, create an array and sort the values.
#Retrieving the center of each column, based on the row with the most cells
center = [int(row[index][j][0] + row[index][j][2] / 2) for j in range(len(row[index]))]
center = np.array(center)
center.sort()
At this point, we have all boxes and their values, but as you might see in the output of your row list, the values are not always sorted in the right order. We fix that next: each box is assigned to the column whose center is nearest. The proper sequence is stored in the list finalboxes.
#Regarding the distance to the columns center, the boxes are arranged in respective order
finalboxes = []
for i in range(len(row)):
lis=[]
for k in range(countcol):
lis.append([])
for j in range(len(row[i])):
diff = abs(center-(row[i][j][0]+row[i][j][2]/4))
minimum = min(diff)
indexing = list(diff).index(minimum)
lis[indexing].append(row[i][j])
finalboxes.append(lis)
Let’s extract the values
In the next step we make use of our list finalboxes. We take every image-based box, prepare it for Optical Character Recognition by dilating and eroding it and let pytesseract recognize the containing strings. The loop runs over every cell and stores the value in the outer list.
#from every single image-based cell/box the strings are extracted via pytesseract and stored in a list
outer=[]
for i in range(len(finalboxes)):
for j in range(len(finalboxes[i])):
inner=''
if(len(finalboxes[i][j])==0):
outer.append(' ')
else:
for k in range(len(finalboxes[i][j])):
y,x,w,h = finalboxes[i][j][k][0],finalboxes[i][j][k][1], finalboxes[i][j][k][2],finalboxes[i][j][k][3]
finalimg = bitnot[x:x+h, y:y+w]
kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (2, 1))
border = cv2.copyMakeBorder(finalimg,2,2,2,2, cv2.BORDER_CONSTANT,value=[255,255])
resizing = cv2.resize(border, None, fx=2, fy=2, interpolation=cv2.INTER_CUBIC)
dilation = cv2.dilate(resizing, kernel,iterations=1)
erosion = cv2.erode(dilation, kernel,iterations=1)
pytesseract.pytesseract.tesseract_cmd = 'C:\\Program Files\\Tesseract-OCR\\tesseract.exe' #for windows only
out = pytesseract.image_to_string(erosion)
if(len(out)==0):
out = pytesseract.image_to_string(erosion, config='--psm 3')
inner = inner +" "+ out
outer.append(inner)
The last step is to convert the list to a DataFrame and store it in an Excel file.
#Creating a dataframe of the generated OCR list
arr = np.array(outer)
dataframe = pd.DataFrame(arr.reshape(len(row),countcol))
print(dataframe)
data = dataframe.style.set_properties(align="left")
#Converting it to an Excel file
data.to_excel("/Users/YOURPATH/output.xlsx")
That's it.
Final code after combining all the steps:
import cv2
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import csv
try:
from PIL import Image
except ImportError:
import Image
import pytesseract
#read your file
file=r'/Users/kwikl3arn/Desktop/roseflower.png'
img = cv2.imread(file,0)
img.shape
#thresholding the image to a binary image
thresh,img_bin = cv2.threshold(img,128,255,cv2.THRESH_BINARY | cv2.THRESH_OTSU)
#inverting the image
img_bin = 255-img_bin
cv2.imwrite('/Users/kwikl3arn/Desktop/cv_inverted.png',img_bin)
#Plotting the image to see the output
plotting = plt.imshow(img_bin,cmap='gray')
plt.show()
# Length (width) of kernel as 100th of total width
kernel_len = np.array(img).shape[1]//100
# Defining a vertical kernel to detect all vertical lines of image
ver_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (1, kernel_len))
# Defining a horizontal kernel to detect all horizontal lines of image
hor_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (kernel_len, 1))
# A kernel of 2x2
kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (2, 2))
#Use vertical kernel to detect and save the vertical lines in a jpg
image_1 = cv2.erode(img_bin, ver_kernel, iterations=3)
vertical_lines = cv2.dilate(image_1, ver_kernel, iterations=3)
cv2.imwrite("/Users/kwikl3arn/Desktop/vertical.jpg",vertical_lines)
#Plot the generated image
plotting = plt.imshow(image_1,cmap='gray')
plt.show()
#Use horizontal kernel to detect and save the horizontal lines in a jpg
image_2 = cv2.erode(img_bin, hor_kernel, iterations=3)
horizontal_lines = cv2.dilate(image_2, hor_kernel, iterations=3)
cv2.imwrite("/Users/kwikl3arn/Desktop/horizontal.jpg",horizontal_lines)
#Plot the generated image
plotting = plt.imshow(image_2,cmap='gray')
plt.show()
# Combine horizontal and vertical lines in a new third image, with both having same weight.
img_vh = cv2.addWeighted(vertical_lines, 0.5, horizontal_lines, 0.5, 0.0)
#Eroding and thesholding the image
img_vh = cv2.erode(~img_vh, kernel, iterations=2)
thresh, img_vh = cv2.threshold(img_vh,128,255, cv2.THRESH_BINARY | cv2.THRESH_OTSU)
cv2.imwrite("/Users/kwikl3arn/Desktop/img_vh.jpg", img_vh)
bitxor = cv2.bitwise_xor(img,img_vh)
bitnot = cv2.bitwise_not(bitxor)
#Plotting the generated image
plotting = plt.imshow(bitnot,cmap='gray')
plt.show()
# Detect contours for following box detection
contours, hierarchy = cv2.findContours(img_vh, cv2.RETR_TREE, cv2.CHAIN_APPROX_SIMPLE)
def sort_contours(cnts, method="left-to-right"):
# initialize the reverse flag and sort index
reverse = False
i = 0
# handle if we need to sort in reverse
if method == "right-to-left" or method == "bottom-to-top":
reverse = True
# handle if we are sorting against the y-coordinate rather than
# the x-coordinate of the bounding box
if method == "top-to-bottom" or method == "bottom-to-top":
i = 1
# construct the list of bounding boxes and sort them from top to
# bottom
boundingBoxes = [cv2.boundingRect(c) for c in cnts]
(cnts, boundingBoxes) = zip(*sorted(zip(cnts, boundingBoxes),
key=lambda b:b[1][i], reverse=reverse))
# return the list of sorted contours and bounding boxes
return (cnts, boundingBoxes)
# Sort all the contours by top to bottom.
contours, boundingBoxes = sort_contours(contours, method="top-to-bottom")
#Creating a list of heights for all detected boxes
heights = [boundingBoxes[i][3] for i in range(len(boundingBoxes))]
#Get mean of heights
mean = np.mean(heights)
#Create list box to store all boxes in
box = []
# Get position (x,y), width and height for every contour and show the contour on image
for c in contours:
x, y, w, h = cv2.boundingRect(c)
if (w<1000 and h<500):
image = cv2.rectangle(img,(x,y),(x+w,y+h),(0,255,0),2)
box.append([x,y,w,h])
plotting = plt.imshow(image,cmap='gray')
plt.show()
#Creating two lists to define row and column in which cell is located
row=[]
column=[]
j=0
#Sorting the boxes to their respective row and column
for i in range(len(box)):
if(i==0):
column.append(box[i])
previous=box[i]
else:
if(box[i][1]<=previous[1]+mean/2):
column.append(box[i])
previous=box[i]
if(i==len(box)-1):
row.append(column)
else:
row.append(column)
column=[]
previous = box[i]
column.append(box[i])
print(column)
print(row)
#calculating the maximum number of cells in any row (and remembering that row)
countcol = 0
index = 0
for i in range(len(row)):
    if len(row[i]) > countcol:
        countcol = len(row[i])
        index = i
#Retrieving the center of each column, based on the row with the most cells
center = [int(row[index][j][0] + row[index][j][2] / 2) for j in range(len(row[index]))]
center = np.array(center)
center.sort()
print(center)
#Regarding the distance to the columns center, the boxes are arranged in respective order
finalboxes = []
for i in range(len(row)):
lis=[]
for k in range(countcol):
lis.append([])
for j in range(len(row[i])):
diff = abs(center-(row[i][j][0]+row[i][j][2]/4))
minimum = min(diff)
indexing = list(diff).index(minimum)
lis[indexing].append(row[i][j])
finalboxes.append(lis)
#from every single image-based cell/box the strings are extracted via pytesseract and stored in a list
outer=[]
for i in range(len(finalboxes)):
for j in range(len(finalboxes[i])):
inner=''
if(len(finalboxes[i][j])==0):
outer.append(' ')
else:
for k in range(len(finalboxes[i][j])):
y,x,w,h = finalboxes[i][j][k][0],finalboxes[i][j][k][1], finalboxes[i][j][k][2],finalboxes[i][j][k][3]
finalimg = bitnot[x:x+h, y:y+w]
kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (2, 1))
border = cv2.copyMakeBorder(finalimg,2,2,2,2, cv2.BORDER_CONSTANT,value=[255,255])
resizing = cv2.resize(border, None, fx=2, fy=2, interpolation=cv2.INTER_CUBIC)
dilation = cv2.dilate(resizing, kernel,iterations=1)
erosion = cv2.erode(dilation, kernel,iterations=2)
pytesseract.pytesseract.tesseract_cmd = 'C:\\Program Files\\Tesseract-OCR\\tesseract.exe' #for windows only
out = pytesseract.image_to_string(erosion)
if(len(out)==0):
out = pytesseract.image_to_string(erosion, config='--psm 3')
inner = inner +" "+ out
outer.append(inner)
#Creating a dataframe of the generated OCR list
arr = np.array(outer)
dataframe = pd.DataFrame(arr.reshape(len(row), countcol))
print(dataframe)
data = dataframe.style.set_properties(align="left")
#Converting it to an Excel file
data.to_excel("/Users/kwikl3arn/Desktop/output.xlsx")
Counting repeated characters in a string in python
Nov. 7, 2020, 1:45 a.m.
386Counting repeated characters in a string in Python
check_string = "i am checking this string to see how many times each character appears"
count = {}
for s in check_string:
if s in count:
count[s] += 1
else:
count[s] = 1
for key in count:
if count[key] > 1:
print (key+'=', count[key])
Output:
i= 5
= 12
a= 7
m= 3
c= 5
h= 5
e= 7
n= 3
g= 2
t= 5
s= 5
r= 4
o= 2
p= 2
Another method:
from collections import Counter
string = "ihavesometextbutidontmindsharing"
print(Counter(string))
Output:
{'i': 4, 't': 4, 'e': 3, 'n': 3, 's': 2, 'h': 2, 'm': 2, 'o': 2, 'a': 2, 'd': 2, 'x': 1, 'r': 1, 'u': 1, 'b': 1, 'v': 1, 'g': 1}
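Counter counts every character, including those that appear only once. To mirror the first method's 'repeated only' output, we can filter its items; a small sketch:
from collections import Counter

string = "ihavesometextbutidontmindsharing"
# keep only the characters that occur more than once
repeated = {char: n for char, n in Counter(string).items() if n > 1}
print(repeated)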
Cron job using python
Aug. 27, 2020, 6:08 p.m.
358cron job using python
Cron is a system daemon used to execute desired tasks (in the background) at designated times. A crontab is a simple text file with a list of commands meant to be run at specified time. These commands and their run times are then controlled by cron daemon, which executes them in the system background. Each user has a crontab file which specifies actions and times at which they should be executed, these jobs will run regardless of whether user is actually logged into the system or not. There is also a root crontab for tasks requiring administrative privileges. This system crontab allows scheduling of systemwide tasks such as log rotations and system database updates.
Usually we want to interact with the cron daemon in a controlled way, for instance when we just want to supply a command and set up a cron job without editing the crontab file manually. The python-crontab library provides a simple and effective way to access a crontab from Python, allowing the programmer to load cron jobs as objects, search through them, and save manipulations.
Installation
The package can be installed directly using pip. Make sure you do not install the unrelated crontab package from PyPI by mistake.
pip install python-crontab
Crontab Syntax
Cron uses a specific syntax to define the time schedules. It consists of five fields, which are separated by white spaces. The fields are:
Minute Hour Day Month Day_of_the_Week
The fields can have the following values:
┌───────────── minute (0 - 59)
│ ┌───────────── hour (0 - 23)
│ │ ┌───────────── day of month (1 - 31)
│ │ │ ┌───────────── month (1 - 12)
│ │ │ │ ┌───────────── day of week (0 - 6) (Sunday to Saturday;
│ │ │ │ │ 7 is also Sunday on some systems)
│ │ │ │ │
│ │ │ │ │
* * * * * command to execute
Source: Wikipedia. Cron. Available at https://en.wikipedia.org/wiki/Cron
Cron also accepts special characters, so you can create more complex time schedules. The special characters have the following meanings:
Character     | Meaning
--------------|---------------------------------
Comma         | To separate multiple values
Hyphen        | To indicate a range of values
Asterisk      | To indicate all possible values
Forward slash | To indicate "every"
Let's see some examples:
* * * * *
means: every minute of every hour of every day of the month, for every month, for every day of the week.
0 16 1,10,22 * *
tells cron to run a task at 4 PM (the 16th hour) on the 1st, 10th and 22nd day of every month.
Getting Access to Crontab
According to the crontab help page, there are five ways to include a job in cron. Of them, three work on Linux only, and two can also be used on Windows.
The first way to access cron is by using the username. The syntax is as follows:
cron = CronTab(user='username')
The other two Linux ways are:
cron = CronTab()
# or
cron = CronTab(user=True)
There are two more syntaxes that will also work on Windows.
In the first one, we call a task defined in the file "filename.tab":
cron = CronTab(tabfile='filename.tab')
In the second one, we define the task according to cron's syntax:
cron = CronTab(tab="""* * * * * command""")
How can I get the current user's username in Bash?
for linux
On the command line, enter
whoami
or
echo "$USER"
which command in Linux
The which command locates the executable file associated with a given command by searching for it in the PATH environment variable.
For example, to find the full path of the ping command , you would type the following:
which ping
The output will be something like this:
/bin/ping
How to print the current working directory
To print the current working directory, run the pwd command. The full path of the current working directory will be printed to standard output.
pwd
/home/dilip
Creating a New Job
Once we have accessed cron, we can create a new task by using the following command:
cron.new(command='my command')
Here, my command defines the task to be executed via the command line.
We can also add a comment to our task. The syntax is as follows:
cron.new(command='my command', comment='my comment')
Let's see this in an example:
from crontab import CronTab
cron = CronTab(user='username')
job = cron.new(command='python example1.py')
job.minute.every(1)
cron.write()
In the above code we have first accessed cron via the username, and then created a job that consists of running a Python script named example1.py. In addition, we have set the task to run every minute. The write() function adds our job to cron.
Setting Restrictions
One of the main advantages of using Python's crontab module is that we can set up time restrictions without having to use cron's syntax.
In the example above, we have already seen how to set running the job every minute. The syntax is as follows:
job.minute.every(minutes)
Similarly we could set up the hours:
job.hour.every(hours)
We can also set up the task to be run on certain days of the week. For example:
job.dow.on('SUN')
The above code will tell cron to run the task on Sundays, and the following code will tell cron to schedule the task on Sundays and Fridays:
job.dow.on('SUN', 'FRI')
Similarly, we can tell cron to run the task in specific months. For example:
job.month.during('APR', 'NOV')
This will tell cron to run the program in the months of April and November.
An important thing to consider is that each time we set a time restriction, we nullify the previous one. Thus, for example:
job.hour.every(5)
job.hour.every(7)
The above code will set the final schedule to run every seven hours, cancelling the previous schedule of five hours.
Unless, we append a schedule to a previous one, like this:
job.hour.every(15)
job.hour.also.on(3)
This will set the schedule as every 15 hours, and at 3 AM.
The 'every' condition can be a bit confusing at times. If we write job.hour.every(15), this will be equivalent to * */15 * * *. As we can see, the minutes have not been modified.
If we want to set the minutes field to zero, we can use the following syntax:
job.every(15).hours()
This will set the schedule to 0 */15 * * *. The same applies to the 'day of the month', 'month' and 'day of the week' fields.
Examples:
job.every(2).month is equivalent to 0 0 0 */2 * and job.month.every(2) is equivalent to * * * */2 *
job.every(2).dows is equivalent to 0 0 * * */2 and job.dows.every(2) is equivalent to * * * * */2
We can see the differences in the following example:
from crontab import CronTab
cron = CronTab(user='username')
job1 = cron.new(command='python example1.py')
job1.hour.every(2)
job2 = cron.new(command='python example1.py')
job2.every(2).hours()
for item in cron:
    print(item)
cron.write()
After running the program, the result is as follows:
$ python cron2.py
* */2 * * * python /home/eca/cron/example1.py
0 */2 * * * python /home/eca/cron/example1.py
The program has set the second task's minutes field to zero, and left the first task's minutes at the default value.
Finally, we can set the task to be run every time we boot our machine. The syntax is as follows:
job.every_reboot()
Clearing Restrictions
We can clear all task's restrictions with the following command:
job.clear()
The following code shows how to use the above command:
from crontab import CronTab
cron = CronTab(user='username')
job = cron.new(command='python example1.py', comment='comment')
job.minute.every(5)
for item in cron:
    print(item)
job.clear()
for item in cron:
    print(item)
cron.write()
After running the code we get the following result:
$ python cron3.py
*/5 * * * * python /home/eca/cron/example1.py # comment
* * * * * python /home/eca/cron/example1.py # comment
The schedule has changed from every 5 minutes to the default setting.
Enabling and Disabling a Job
A task can be enabled or disabled using the following commands:
To enable a job:
job.enable()
To disable a job:
job.enable(False)
In order to verify whether a task is enabled or disabled, we can use the following command:
job.is_enabled()
The following example shows how to enable and disable a previously created job, and verify both states:
from crontab import CronTab
cron = CronTab(user='username')
job = cron.new(command='python example1.py', comment='comment')
job.minute.every(1)
cron.write()
print(job.enable())
print(job.enable(False))
The result is as follows:
$ python cron4.py
True
False
Checking Validity
We can easily check whether a task is valid or not with the following command:
job.is_valid()
The following example shows how to use this command:
from crontab import CronTab
cron = CronTab(user='username')
job = cron.new(command='python example1.py', comment='comment')
job.minute.every(1)
cron.write()
print(job.is_valid())
After running the above program, we obtain the validation, as seen in the following figure:
$ python cron5.py
True
Listing All Cron Jobs
All cron jobs, including disabled jobs can be listed with the following code:
for job in cron:
    print(job)
Adding those lines of code to our first example will show our task by printing on the screen the following:
$ python cron6.py
* * * * * python /home/eca/cron/example1.py
Finding a Job
The Python crontab module also allows us to search for tasks based on a selection criterion, which can be based on a command, a comment, or a scheduled time. The syntaxes are different for each case.
Find according to command:
cron.find_command("command name")
Here 'command name' can be a sub-match or a regular expression.
Find according to comment:
cron.find_comment("comment")
Find according to time:
cron.find_time(time schedule)
The following example shows how to find a previously defined task, according to the three criteria previously mentioned:
from crontab import CronTab
cron = CronTab(user='username')
job = cron.new(command='python example1.py', comment='comment')
job.minute.every(1)
cron.write()
iter1 = cron.find_command('exam')
iter2 = cron.find_comment('comment')
iter3 = cron.find_time("*/1 * * * *")
for item1 in iter1:
    print(item1)
for item2 in iter2:
    print(item2)
for item3 in iter3:
    print(item3)
The result is the listing of the same job three times:
$ python cron7.py
* * * * * python /home/eca/cron/example1.py # comment
* * * * * python /home/eca/cron/example1.py # comment
* * * * * python /home/eca/cron/example1.py # comment
As you can see, it correctly finds the cron command each time.
Removing Jobs
Each job can be removed separately. The syntax is as follows:
cron.remove(job)
The following code shows how to remove a task that was previously created. The program first creates the task. Then, it lists all tasks, showing the one just created. After this, it removes the task, and shows the resulting empty list.
from crontab import CronTab
cron = CronTab(user='username')
job = cron.new(command='python example1.py')
job.minute.every(1)
cron.write()
print "Job created"
# list all cron jobs (including disabled ones)
for job in cron:
print job
cron.remove(job)
print "Job removed"
# list all cron jobs (including disabled ones)
for job in cron:
print job
The result is as follows:
$ python cron8.py
Job created
* * * * * python /home/eca/cron/example1.py
Job removed
Jobs can also be removed based on a condition. For example:
cron.remove_all(comment='my comment')
This will remove all jobs where comment='my comment'.
Clearing All Jobs
All cron jobs can be removed at once by using the following command:
cron.remove_all()
The following example will remove all cron jobs and show an empty list.
from crontab import CronTab
cron = CronTab(user='username')
cron.remove_all()
# list all cron jobs (including disabled ones)
for job in cron:
    print(job)
Environmental Variables
We can also define environmental variables specific to our scheduled task and show them on the screen. The variables are saved in a dictionary. The syntax to define a new environmental variable is as follows:
job.env['VARIABLE_NAME'] = 'Value'
If we want to get the values for all the environmental variables, we can use the following syntax:
job.env
The example below defines two new environmental variables for the task 'user', and shows their value on the screen. The code is as follows:
from crontab import CronTab
cron = CronTab(user='username')
job = cron.new(command='python example1.py')
job.minute.every(1)
job.env['MY_ENV1'] = 'A'
job.env['MY_ENV2'] = 'B'
cron.write()
print(job.env)
After running the above program, we get the following result:
$ python cron9.py
MY_ENV1=A
MY_ENV2=B
In addition, Cron-level environment variables are stored in 'cron.env'.
Log Functionality
The log functionality will read a cron log backwards to find the last run instances of your crontab and cron jobs. The returned entries are limited to the user the crontab belongs to.
cron = CronTab(user='root')
for d in cron.log:
    print(d['pid'] + ' - ' + d['date'])
Each job can return a log iterator too; these are filtered, so you can see when the last execution was.
for d in cron.find_command('echo')[0].log:
    print(d['pid'] + ' - ' + d['date'])
Schedule Functionality
If you have croniter python module installed, you will have access to a schedule on each job. For example if you want to know when a job will next run:
schedule = job.schedule(date_from=datetime.now())
This creates a croniter-based schedule for the job, starting from the specified time. The default date_from is the current date/time if not specified. Next, we can get the datetime of the next run:
datetime = schedule.get_next()
Or the previous:
datetime = schedule.get_prev()
The get methods work in the same way as in plain croniter, except that they return datetime objects by default instead of floats. If you want the original behaviour, pass float into the method when calling:
datetime = schedule.get_current(float)
If you don't have the croniter module installed, you'll get an ImportError the first time you try to use the schedule function on your cron job object.
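Putting those pieces together, here is a minimal sketch (assuming python-crontab and croniter are installed, and reusing the example1.py job from the earlier examples):
from datetime import datetime
from crontab import CronTab

cron = CronTab(user=True)
job = cron.new(command='python example1.py')
job.minute.every(15)

# no cron.write() is needed just to inspect the schedule
schedule = job.schedule(date_from=datetime.now())
print(schedule.get_next())   # datetime of the next run
print(schedule.get_prev())   # datetime of the previous run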
CronManager Example
import argparse
import os ,sys
import logging
from crontab import CronTab
"""
Task Scheduler
==========
This module manages periodic tasks using cron.
"""
class CronManager:
def __init__(self):
self.cron = CronTab(user=True)
def add_minutely(self, name, user, command, environment=None):
"""
Add a cron task that runs every two minutes
"""
cron_job = self.cron.new(command=command, user=user)
cron_job.minute.every(2)
cron_job.enable()
self.cron.write()
if self.cron.render():
    print(self.cron.render())
return True
def add_hourly(self, name, user, command, environment=None):
"""
Add an hourly cron task
"""
cron_job = self.cron.new(command=command, user=user)
cron_job.minute.on(0)
cron_job.hour.during(0,23)
cron_job.enable()
self.cron.write()
if self.cron.render():
    print(self.cron.render())
return True
def add_daily(self, name, user, command, environment=None):
"""
Add a daily cron task
"""
cron_job = self.cron.new(command=command, user=user)
cron_job.minute.on(0)
cron_job.hour.on(0)
cron_job.enable()
self.cron.write()
if self.cron.render():
    print(self.cron.render())
return True
def add_weekly(self, name, user, command, environment=None):
"""
Add a weekly cron task
"""
cron_job = self.cron.new(command=command)
cron_job.minute.on(0)
cron_job.hour.on(0)
cron_job.dow.on(1)
cron_job.enable()
self.cron.write()
if self.cron.render():
    print(self.cron.render())
return True
def add_monthly(self, name, user, command, environment=None):
"""
Add a monthly cron task
"""
cron_job = self.cron.new(command=command)
cron_job.minute.on(0)
cron_job.hour.on(0)
cron_job.day.on(1)
cron_job.month.during(1,12)
cron_job.enable()
self.cron.write()
if self.cron.render():
    print(self.cron.render())
return True
def add_quarterly(self, name, user, command, environment=None):
"""
Add a quarterly cron task
"""
cron_job = self.cron.new(command=command)
cron_job.minute.on(0)
cron_job.hour.on(0)
cron_job.day.on(1)
cron_job.month.on(3,6,9,12)
cron_job.enable()
self.cron.write()
if self.cron.render():
    print(self.cron.render())
return True
def add_annually(self, name, user, command, environment=None):
"""
Add a yearly cron task
"""
cron_job = self.cron.new(command=command)
cron_job.minute.on(0)
cron_job.hour.on(0)
cron_job.day.on(1)  # run on 1 December each year
cron_job.month.on(12)
cron_job.enable()
self.cron.write()
if self.cron.render():
    print(self.cron.render())
return True
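A hypothetical usage of the CronManager sketch above (note that the name and environment arguments are accepted but not used by these methods, and the user crontab of the current user is assumed):
if __name__ == '__main__':
    manager = CronManager()
    # schedule a hypothetical nightly script at midnight
    manager.add_daily(name='nightly-report', user='username',
                      command='python /home/username/report.py')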
Split urls into components in python
Aug. 27, 2020, 5:37 p.m.
331Split URLs into Components in python
The urllib.parse module provides functions for manipulating URLs and their component parts, to either break them down or build them up.
Parsing
The return value from the urlparse() function is a ParseResult object that acts like a tuple with six elements.
urllib_parse_urlparse.py
from urllib.parse import urlparse
url = 'http://netloc/path;param?query=arg#frag'
parsed = urlparse(url)
print(parsed)
The parts of the URL available through the tuple interface are the scheme, network location, path, path segment parameters (separated from the path by a semicolon), query, and fragment.
RUN:
python3 urllib_parse_urlparse.py
output:
ParseResult(scheme='http', netloc='netloc', path='/path',
params='param', query='query=arg', fragment='frag')
Although the return value acts like a tuple, it is really based on a namedtuple, a subclass of tuple that supports accessing the parts of the URL via named attributes as well as indexes. In addition to being easier for the programmer to use, the attribute API also offers access to several values not available in the tuple API.
urllib_parse_urlparseattrs.py
from urllib.parse import urlparse
url = 'http://user:[email protected]:80/path;param?query=arg#frag'
parsed = urlparse(url)
print('scheme :', parsed.scheme)
print('netloc :', parsed.netloc)
print('path :', parsed.path)
print('params :', parsed.params)
print('query :', parsed.query)
print('fragment:', parsed.fragment)
print('username:', parsed.username)
print('password:', parsed.password)
print('hostname:', parsed.hostname)
print('port :', parsed.port)
The username and password are available when present in the input URL, and set to None when not. The hostname is the same value as netloc, in all lower case and with the port value stripped. And the port is converted to an integer when present and None when not.
RUN:
python3 urllib_parse_urlparseattrs.py
output:
scheme : http
netloc : user:[email protected]:80
path : /path
params : param
query : query=arg
fragment: frag
username: user
password: pwd
hostname: netloc
port : 80
The urlsplit() function is an alternative to urlparse(). It behaves a little differently, because it does not split the parameters from the URL. This is useful for URLs following RFC 2396, which supports parameters for each segment of the path.
urllib_parse_urlsplit.py
from urllib.parse import urlsplit
url = 'http://user:[email protected]:80/p1;para/p2;para?query=arg#frag'
parsed = urlsplit(url)
print(parsed)
print('scheme :', parsed.scheme)
print('netloc :', parsed.netloc)
print('path :', parsed.path)
print('query :', parsed.query)
print('fragment:', parsed.fragment)
print('username:', parsed.username)
print('password:', parsed.password)
print('hostname:', parsed.hostname)
print('port :', parsed.port)
Since the parameters are not split out, the tuple API will show five elements instead of six, and there is no params attribute.
RUN:
python3 urllib_parse_urlsplit.py
Output:
SplitResult(scheme='http', netloc='user:[email protected]:80',
path='/p1;para/p2;para', query='query=arg', fragment='frag')
scheme : http
netloc : user:[email protected]:80
path : /p1;para/p2;para
query : query=arg
fragment: frag
username: user
password: pwd
hostname: netloc
port : 80
To simply strip the fragment identifier from a URL, such as when finding a base page name from a URL, use urldefrag().
urllib_parse_urldefrag.py
from urllib.parse import urldefrag
original = 'http://netloc/path;param?query=arg#frag'
print('original:', original)
d = urldefrag(original)
print('url :', d.url)
print('fragment:', d.fragment)
The return value is a DefragResult, based on namedtuple, containing the base URL and the fragment.
RUN:
python3 urllib_parse_urldefrag.py
Output:
original: http://netloc/path;param?query=arg#frag
url : http://netloc/path;param?query=arg
fragment: frag
Unparsing
There are several ways to assemble the parts of a split URL back together into a single string. The parsed URL object has a geturl() method.
urllib_parse_geturl.py
from urllib.parse import urlparse
original = 'http://netloc/path;param?query=arg#frag'
print('ORIG :', original)
parsed = urlparse(original)
print('PARSED:', parsed.geturl())
geturl() only works on the object returned by urlparse() or urlsplit().
RUN:
python3 urllib_parse_geturl.py
Output:
ORIG : http://netloc/path;param?query=arg#frag
PARSED: http://netloc/path;param?query=arg#frag
A regular tuple containing strings can be combined into a URL with urlunparse().
urllib_parse_urlunparse.py
from urllib.parse import urlparse, urlunparse
original = 'http://netloc/path;param?query=arg#frag'
print('ORIG :', original)
parsed = urlparse(original)
print('PARSED:', type(parsed), parsed)
t = parsed[:]
print('TUPLE :', type(t), t)
print('NEW :', urlunparse(t))
While the ParseResult returned by urlparse() can be used as a tuple, this example explicitly creates a new tuple to show that urlunparse() works with normal tuples, too.
RUN:
python3 urllib_parse_urlunparse.py
Output:
ORIG : http://netloc/path;param?query=arg#frag
PARSED: <class 'urllib.parse.ParseResult'>
ParseResult(scheme='http', netloc='netloc', path='/path',
params='param', query='query=arg', fragment='frag')
TUPLE : <class 'tuple'> ('http', 'netloc', '/path', 'param',
'query=arg', 'frag')
NEW : http://netloc/path;param?query=arg#frag
If the input URL included superfluous parts, those may be dropped from the reconstructed URL.
urllib_parse_urlunparseextra.py
from urllib.parse import urlparse, urlunparse
original = 'http://netloc/path;?#'
print('ORIG :', original)
parsed = urlparse(original)
print('PARSED:', type(parsed), parsed)
t = parsed[:]
print('TUPLE :', type(t), t)
print('NEW :', urlunparse(t))
In this case, parameters, query, and fragment are all missing in the original URL. The new URL does not look the same as the original, but is equivalent according to the standard.
RUN:
python3 urllib_parse_urlunparseextra.py
Output:
ORIG : http://netloc/path;?#
PARSED: <class 'urllib.parse.ParseResult'>
ParseResult(scheme='http', netloc='netloc', path='/path',
params='', query='', fragment='')
TUPLE : <class 'tuple'> ('http', 'netloc', '/path', '', '', '')
NEW : http://netloc/path
Joining
In addition to parsing URLs, urllib.parse includes urljoin() for constructing absolute URLs from relative fragments.
urllib_parse_urljoin.py
from urllib.parse import urljoin
print(urljoin('http://www.example.com/path/file.html',
'anotherfile.html'))
print(urljoin('http://www.example.com/path/file.html',
'../anotherfile.html'))
In the example, the relative portion of the path ("../") is taken into account when the second URL is computed.
RUN:
python3 urllib_parse_urljoin.py
Output:
http://www.example.com/path/anotherfile.html
http://www.example.com/anotherfile.html
Non-relative paths are handled in the same way as by os.path.join().
urllib_parse_urljoin_with_path.py
from urllib.parse import urljoin
print(urljoin('http://www.example.com/path/',
'/subpath/file.html'))
print(urljoin('http://www.example.com/path/',
'subpath/file.html'))
If the path being joined to the URL starts with a slash (/), it resets the URL's path to the top level. If it does not start with a slash, it is appended to the end of the path for the URL.
RUN:
python3 urllib_parse_urljoin_with_path.py
Output:
http://www.example.com/subpath/file.html
http://www.example.com/path/subpath/file.html
Encoding Query Arguments
Before arguments can be added to a URL, they need to be encoded.
urllib_parse_urlencode.py
from urllib.parse import urlencode
query_args = {
'q': 'query string',
'foo': 'bar',
}
encoded_args = urlencode(query_args)
print('Encoded:', encoded_args)
Encoding replaces special characters like spaces to ensure they are passed to the server using a format that complies with the standard.
RUN:
python3 urllib_parse_urlencode.py
Output:
Encoded: q=query+string&foo=bar
To pass a sequence of values using separate occurrences of the variable in the query string, set doseq to True when calling urlencode().
urllib_parse_urlencode_doseq.py
from urllib.parse import urlencode
query_args = {
'foo': ['foo1', 'foo2'],
}
print('Single :', urlencode(query_args))
print('Sequence:', urlencode(query_args, doseq=True))
The result is a query string with several values associated with the same name.
RUN:
python3 urllib_parse_urlencode_doseq.py
Output:
Single : foo=%5B%27foo1%27%2C+%27foo2%27%5D
Sequence: foo=foo1&foo=foo2
To decode the query string, use parse_qs() or parse_qsl().
urllib_parse_parse_qs.py
from urllib.parse import parse_qs, parse_qsl
encoded = 'foo=foo1&foo=foo2'
print('parse_qs :', parse_qs(encoded))
print('parse_qsl:', parse_qsl(encoded))
The return value from parse_qs() is a dictionary mapping names to values, while parse_qsl() returns a list of tuples containing a name and a value.
RUN:
python3 urllib_parse_parse_qs.py
Output:
parse_qs : {'foo': ['foo1', 'foo2']}
parse_qsl: [('foo', 'foo1'), ('foo', 'foo2')]
Special characters within the query arguments that might cause parse problems with the URL on the server side are "quoted" when passed to urlencode(). To quote them locally to make safe versions of the strings, use the quote() or quote_plus() functions directly.
urllib_parse_quote.py
from urllib.parse import quote, quote_plus, urlencode
url = 'http://localhost:8080/~hellmann/'
print('urlencode() :', urlencode({'url': url}))
print('quote() :', quote(url))
print('quote_plus():', quote_plus(url))
The quoting implementation in quote_plus() is more aggressive about the characters it replaces.
RUN:
python3 urllib_parse_quote.py
Output:
urlencode() : url=http%3A%2F%2Flocalhost%3A8080%2F~hellmann%2F
quote() : http%3A//localhost%3A8080/~hellmann/
quote_plus(): http%3A%2F%2Flocalhost%3A8080%2F~hellmann%2F
To reverse the quote operations, use unquote() or unquote_plus(), as appropriate.
urllib_parse_unquote.py
from urllib.parse import unquote, unquote_plus
print(unquote('http%3A//localhost%3A8080/%7Ehellmann/'))
print(unquote_plus(
'http%3A%2F%2Flocalhost%3A8080%2F%7Ehellmann%2F'
))
The encoded value is converted back to a normal string URL.
RUN:
python3 urllib_parse_unquote.py
Output:
http://localhost:8080/~hellmann/
http://localhost:8080/~hellmann/
Convert python dictionary to a json string
July 16, 2020, 3:14 a.m.
334Convert Python Dictionary to a JSON string
test.py
import json
python_information = {'id': '1234567890', 'name': 'Naruto', 'job': 'Software Engineer', 'languages': [{'English': 'Professional'}, {'Japanese': 'Professional'}, {'Korean': 'Native'}]}
# Convert Python Dictionary into a JSON String:
p = json.dumps(python_information)
print(p)
Output:
$ python test.py
{"id": "1234567890", "name": "Naruto", "job": "Software Engineer", "languages": [{"English": "Professional"}, {"Japanese": "Professional"}, {"Korean": "Native"}]}
Note:
*If you try print(p["id"]), it is going to fail with a TypeError. Why? p is not a Python dictionary anymore; it is a plain string.
- The Python dictionary's single-quoted data is changed to a JSON string with double quotes.
- A JSON string is only valid with double quotes.
- A JSON string is not a data structure. Do not confuse it with a Python dictionary.
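A small sketch of the round trip (with the dictionary literal shortened here): indexing the dumped string raises TypeError, while json.loads turns it back into an indexable dictionary.
import json

info = {'id': '1234567890', 'name': 'Naruto'}
p = json.dumps(info)

try:
    print(p["id"])              # p is a str, so this raises TypeError
except TypeError as e:
    print('cannot index a JSON string:', e)

restored = json.loads(p)        # parse the JSON text back into a dict
print(restored["id"])           # 1234567890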
Convert json string to python dictionary
July 16, 2020, 3:09 a.m.
205Convert JSON string to Python dictionary
A JSON string is JavaScript Object Notation.
A JSON string is a serialization format.
A Python dictionary is a data structure that implements all of its own algorithms.
A Python dictionary's key can be any hashable object; a JSON object's key can only be a string.
*JSON is text; dictionaries are a data structure in memory.
import json
json_information = """
{
"id": "1234567890",
"name": "Naruto",
"job": "Software Engineer",
"languages": [
{
"English": "Professional"
},
{
"Japanese": "Professional"
},
{
"Korean": "Native"
}
]
}
"""
# Convert a JSON string to Python dictionary:
j = json.loads(json_information)
print(j["id"])
print(j["name"])
print(j["job"])
print(j["languages"])
print(j)
Output:
1234567890
Naruto
Software Engineer
[{'English': 'Professional'}, {'Japanese': 'Professional'}, {'Korean': 'Native'}]
{'id': '1234567890', 'name': 'Naruto', 'job': 'Software Engineer', 'languages': [{'English': 'Professional'}, {'Japanese': 'Professional'}, {'Korean': 'Native'}]}
Python tools to write better code
April 8, 2020, 1:12 p.m.
510Python Tools To Write Better Code
The Python community maintains a set of tools that are helpful in every project. They provide quick feedback about your code's health and how closely it sticks to standards and best practices.
These tools are:
1)pep8
style checker
2)pyflakes
checks source code for errors
3)mccabe
complexity checker
4)flake8
code checker (pep8, pyflakes, mccabe, and third-party plugins to check the style and quality of some python code)
5)Pylint
Checks for coding standards, errors and duplicated code.
6)Coverage
measure effectiveness of tests
7)Black
The uncompromising Python code formatter
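As a quick sketch of how a typical workflow strings a few of these together (assuming the tools are installed and a pytest test suite exists; the package name here is a placeholder):
pip install flake8 black coverage
flake8 mypackage/            # style, error, and complexity checks (pep8/pyflakes/mccabe)
black mypackage/             # reformat the code in place
coverage run -m pytest       # run the tests under coverage
coverage report              # print per-file coverage percentages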
How to operations and usage of sets on python
Nov. 16, 2019, 8:30 a.m.
370How to Operations and Usage of Sets on Python
- Set is a built-in data type (data structure) in python.
- It only stores unique elements.
- It does not contain duplicate elements.
- We can iterate over sets.
What is a set ?
How to create a set in python?
# case1: empty set with built-in keyword
s = set()
print(s)
# Output: set([])
print(type(s))
# Output: set
# case2: set with initial data
s = {1, 'a', '@', 'batta', 2.22}
print(s)
# Output: set([2.22, 'batta', 1, 'a', '@'])
print(type(s))
# Output: set
How to add an element to a set in python ?
s = {1,2,3,4,5,6,'a'}
print(s)
# Output: set(['a', 1, 2, 3, 4, 5, 6])
s.add(100)
print(s)
# Output: set(['a', 1, 2, 3, 4, 5, 6, 100])
How to remove an element from set in python ?
s = {1,2,3,4,5,6,'a'}
r = s.remove(5)
print(s)
# Output: set(['a', 1, 2, 3, 4, 6])
print(r)
# Output: None
s = {1,2,3,4,5,6,'a'}
r = s.discard(5)
print(s)
# Output: set(['a', 1, 2, 3, 4, 6])
print(r)
# Output: None
How to remove and get an arbitrary element from a set in python ?
Sets are unordered, so pop() removes and returns an arbitrary element; there is no "last" element.
s = {1,2,3,4,5,6,'a'}
r = s.pop()
print(s)
# Output: set([1, 2, 3, 4, 5, 6])
print(r)
# Output: 'a'
How to clear or empty the set in python?
s = {'a', 1,2,3,4,5,6}
s.clear()
print(s)
# Output: set([])
How to copy set1 to set2 ?
s1 = {1,2,3,4, 'a', 'hello', 1.255}
# id is a built-in function; it returns the identity (the memory address in CPython) of a python object.
s2 = s1.copy()
print(id(s1))
# Output: 139864877446360
print(id(s2))
# Output: 139864877446128
# if we do not use 'copy'
s3 = {1,2,3,4}
s4 = s3
print(id(s3))
# Output: 139864878456528
print(id(s4))
# Output: 139864878456528
How to find difference of two sets in python ?
s1 = {1,2,3,4,5,6,7,8}
s2 = {4,5,6,9, 10, 11, 12, 13, 14}
difference = s1.difference(s2)
print(difference)
# Output: set([8, 1, 2, 3, 7])
# shortcut method
difference = s1 - s2
print(difference)
# Output: set([8, 1, 2, 3, 7])
# It does not modify the initial data in s1, s2
print(s1)
# Output: set([1, 2, 3, 4, 5, 6, 7, 8])
print(s2)
# Output: set([4, 5, 6, 9, 10, 11, 12, 13, 14])
How to update the difference of set1, set2 into set1 ?
s1 = {1,2,3,4,5,6,7,8}
s2 = {4,5,6,9, 10, 11, 12, 13, 14}
difference = s1.difference_update(s2)
print(difference)
# Output: None
print(s1)
# Output: set([1, 2, 3, 7, 8])
print(s2)
# Output: set([4, 5, 6, 9, 10, 11, 12, 13, 14])
How to find the common elements in set1, set2 in python?
s1 = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11}
s2 = {4, 5, 6, 9, 10, 11, 12, 13, 14}
intersection = s1.intersection(s2)
print(intersection)
# Output: set([4, 5, 6, 9, 10, 11])
How to check whether set1 and set2 contains common elements or not ?
# find if sets disjoint or not
set1 = set([1, 2.55, 3, "a"])
set2 = set(["hello", "world", "batta"])
is_disjoint = set1.isdisjoint(set2)
print(is_disjoint)
# Output: True
# Because there are no common elements
set1 = set([1, 2.55, 3, "a"])
set2 = set(["hello", 1, "a"])
is_disjoint = set1.isdisjoint(set2)
print(is_disjoint)
# Output: False
# common elements are 1, "a"
How to check if given set s1 is subset of other set s2 or not ?
# case1
set1 = {0, 1, 2, 3, "abcd", 5, 6, 7, 8, 9} # superset
set2 = {1, 3, 5} # subset
result = set2.issubset(set1)
print(result)
# Output: True
# case2
set1 = {0, 1, 2, 3, "abcd", 5, 6, 7, 8, 9} # superset
set2 = {1, 3, 5} # subset
result = set1.issubset(set2)
print(result)
# Output: False
# case3
set1 = {0, 1, 2, 3, "abcd", 5, 6, 7, 8, 9}
set2 = {1, 3, 5, "extra element"}
result = set2.issubset(set1)
print(result)
# Output: False
How to check if given set s1 is superset of other set s2 or not ?
# case1
set1 = {0, 1, 2, 3, "abcd", 5, 6, 7, 8, 9} # superset
set2 = {1, 3, 5} # subset
result = set1.issuperset(set2)
print(result)
# Output: True
# case2
set1 = {0, 1, 2, 3, "abcd", 5, 6, 7, 8, 9} # superset
set2 = {1, 3, 5} # subset
result = set2.issuperset(set1)
print(result)
# Output: False
# case3
set1 = {0, 1, 2, 3, "abcd", 5, 6, 7, 8, 9}
set2 = {1, 3, 5, "extra element"}
result = set1.issuperset(set2)
print(result)
# Output: False
How to combine(union) set1 and set2 ?
set1 = {0, 1, 2, 3, 5, 6, 7, 8, 9, 'abcd'}
set2 = {1, 3, 5, 'extra element'}
result = set1.union(set2) # it will not modify the original sets & returns union of two sets.
print(result)
# Output: {0, 1, 2, 3, 5, 6, 7, 8, 9, 'abcd', 'extra element'}
# check if original sets changed or not
print(set1)
# Output: {0, 1, 2, 3, 5, 6, 7, 8, 9, 'abcd'}
print(set2)
# Output: {1, 3, 5, 'extra element'}
How to update set1 with other set set2 ?
set1 = {0, 1, 2, 3, 5, 6, 7, 8, 9, 'abcd'}
set2 = {1, 3, 5, 'extra element'}
result = set1.update(set2) # it will update the operating set & returns None.
print(result)
# Output: None
print(set1)
# Output: {'abcd', 1, 2, 3, 5, 6, 7, 8, 9, 'extra element', 0}
print(set2)
# Output: {1, 3, 5, 'extra element'}
How to find symmetric difference of two sets ?
The symmetric difference, also known as the disjunctive union, of two sets is the set of elements which are in either of the sets but not in their intersection.
set1 = {0, 1, 2, 3, 5, 6, 7, 8, 9, 'abcd'}
set2 = {1, 3, 5, 'extra element'}
result = set1.symmetric_difference(set2) # It will not change the original sets.
# it removes common elements in both sets and returns the union of remaining elements.
print(result)
# Output: {0, 2, 6, 7, 8, 9, 'abcd', 'extra element'}
# print original sets
print(set1)
# Output: {0, 1, 2, 3, 5, 6, 7, 8, 9, 'abcd'}
print(set2)
# Output: {1, 3, 5, 'extra element'}
How to find symmetric difference of two sets and update the result in same set?
set1 = {0, 1, 2, 3, 5, 6, 7, 8, 9, 'abcd'}
set2 = {1, 3, 5, 'extra element'}
result = set1.symmetric_difference_update(set2)
# it removes common elements in both sets and updates set1 with union of remaining elements.
print(result)
# Output: None
# print original sets
print(set1)
# Output: {0, 2, 6, 7, 8, 9, 'abcd', 'extra element'}
print(set2)
# Output: {1, 3, 5, 'extra element'}
How to use "map" keyword or function in python?
Nov. 15, 2019, 10:45 p.m.
227How to use "map" keyword or function in python?
- "map' is a built-in function in python.
- It takes first argument as a function or a callable object.
- All other arguments must be sequence of elements otherwise it raises an error.
- we can reduce the number of lines code using map.
- It's like functional programming technique.
- map return a list of the results of applying the function to the items of the argument sequence(s).
- If more than one sequence is given, the function is called with an argument list consisting of the corresponding item of each sequence, substituting 'None' for missing values when not all sequences have the same length.
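Note that the list-returning behaviour described above (and the outputs shown in the examples below) is Python 2's map. In Python 3, map returns a lazy iterator, map(None, ...) raises a TypeError, and sequences of different lengths are truncated to the shortest instead of padded with None. A minimal sketch of the Python 3 equivalents:
# Python 3: map returns an iterator, so wrap it in list() to materialize the values
l = [0, 1, 2, 3, 4]
print(list(map(str, l)))             # ['0', '1', '2', '3', '4']

# Python 3 removed map(None, ...); use zip() to pair up items instead
print(list(zip([1, 2, 3], "abc")))   # [(1, 'a'), (2, 'b'), (3, 'c')]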
Convert list of numbers to list of strings without using map
l = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
output = []
for i in l:
    output.append(str(i))
print(output)
# Output: ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']
Convert list of numbers to list of strings using map
# I love to use it
l = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
output = map(str, l)
print(output)
# Output: ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']
Return list of elements by adding numbers of two lists based on their indexes - Traditional way
l1 = [1, 2, 3, 4, 5, 6]
l2 = [10, 20, 30, 40, 50, 60]
output = []
for i, j in zip(l1, l2):
    output.append(i+j)
print(output)
# Output: [11, 22, 33, 44, 55, 66]
Return list of elements by adding numbers of two lists based on their indexes - Functional Programming
l1 = [1, 2, 3, 4, 5, 6]
l2 = [10, 20, 30, 40, 50, 60]
def sum_elements(a, b):
    return a + b
output = map(sum_elements, l1, l2)
print(output)
# Output: [11, 22, 33, 44, 55, 66]
Let's test the map with different types of inputs
how to use multiple arguments with "map" python?
def function(*args):
    return args
output = map(function, [1,2,3], ("a", "b", "c"), (1.2, 2.3, 3.4))
print(output)
# Output: [(1, 'a', 1.2), (2, 'b', 2.3), (3, 'c', 3.4)]
how to use multiple arguments of varying length with "map" with python?
def function(*args):
    return args
output = map(function, [1,2,3], ("a", "b", "c", "d", "e", "f"), (1.2, 2.3, 3.4))
print(output)
# Output: [(1, 'a', 1.2), (2, 'b', 2.3), (3, 'c', 3.4), (None, 'd', None), (None, 'e', None), (None, 'f', None)]
use "None" instead of function with "map" in python
output = map(None, [1,2,3], ("a", "b", "c"), (1.2, 2.3, 3.4))
print(output)
# Output: [(1, 'a', 1.2), (2, 'b', 2.3), (3, 'c', 3.4)]
use other data types like "list" instead of function with "map" in python
output = map([], [1,2,3], ("a", "b", "c"), (1.2, 2.3, 3.4))
# Output: TypeError: 'list' object is not callable
How to use "reduce" builtin function in python?
Nov. 15, 2019, 10:39 p.m.
210how to use "reduce" builtin function in python?
- reduce is a built-in function in the Python 2 module "__builtin__".
- It takes a function as its first argument and a sequence of items as its second argument.
- It applies the function to two arguments cumulatively, working through the items of the sequence from left to right, and returns a single value.
- If an initial value is present, it is placed before the items of the sequence in the calculation, and serves as a default when the sequence is empty.
- It doesn't allow an empty sequence unless an initial value is given.
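One caveat for Python 3, where reduce is no longer a builtin: import it from functools first. A minimal sketch:
# Python 3: reduce moved from the builtins into functools
from functools import reduce

print(reduce(lambda a, b: a + b, [1, 2, 3, 5, 6]))   # 17
print(reduce(lambda a, b: a + b, [], 0))             # 0 (the initial value covers the empty sequence)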
Let's see examples
Let's sum all the elements in a list
traditional way
l = [1, 2, 3, 5, 6]
sum_of_numbers = 0
for num in l:
    sum_of_numbers += num
print(sum_of_numbers)
# Output: 17
Let's do it using "reduce"
l = [1, 2, 3, 5, 6]
def sum_numbers(a, b):
    return a + b
sum_of_numbers = reduce(sum_numbers, l)
print(sum_of_numbers)
# Output: 17
Let's find factorial of number 5
traditional way
num = 5
result = 1
for i in range(1, num+1):
    result = result * i
print("Factorial of 5 = %s" % (result))
# Output: 120
Let's do it using "reduce"
num = 5
result = reduce(lambda x, y : x*y, range(1, num+1))
print("Factorial of 5 = %s" % (result))
# Output: 120
Let's find the max number in a list of numbers using "reduce" and "max"
l = [3, 1, 5, 10, 7, 6]
max_num = reduce(max, l)
print("max num = %s" % (max_num))
# Output: max num = 10
How to use "filter" builtin function in python?
Nov. 15, 2019, 10:32 p.m.
203how to use "filter" builtin function in python?
- "filter" is a python's built-in function which can be found in module "__builtin__".
- It takes two arguments, first argument as a function and second argument as sequence of objects/elements.
- It passes all objects/elements to given function one after other.
- If function returns "True" then it appends the object/element to a list & returns the list after passing all elements.
- If sequence is a tuple or string, it returns the same type, else return a list.
- If function is None, return the items that are true.
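As with map and reduce, Python 3 changed filter: it always returns a lazy iterator, whatever the input type. A minimal sketch:
# Python 3: filter returns a lazy iterator regardless of the input type
evens = filter(lambda n: n % 2 == 0, range(10))
print(evens)          # <filter object at 0x...>
print(list(evens))    # [0, 2, 4, 6, 8]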
Let's see an example for "filter"
Q. Find out the all prime numbers below hundred ?
1. Traditional way
num = 100
primes = []
for i in range(2, num):
    for j in range(2, i):
        if i % j == 0:
            break
    else:
        primes.append(i)
print(primes)
# Output: [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97]
2. Using function "filter"
def is_prime(num):
    if num < 2:  # 0 and 1 are not prime
        return False
    for j in range(2, num):
        if num % j == 0:
            return False
    return True
primes = filter(is_prime, range(1, 100))
print(primes)
# Output: [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97]
print(type(primes))
# Output: list
Let's test filter by passing "None" as the first argument
a = (0, 1, 2, 3)
l = filter(None, a)
print(l)
# Output: (1, 2, 3)
print(type(l))
# Output: tuple
filter(None, seq) evaluates each element for truthiness; if the element is truthy, it is added to the resulting list/tuple/string.
Let's apply "filter" on strings.
Remove vowels from string in python.
def remove_vowels(char):
    return char.lower() not in ['a', 'e', 'i', 'o', 'u']
s = filter(remove_vowels, "this is anjaneyulu batta")
print(s)
# Output: ths s njnyl btt
Usage of "datetime" from datetime module with use cases
Nov. 15, 2019, 10:12 p.m.
240Usage of "datetime" from datetime module with use cases
- datetime can be found in the module datetime.
- datetime is Python's representation of date and time in a single object.
how to create datetime object in python ?
from datetime import datetime
time_date = datetime(
year=2017, month=3, day=2, hour=14, minute=0, second=0, microsecond=0, tzinfo=None
)
print(type(time_date))
# Output: 'datetime.datetime'
print(time_date)
# Output: 2017-03-02 14:00:00
How to get date from datetime object ?
from datetime import datetime
time_date = datetime(
year=2017, month=3, day=2, hour=14, minute=0, second=0, microsecond=0, tzinfo=None
)
date = time_date.date()
print(type(date))
# Output: datetime.date
print(date)
# Output: 2017-03-02
How to get current datetime object in python ?
from datetime import datetime
current_datetime = datetime.now()
print(current_datetime)
# Output: 2017-03-07 21:31:59.720195
How to get year, month, day, hour, minute, second in python?
from datetime import datetime
current_datetime = datetime.now()
year = current_datetime.year
month = current_datetime.month
day = current_datetime.day
hour = current_datetime.hour
minute = current_datetime.minute
second = current_datetime.second
microsecond = current_datetime.microsecond
print("year= %s, month=%s, day=%s, hour=%s, minute=%s, second=%s, microsecond=%s" % (year, month, day, hour, minute, second, microsecond))
# output: year= 2017, month=3, day=7, hour=21, minute=41, second=50, microsecond=781903
How to convert a datetime object to a string with a specific format in python (datetime.strftime)?
"2017-02-04 12:02:33"
from datetime import datetime
date = datetime(2017, 2, 4, 12, 2, 33)
s = datetime.strftime(date, "%Y-%m-%d %H:%M:%S")
print(s)
#Output: '2017-02-04 12:02:33'
"2017/02/04 12:02:33"
from datetime import datetime
date = datetime(2017, 2, 4, 12, 2, 33)
s = datetime.strftime(date, "%Y/%m/%d %H:%M:%S")
print(s)
#Output: 2017/02/04 12:02:33
"04/02/2017 12:02:33"
from datetime import datetime
date = datetime(2017, 2, 4, 12, 2, 33)
s = datetime.strftime(date, "%d/%m/%Y %H:%M:%S")
print(s)
# Output: '04/02/2017 12:02:33'
"04 February 2017 12:02:33 PM"
from datetime import datetime
date = datetime(2017, 2, 4, 12, 2, 33)
s = datetime.strftime(date, "%d %B %Y %H:%M:%S %p")
print(s)
# Output: 04 February 2017 12:02:33 PM
"Saturday February 2017 12:02:33 PM"
from datetime import datetime
date = datetime(2017, 2, 4, 12, 2, 33)
s = datetime.strftime(date, "%A %B %Y %H:%M:%S %p")
print(s)
# Output: Saturday February 2017 12:02:33 PM
"Sat February 2017 12:02:33 pm"
from datetime import datetime
date = datetime(2017, 2, 4, 12, 2, 33)
s = datetime.strftime(date, "%a %B %Y %H:%M:%S %P")
print(s)
# Output: Sat February 2017 12:02:33 pm
# (%P, lowercase am/pm, is a platform-specific glibc extension; the portable documented directive is %p)
How to convert a string to a datetime object in python (datetime.strptime)?
"2017-02-04 12:02:33"
from datetime import datetime
s = "2017-02-04 12:02:33"
d = datetime.strptime(s, "%Y-%m-%d %H:%M:%S")
print(d)
# Output: datetime.datetime(2017, 2, 4, 12, 2, 33)
"2017/02/04 12:02:33"
from datetime import datetime
s = "2017/02/04 12:02:33"
d = datetime.strptime(s, "%Y/%m/%d %H:%M:%S")
print(d)
# Output: datetime.datetime(2017, 2, 4, 12, 2, 33)
"04/02/2017 12:02:33"
from datetime import datetime
s = "04/02/2017 12:02:33"
d = datetime.strptime(s, "%d/%m/%Y %H:%M:%S")
print(d)
# Output: datetime.datetime(2017, 2, 4, 12, 2, 33)
"04 February 2017 12:02:33 PM"
from datetime import datetime
s = "04 February 2017 12:02:33 PM"
d = datetime.strptime(s, "%d %B %Y %H:%M:%S %p")
print(d)
# Output: datetime.datetime(2017, 2, 4, 12, 2, 33)
"04 February 2017 12:02:33 PM"
from datetime import datetime
s = "04 February 2017 12:02:33 PM"
d = datetime.strptime(s, "%d %B %Y %H:%M:%S %p")
print(d)
# Output: datetime.datetime(2017, 2, 4, 12, 2, 33)
"Saturday 04 February 2017 12:02:33 PM"
from datetime import datetime
s = "Saturday 04 February 2017 12:02:33 PM"
d = datetime.strptime(s, "%A %d %B %Y %H:%M:%S %p")
print(d)
# Output: datetime.datetime(2017, 2, 4, 12, 2, 33)
"Sat February 2017 12:02:33 pm"
from datetime import datetime
s = "Sat 04 February 2017 12:02:33 pm"
d = datetime.strptime(s, "%a %d %B %Y %H:%M:%S %p")
print(d)
# Output: datetime.datetime(2017, 2, 4, 12, 2, 33)
Usage of "date" from datetime module with use cases
Nov. 13, 2019, 11:48 p.m.
331Usage of "date" from datetime module with use cases
- Date is a reference to a particular day represented within a calendar system.
- We use "date" class to represent a day in calendar in python
- We import date from "datetime" module
How to represent "date" in python ?
from datetime import date
d = date(year=2017, month=4, day=3)
print(d)
# Output: 2017-04-03
How to get current date in python ?
from datetime import date
d = date.today()
print(d)
# Output: we get today's date
How to get year, month, day from a date object ?
from datetime import date
d = date.today()
year = d.year
month = d.month
day = d.day
print("year = %s, month = %s, day = %s" % (year, month, day))
# Output: year = 2017, month = 3, day = 12
How to replace year, month, day in date object and get new date object ?
from datetime import date
d = date(year=2017, month=4, day=3)
# replace year with 2015
old_date = d.replace(year=2015)
print(old_date)
# Output: 2015-04-03
print(d)
# Output: 2017-04-03
# in the same way we can replace month, day
How to convert/parse a string to a date object in python ?
'2017-04-03' >>> "%Y-%m-%d"
from datetime import datetime
s = '2017-04-03'
date_object = datetime.strptime(s, "%Y-%m-%d").date()
print(date_object)
# Output: 2017-04-03
'Monday 03 April 2017' >>> "%A %d %B %Y"
from datetime import datetime
s = 'Monday 03 April 2017'
date_object = datetime.strptime(s, "%A %d %B %Y").date()
print(date_object)
# Output: 2017-04-03
'Mon 03 Apr 2017' >>> "%a %d %b %Y"
from datetime import datetime
s = 'Mon 03 Apr 2017'
date_object = datetime.strptime(s, "%a %d %b %Y").date()
print(date_object)
# Output: 2017-04-03
How to convert a date object to string format in python?
from datetime import date
d = date(2017, 4, 3)
date_string = d.strftime("%Y-%m-%d")
# output: '2017-04-03'
date_string = d.strftime("%A %d %B %Y")
# output: 'Monday 03 April 2017'
date_string = d.strftime("%a %d %b %Y")
# output: 'Mon 03 Apr 2017'
How to convert date object to "datetime" object ?
from datetime import datetime, date
d = date(2017, 4, 3)
datetime_object = datetime(d.year, d.month, d.day)
print(datetime_object)
# output: 2017-04-03 00:00:00
print(type(datetime_object))
# output: datetime.datetime
How to compare two date objects ?
from datetime import date
date1 = date(2015, 12, 8)
date2 = date(2017, 11, 9)
# check if date1 > date2
is_date1_greater = date1 > date2
print(is_date1_greater)
# output: False
is_date1_greater = date2 > date1
print(is_date1_greater)
# output: True
How to get number of days between two dates ?
from datetime import date
date1 = date(2015, 12, 8)
date2 = date(2017, 11, 9)
time_delta = date1 - date2
print("days = %s" % (time_delta.days))
# output: days = -702
time_delta = date2 - date1
print("days = %s" % (time_delta.days))
# output: days = 702
How to get week day from date object ?
Return the day of the week represented by the date. Monday == 0, Tuesday == 1, ... , Sunday == 6
from datetime import date
d = date(2015, 12, 8)
print(d.strftime("%A"))
# Output: Tuesday
day_of_week = d.weekday()
print(day_of_week)
# Output: 1
How to get ISO week day from date object ?
Return the day of the week represented by the date. Monday == 1, Tuesday == 2, ... , Sunday == 7
from datetime import date
d = date(2015, 12, 8)
print(d.strftime("%A"))
# Output: Tuesday
day_of_week = d.isoweekday()
print(day_of_week)
# Output: 2